Error with encoding reference genome

standage commented 5 years ago

I'm trying out Arioc for the first time and I'm having trouble encoding the reference genome. Running AriocE v1.25.2401.18201, I get the following output during the nongapped encoding.

165230676 [000076b9] AriocE v1.25.2401.18201 (release)
165230677 [000076b9] Copyright (c) 2015-2018 Johns Hopkins University.  All rights reserved.
165230677 [000076b9]  data type sizes      : int=4 long=8 *=8
165230677 [000076b9]  executable file      : 
165230679 [000076b9]  configuration file : /path/to/my/projdir/config/encode-refr-nongapped.xml
165230691 [000076b9] encodeR: encoding 43 files (86 sequences) (8 CPU threads available)...
170714855 [000076b9] encodeR: encoded 43 files (86 sequences)
172715421 [000076b9] computeJlistSizes:  log2(nJ)  # J-lists
172715422 [000076b9] computeJlistSizes:         0      10713
172715422 [000076b9] computeJlistSizes:         1     341060
172715422 [000076b9] computeJlistSizes:         2   17427473
172715422 [000076b9] computeJlistSizes:         3  365417880
172715422 [000076b9] computeJlistSizes:         4  560730395
172715422 [000076b9] computeJlistSizes:         5   93906187
172715422 [000076b9] computeJlistSizes:         6   22111708
172715422 [000076b9] computeJlistSizes:         7    7880260
172715422 [000076b9] computeJlistSizes:         8    3306299
172715422 [000076b9] computeJlistSizes:         9    1535308
172715422 [000076b9] computeJlistSizes:        10     705079
172715422 [000076b9] computeJlistSizes:        11     246196
172715422 [000076b9] computeJlistSizes:        12      87600
172715422 [000076b9] computeJlistSizes:        13      28644
172715422 [000076b9] computeJlistSizes:        14       5676
172715422 [000076b9] computeJlistSizes:        15        511
172715422 [000076b9] computeJlistSizes:        16          9
172723704 [000076b9] ApplicationException ([0x30393] AriocE/AriocE.R.cpp 408): computeJlistSizes: too many J values (28464018949) for J table (maximum = 8589934591)

The manual discusses the performance benefits of higher vs lower choices of the J parameter, but it is unclear to me whether this exception is caused by my parameter selection or by a problem in the software. Please advise.
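As a rough cross-check of the log above, the histogram alone can bound the total number of J values. The sketch below assumes that a bucket labeled log2(nJ) = k holds J-lists with between 2**k and 2**(k+1) - 1 entries; that interpretation is an assumption and is not confirmed by the log or the manual. The constant and function names are illustrative only and are not part of Arioc.

```python
# Histogram copied from the computeJlistSizes log lines:
# list index k = log2(nJ) bucket, value = number of J-lists in that bucket.
J_LIST_COUNTS = [
    10713, 341060, 17427473, 365417880, 560730395, 93906187,
    22111708, 7880260, 3306299, 1535308, 705079, 246196,
    87600, 28644, 5676, 511, 9,
]

REPORTED_TOTAL = 28464018949  # "too many J values (...)" in the exception
J_TABLE_MAX = 8589934591      # "maximum = ..." in the exception


def j_value_bounds(counts):
    """Return (low, high) bounds on the total number of J values.

    Assumption: bucket k holds lists with 2**k <= nJ <= 2**(k+1) - 1.
    """
    low = sum(n * (1 << k) for k, n in enumerate(counts))
    high = sum(n * ((1 << (k + 1)) - 1) for k, n in enumerate(counts))
    return low, high


low, high = j_value_bounds(J_LIST_COUNTS)
print("lower bound:", low)
print("upper bound:", high)
print("reported total within bounds:", low <= REPORTED_TOTAL <= high)
print("lower bound already exceeds J table maximum:", low > J_TABLE_MAX)
```

Under that assumption, even the most conservative reading of the histogram exceeds the J table maximum, which would suggest the exception reflects the size of the reference rather than a miscount in the summary.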

This is the configuration file I used.

<?xml version="1.0" encoding="utf-8"?>

<AriocE seed="ssi84_2_30" maxDOP="8" maxJ="*">
    <dataIn sequenceType="R" srcId="0" filePath="/path/to/my/projdir/refr" uriPath="https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Assemblies/v1.0/iwgsc_refseqv1.0_all_chromosomes.zip">
        <file subId="1">Taes_chr1A_part1.fasta</file>
        <file subId="2">Taes_chr1A_part2.fasta</file>
        <file subId="3">Taes_chr1B_part1.fasta</file>
        <file subId="4">Taes_chr1B_part2.fasta</file>
        <file subId="5">Taes_chr1D_part1.fasta</file>
        <file subId="6">Taes_chr1D_part2.fasta</file>
        <file subId="7">Taes_chr2A_part1.fasta</file>
        <file subId="8">Taes_chr2A_part2.fasta</file>
        <file subId="9">Taes_chr2B_part1.fasta</file>
        <file subId="10">Taes_chr2B_part2.fasta</file>
        <file subId="11">Taes_chr2D_part1.fasta</file>
        <file subId="12">Taes_chr2D_part2.fasta</file>
        <file subId="13">Taes_chr3A_part1.fasta</file>
        <file subId="14">Taes_chr3A_part2.fasta</file>
        <file subId="15">Taes_chr3B_part1.fasta</file>
        <file subId="16">Taes_chr3B_part2.fasta</file>
        <file subId="17">Taes_chr3D_part1.fasta</file>
        <file subId="18">Taes_chr3D_part2.fasta</file>
        <file subId="19">Taes_chr4A_part1.fasta</file>
        <file subId="20">Taes_chr4A_part2.fasta</file>
        <file subId="21">Taes_chr4B_part1.fasta</file>
        <file subId="22">Taes_chr4B_part2.fasta</file>
        <file subId="23">Taes_chr4D_part1.fasta</file>
        <file subId="24">Taes_chr4D_part2.fasta</file>
        <file subId="25">Taes_chr5A_part1.fasta</file>
        <file subId="26">Taes_chr5A_part2.fasta</file>
        <file subId="27">Taes_chr5B_part1.fasta</file>
        <file subId="28">Taes_chr5B_part2.fasta</file>
        <file subId="29">Taes_chr5D_part1.fasta</file>
        <file subId="30">Taes_chr5D_part2.fasta</file>
        <file subId="31">Taes_chr6A_part1.fasta</file>
        <file subId="32">Taes_chr6A_part2.fasta</file>
        <file subId="33">Taes_chr6B_part1.fasta</file>
        <file subId="34">Taes_chr6B_part2.fasta</file>
        <file subId="35">Taes_chr6D_part1.fasta</file>
        <file subId="36">Taes_chr6D_part2.fasta</file>
        <file subId="37">Taes_chr7A_part1.fasta</file>
        <file subId="38">Taes_chr7A_part2.fasta</file>
        <file subId="39">Taes_chr7B_part1.fasta</file>
        <file subId="40">Taes_chr7B_part2.fasta</file>
        <file subId="41">Taes_chr7D_part1.fasta</file>
        <file subId="42">Taes_chr7D_part2.fasta</file>
        <file subId="43">Taes_chrUn.fasta</file>
    </dataIn>
    <dataOut>
        <path>/path/to/my/projdir/refr-encoded</path>
    </dataOut>
</AriocE>
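The 43 file entries above follow a regular pattern (chromosomes 1 to 7, subgenomes A, B and D, two parts each, plus Taes_chrUn.fasta), so the dataIn element could be generated rather than typed by hand. This is a minimal sketch using only the Python standard library. The element and attribute names are copied from the configuration above, the uriPath attribute is omitted for brevity, and build_data_in is a hypothetical helper, not part of Arioc.

```python
import xml.etree.ElementTree as ET


def build_data_in(file_path):
    """Build a <dataIn> element listing the 43 wheat FASTA files."""
    data_in = ET.Element(
        "dataIn", sequenceType="R", srcId="0", filePath=file_path
    )
    # Same order as the posted config: chr1A part1, chr1A part2, chr1B part1, ...
    names = [
        f"Taes_chr{number}{subgenome}_part{part}.fasta"
        for number in range(1, 8)
        for subgenome in "ABD"
        for part in (1, 2)
    ]
    names.append("Taes_chrUn.fasta")
    # subId values are numbered from 1, as in the posted config.
    for sub_id, name in enumerate(names, start=1):
        ET.SubElement(data_in, "file", subId=str(sub_id)).text = name
    return data_in


data_in = build_data_in("/path/to/my/projdir/refr")
ET.indent(data_in, space="    ")  # ET.indent requires Python 3.9 or later
print(ET.tostring(data_in, encoding="unicode"))
```

The printed element can be pasted inside the AriocE root element in place of the hand-written list.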

RWilton commented 5 years ago

Thank you for letting us know about this problem. We will try to reproduce it and let you know what we see.

RWilton commented 5 years ago

I couldn't figure out the provenance of the Taes files in your .cfg file, but I was able to reproduce the exception in AriocE using the individual "chromosome" files (iwgscrefseqv1.0.fsa) downloaded directly from the INRA website.

I now need to do some troubleshooting. I am not certain if the problem is related to the overall size of the wheat genome, to faulty logic in AriocE, or both. I'll let you know when I have an answer.

Thanks!

standage commented 5 years ago

Thanks for your response!

RWilton commented 5 years ago

Ok, I took a look at that exception: It's thrown when the number of reference-genome loci (i.e., the total number of bases in the double-stranded reference sequence) exceeds 2**33, or 8 gigabases. The limit is somewhat arbitrary; it exists because there's an entry in the runtime hash tables (lookup tables, or LUTs) for each locus, so those LUTs consume a lot of memory. For this genome, AriocE counts over 28 billion bases (about 26 gigabases, counting a gigabase as 2**30 bases). At run time, the LUTs would occupy something like 475-500GB of memory.
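The numbers in this explanation can be checked against the two values in the exception message. The sketch below is plain arithmetic on those values. The variable names are illustrative, and treating a gigabase as 2**30 bases is an assumption chosen because it makes the 8-gigabase figure match 2**33.

```python
J_TABLE_MAX = 8589934591  # "maximum = ..." from the exception message
J_VALUES = 28464018949    # "too many J values (...)" from the exception message

# The reported maximum is the largest value that fits in 33 bits.
assert J_TABLE_MAX == 2**33 - 1

binary_gigabases = J_VALUES / 2**30  # assumes 1 gigabase = 2**30 bases
ratio_to_limit = J_VALUES / 2**33    # how far the genome overshoots the limit
bits_needed = J_VALUES.bit_length()  # bits needed to hold the J value count

print(f"{binary_gigabases:.1f} gigabases")       # prints "26.5 gigabases"
print(f"{ratio_to_limit:.2f}x the 2**33 limit")  # prints "3.31x the 2**33 limit"
print(f"{bits_needed} bits needed")              # prints "35 bits needed"
```

In other words, the count for this genome is a little more than three times the limit and needs 35 bits rather than 33.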

If you have access to a large-memory machine (i.e., sufficient system RAM plus at least one GPU), and you want to take a shot at it, I can relax that 8-gigabase limit and we'll see what happens when we run the aligner. Otherwise, unfortunately, I'm not going to be able to help you until I get my hands on suitable hardware.

I'm sorry I don't have a good answer for you today. But you know how quickly hardware capabilities continue to grow in this business, so ask me again in 6 months!

standage commented 5 years ago

The machine I'm testing this on has 4 NVIDIA Tesla P100s and 1TB of RAM, so I don't anticipate any issues with compute capacity.

Does the 8-gigabase limit have to be static/hard-coded, or could it be set dynamically (and therefore be user-configurable)? That would be the ideal case, but either way, yes, please let's try with a relaxed limit.

Thanks.

RWilton commented 5 years ago

The most recent Arioc release (v1.30) should support the genome in question. This release is flagged "beta" and will remain so until I can get access to hardware that has sufficient memory resources to validate performance results with this genome. If you get there before me, please let me know what you see!

standage commented 5 years ago

Great! Sorry for my delayed response. I will take a look and let you know.

standage commented 5 years ago

The job has not completed yet, but v1.30 has progressed beyond the stage at which v1.25 failed. So I'm optimistic so far.

RWilton commented 5 years ago

Thanks for the update.

I suspect that AriocE is still going to fail even though some of the compute-intensive list sorting is now offloaded to a GPU. I've been looking at a related problem -- namely, how to get the encoded lookup tables to load more efficiently in the aligner (AriocU or AriocP) -- and I think there is still some code in AriocE that won't scale properly to the larger reference genome. I'll let you know for sure as soon as I determine whether or not that's true...

RWilton commented 5 years ago

The current Arioc build (v1.30.2427) has been tested with the bread-wheat reference genome and Illumina 150bp paired-end sequencer data. It looks like memory requirements are about 375GB to encode the IWGSC v1 reference genome and about 280GB to run AriocP with that genome. We'll be doing some additional performance evaluation before we remove the "pre-release" status on this build, but so far it looks stable and delivers well over 100,000 alignments per second per GPU with that genome and 150bp paired-end reads.

standage commented 5 years ago

Great news. Thanks for the update, I'll give it a test drive on my system and let you know.

RWilton / Arioc

Error with encoding reference genome #4