bioinfologics / w2rap-contigger

An Illumina PE genome contig assembler that can handle large (17 Gbp), complex (hexaploid) genomes.
http://bioinfologics.github.io/the-w2rap-contigger/
MIT License

Not enough memory #27

Closed. jcgrenier closed this issue 5 years ago.

jcgrenier commented 7 years ago

Hello there,

Thanks for providing this tool. Is there any way to estimate how much memory we would need to do the assembly for files of a particular size? I'm playing with two paired-end samples that were generated on an X10 machine. Each fastq.gz file is less than 40G; unzipped, they are about 160G each. So far I tried a fat node with 512G of memory, and it crashed every time at the second step:

Performing re-exec to adjust stack size.

Tue May 02 07:52:25 2017 run on cp0302, pid=8127 [Apr 13 2017 11:04:02 R52488 ]
DiscovarDeNovo READS="sample:M008 :: \
HGTGYCCXX_8_160403_FR07921224_Other__R_151123_JEFWAL_M008_R{1,2}.fastq" \
OUT_DIR=Discovar_Denovo NUM_THREADS=48 MAX_MEM_GB=500

SYSTEM INFO

Omitting memory check. If you run into problems with memory, you might try rerunning with MEMORY_CHECK=True.

Tue May 02 07:52:25 2017: finding input files
Tue May 02 07:52:25 2017: reading 2 files (which may take a while)

INPUT FILES:
[1a,type=frag,sample=M008,lib=1,frac=1] M008_R1.fastq
[1b,type=frag,sample=M008,lib=1,frac=1] M008_R2.fastq

Tue May 02 12:38:11 2017: found 1 samples
Tue May 02 12:38:11 2017: starts = 0
Tue May 02 13:37:30 2017: using 964,997,086 reads
Tue May 02 13:37:31 2017: data extraction complete, peak mem = 375.88 GB
5.75 hours used extracting reads
Tue May 02 13:37:46 2017: see total physical memory of 541,975,564,288 bytes
Tue May 02 13:37:46 2017: see user-imposed limit on memory of 536,870,912,000 bytes
Tue May 02 13:37:46 2017: 3.74 bytes per read base, assuming max memory available
We need 46 passes. Expect 1343834 keys per batch. Provide 1517886 keys per batch.
There were 21 buffer overflows.

Fatal error (pid=8127) at Tue May 02 18:25:36 2017: Insufficient memory.

Tue May 02 18:25:36 2017. Abort. Stopping.

Generating a backtrace...

Dump of stack:

  1. CRD::exit(int), in Exit.cc:30
  2. run, in MapReduceEngine.h:408
  3. (...), in BuildReadQGraph.cc:179
  4. buildReadQGraph(...), in BuildReadQGraph.cc:1311
  5. GapToyCore(int, char**), in GapToyCore.cc:584
  6. main, in DiscovarDeNovo.cc:43

I haven't tried the trimmed files so far, but I guess they won't work with my current settings either.

Another question: is there a way to combine two samples? The only approach I've thought of so far is concatenating the fastq files, but that could create some issues with the library characteristics, right?
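(For reference, gzip streams can be concatenated directly; a minimal sketch with placeholder file names, keeping R1 and R2 in the same sample order so read pairing is preserved:)

    # concatenate two samples; keep R1 and R2 in the same sample order
    cat sampleA_R1.fastq.gz sampleB_R1.fastq.gz > combined_R1.fastq.gz
    cat sampleA_R2.fastq.gz sampleB_R2.fastq.gz > combined_R2.fastq.gz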

Thanks a lot for your help

JC

ljyanesm commented 7 years ago

Hi! That looks like a DISCOVAR log. Could you share which version of w2rap you're using and how you're running it? If you're running DISCOVAR, I'm afraid we won't be able to help you, but you can always give the w2rap-contigger a go.

jcgrenier commented 7 years ago

Ah yes, that's right. I've tried so many different things with my dataset so far. Unfortunately I can't find that log anymore; I would need to rerun it. I know that the first step went well and ran for about 3 hours loading the dataset. Then it crashed at the k-mer step because of memory.

I will regenerate it, but if you can estimate the amount of memory needed for my files (which are not PCR-free libraries, by the way, and are 2x150bp), it would be really helpful!

Thanks a lot.

JC

bjclavijo commented 7 years ago

We can't really guesstimate, but as a useful hint, disk batches will decrease the memory needed for step 2. Usually 16 disk batches will be fine.

ljyanesm commented 7 years ago

Hi,

Have a look at using the "-d" flag for step 2, which should reduce the amount of memory needed. It will count the kmers in batches, using the disk to store temporary hashes.

Try with 16 batches and if it still fails increase that number.

You can run from step 2 onwards by using --from_step 2 so it doesn't repeat step 1.
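For example, a run resuming at step 2 with 16 disk batches might look like this (a sketch: read paths and the output prefix are placeholders; the flags mirror the full command shown later in this thread):

    w2rap-contigger -t 48 -m 500 \
        -r reads_R1.fastq.gz,reads_R2.fastq.gz \
        -o contigs -p my_assembly \
        -d 16 --dump_all 1 --from_step 2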

Best,

jcgrenier commented 7 years ago

Hi @ljyanesm,

Where can I recover the temporary files coming from step 1? I was running it on a compute node, but it crashed. Were they kept in memory? Can I save them to a temporary folder? Thanks for your help. JC

ljyanesm commented 7 years ago

The files should be in the output directory; they are named pe_data.fastb and pe_data.cqual if the --dump_all flag was used.

EDIT: the --dump_all flag is required to get the intermediate files.
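(A quick sanity check before resuming, assuming the output directory was named contigs as in the commands in this thread:)

    ls -lh contigs/pe_data.fastb contigs/pe_data.cqual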

jcgrenier commented 7 years ago

Awesome, thanks @ljyanesm !

jcgrenier commented 7 years ago

Hello @ljyanesm,

Sorry to reopen this topic, but it seems I'm running into the same problem again.

I'll ask the question first and then paste the log of my run. I ran w2rap-contigger step by step, keeping all the temporary files, to make sure I could resume in case it crashed.

I was running it with 16 disk batches (-d 16), but that turned out not to be enough. It ran well for one sample, but now I'm merging two samples together, so I have twice the number of reads; I therefore specified -d 30. It seemed to run well but crashed close to the end of step 2, at the merging step.

Here's the log:

~/Programs/w2rap-contigger/bin/w2rap-contigger -t 48 -m 500 -r M007-M008_R1.trimmed.fq.gz,M007-M008_R2.trimmed.fq.gz -o contigs -p m_k200_trimmed -d 30 --dump_all 1 --from_step 2 --to_step 2

Welcome to w2rap-contigger
WARNING: you are running the code with omp_proc_bind_false, parallel performance may suffer
Loading reads in fastb/qualp format... DONE!
--== Step 2: Building first (small K) graph ==--
Tue May 30 03:01:17 2017: creating kmers from reads...
Tue May 30 03:01:17 2017: disk-based kmer counting with 30 batches
Tue May 30 04:45:06 2017: batch 0 done and dumped with 2703840497 kmers
Tue May 30 06:11:17 2017: batch 1 done and dumped with 2675797008 kmers
Tue May 30 07:31:03 2017: batch 2 done and dumped with 2720373598 kmers
Tue May 30 08:03:30 2017: batch 3 done and dumped with 2872143701 kmers
Tue May 30 09:19:06 2017: batch 4 done and dumped with 2489537761 kmers
Tue May 30 09:45:26 2017: batch 5 done and dumped with 2477138396 kmers
Tue May 30 10:25:39 2017: batch 6 done and dumped with 2506687754 kmers
Tue May 30 11:43:34 2017: batch 7 done and dumped with 2588948991 kmers
Tue May 30 13:07:47 2017: batch 8 done and dumped with 2582684623 kmers
Tue May 30 14:31:09 2017: batch 9 done and dumped with 2641273776 kmers
Tue May 30 15:59:46 2017: batch 10 done and dumped with 2768253865 kmers
Tue May 30 16:26:02 2017: batch 11 done and dumped with 2649639881 kmers
Tue May 30 16:50:44 2017: batch 12 done and dumped with 2563868736 kmers
Tue May 30 18:12:00 2017: batch 13 done and dumped with 2946466737 kmers
Tue May 30 19:15:46 2017: batch 14 done and dumped with 3654065673 kmers
Tue May 30 20:37:30 2017: batch 15 done and dumped with 2929339080 kmers
Tue May 30 21:50:28 2017: batch 16 done and dumped with 2862097653 kmers
Tue May 30 23:09:32 2017: batch 17 done and dumped with 2867445244 kmers
Wed May 31 00:28:52 2017: batch 18 done and dumped with 2904585817 kmers
Wed May 31 00:54:30 2017: batch 19 done and dumped with 2919453621 kmers
Wed May 31 02:13:44 2017: batch 20 done and dumped with 2841856618 kmers
Wed May 31 03:32:56 2017: batch 21 done and dumped with 2864796621 kmers
Wed May 31 04:52:31 2017: batch 22 done and dumped with 2901847126 kmers
Wed May 31 06:22:51 2017: batch 23 done and dumped with 3261754449 kmers
Wed May 31 07:49:59 2017: batch 24 done and dumped with 3205541894 kmers
Wed May 31 09:14:50 2017: batch 25 done and dumped with 3067587988 kmers
Wed May 31 10:00:45 2017: batch 26 done and dumped with 2955175947 kmers
Wed May 31 11:25:03 2017: batch 27 done and dumped with 2819848661 kmers
Wed May 31 12:49:44 2017: batch 28 done and dumped with 2849583374 kmers
Wed May 31 14:14:36 2017: batch 29 done and dumped with 2881152158 kmers
Wed May 31 14:14:37 2017: merging from disk
terminate called after throwing an instance of 'std::bad_alloc'
  what(): std::bad_alloc
w2rap-contigger.M008-007.sh: line 1: 31410 Aborted (core dumped) ~/Programs/w2rap-contigger/bin/w2rap-contigger -t 48 -m 500 -r M007-M008_R1.trimmed.fq.gz,M007-M008_R2.trimmed.fq.gz -o contigs -p meerkat_k200_trimmed -d 30 --dump_all 1 --from_step 2 --to_step 2

I was working on a 512GB fat node, but the process went up to 521GB, so the node killed it even though I specified -m 500. Will it stay within the limit if I specify -m 450, for example? Or will it go over that too?

And another question: is it possible to restart the process from the merging step?

Thanks for your help. JC

bjclavijo commented 7 years ago

Hi JC, if you're adding two samples together you should expect:

1. More true kmers (from all the regions that are not identical at K=60; you can check that with KAT if you want). This will increase the size of each individual batch and the size of the final combined file. If you check your batch files' sizes you will notice that, even though the amount of processed reads is more or less the same as in the single-sample run with half the data and half the batches, the batch files themselves will be bigger.

2. In regions where similarity is high enough to produce the same kmers, it will also be high enough to double the chances of the errors generated from that portion passing the frequency filter. This means more error kmers will pass the filter.

1 and 2 will combine to make the number of kmers in the K=60 graph larger than in the single-sample case. This is what is using more memory. You can try to stop errors from creeping in by increasing the min_freq (with a possible trade-off on contiguity later on), but you can't do anything about truly different content between samples. Again, using KAT comp at K=60 (or even at a smaller K to start with) will help you understand how much new content will come from using two samples rather than one.
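(A minimal sketch of that comparison; sample file names and thread count are placeholders, and exact flag spellings may vary between KAT versions:)

    # compare the k-mer spectra of the two samples; -m sets K (start small, e.g. 27)
    kat comp -t 16 -m 27 -o M007_vs_M008 'M007_R?.fastq.gz' 'M008_R?.fastq.gz'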

Disregard the '-m' option; it does nothing on this step as of now.

Best,

bj


jcgrenier commented 7 years ago

Hi @bjclavijo,

So for now, increasing the min_freq is the only way to reduce the memory usage at step 2, right? It will then include fewer low-frequency kmers in the analysis?

Thanks for your help and for responding so quickly.

JC

bjclavijo commented 7 years ago

Yes, but do analyse what you want to get out of this. Putting multiple samples together in the w2rap-contigger does sound like a strange thing to do; if you really need to do that, do a KAT spectra comparison first and know what you're getting into.

Best,

bj


jcgrenier commented 7 years ago

In reality, it is the same individual, just sequenced over two lanes. So in this case it's probably OK to proceed like this, increasing the representation, I guess.

Thanks!

bjclavijo commented 7 years ago

In that case, if it worked before, you shouldn't expect too many new kmers coming in (except errors); just increase the min_freq parameter. If it worked before, you must be very close to being OK on memory at this stage (later stages will need to load reads and paths, and that will be more or less linear vs. the size of the read files).

Best,

bj
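(A hedged sketch of that rerun, reusing the command from earlier in the thread; the --min_freq spelling and value are assumptions, so check the contigger's help output before running:)

    ~/Programs/w2rap-contigger/bin/w2rap-contigger -t 48 \
        -r M007-M008_R1.trimmed.fq.gz,M007-M008_R2.trimmed.fq.gz \
        -o contigs -p m_k200_trimmed \
        -d 30 --dump_all 1 --from_step 2 --to_step 2 \
        --min_freq 5    # assumed flag name; raise the threshold to filter more error kmers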


shamshad1987 commented 5 years ago

Hello all, is there any pause option in this program? For example, if we are running it on a supercomputer where any job can run for only 24 hours, could it be paused and resumed from where it stopped? Thanks

jonwright99 commented 5 years ago

You can run one step at a time with the --from_step and --to_step options. Individual steps may still take more than 24 hours if your genome is large.
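For example, one step per scheduler job might look like this (a sketch: the read paths, thread count, and output prefix are placeholders, and --dump_all is needed so each step's output persists between jobs):

    #!/bin/bash
    # run a single contigger step, passed as $1, inside one 24-hour job
    STEP=$1
    w2rap-contigger -t 48 -r reads_R1.fq.gz,reads_R2.fq.gz -o contigs -p asm \
        --dump_all 1 --from_step "$STEP" --to_step "$STEP"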

bjclavijo commented 5 years ago

Closing this as it seems to be solved.