dzerbino / oases

De novo transcriptome assembler for short reads
www.ebi.ac.uk/~zerbino/oases/
GNU General Public License v3.0

velvetg trying to allocate more than 2 PB of RAM! #12

Open cladobranch opened 7 years ago

cladobranch commented 7 years ago

Hello everybody,

I'm currently testing Oases on our RNA-Seq data (approx. 40-80 million paired-end Illumina reads of 150 bp length), and for some reason velvetg tries to allocate huge amounts of memory.

The error message I'm getting reads like this:

velvetg: Can't calloc 281474976710656 void*s totalling 2251799813685248 bytes: Cannot allocate memory
Traceback (most recent call last):
  File "/var/data/tools/oases/0.2.09/scripts/oases_pipeline.py", line 130, in <module>
    main()
  File "/var/data/tools/oases/0.2.09/scripts/oases_pipeline.py", line 124, in main
    singleKAssemblies(options)
  File "/var/data/tools/oases/0.2.09/scripts/oases_pipeline.py", line 52, in singleKAssemblies
    assert p.returncode == 0, "Velvetg failed at k = %i\n%s" % (k, output[0])
AssertionError: Velvetg failed at k = 21
[0.000000] Reading roadmap file /var/data/dkarmeinski/Transcriptomes/Assemblies/Oases/01_Dendronotus_orientalis_21/Roadmaps
[132.015252] 41896868 roadmaps read
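Incidentally, the requested allocation is exactly a power of two: 2^48 pointers of 8 bytes each, i.e. 2^51 bytes (2 PiB). A quick sanity check in bash:

    echo $((2**48))                  # 281474976710656 (the number of void*s)
    echo $((281474976710656 * 8))    # 2251799813685248 bytes = 2 PiB

which looks more like a corrupted size value than a genuine memory requirement.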

Do you have any ideas what could be going wrong?

Cheers, Dario

SchulzLab commented 7 years ago

Hi Dario, can you please post the call you used with the oases pipeline script here? This seems odd.

Just out of curiosity, how much memory is available on the machine where you ran Oases? It is hard to estimate the memory you need, but my guess is that you should have 200 GB to be on the safe side. However, if it is a very polymorphic sample with a large transcriptome, you may need more memory.

In general, there are two ways to reduce the memory taken by the assembler:

1) Use ORNA (https://github.com/SchulzLab/ORNA) to normalize the dataset (see the example call below). This is our own software, which works in the spirit of Diginorm (http://ged.msu.edu/papers/2012-diginorm/); it removes redundant reads from the sample, but may also slightly hurt the final assembly result. I recommend a conservative reduction with a log base of 1.7 as the parameter for running ORNA.

2) Error correction with SEECER (http://sb.cs.cmu.edu/seecer/) normally also reduces the memory taken by an assembler. It does not reduce it as much as the first suggestion, but may still be interesting for improving the quality of your assembly.
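For 1), a call could look roughly like the sketch below; the flag names are written from memory and may differ from the actual ORNA interface, so please check ORNA's help output:

    # normalize reads.fa with a conservative log base of 1.7 (flag names are assumptions)
    ORNA -input reads.fa -output reads_norm -kmer 21 -base 1.7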

Hope that helps, Marcel

cladobranch commented 7 years ago

Dear Marcel,

thank you very much for your quick answer!

This is the call I used for executing the pipeline script:

    python /var/data/tools/oases/0.2.09/scripts/oases_pipeline.py \
        -d " -fastq -shortPaired $IN/$i/R1_pd.fastq $IN/$i/R2_pd.fastq " \
        -m 21 -M 35 \
        -o $OUT/$i \
        -p " -ins_length 500 -min_trans_lgth 100 " \
        -c --merge=27

I also set the OMP_THREAD_LIMIT variable to 12 to enable Oases to use 12 CPUs.
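For completeness, the variable was exported in the shell before launching the pipeline:

    # OMP_THREAD_LIMIT is a standard OpenMP environment variable
    export OMP_THREAD_LIMIT=12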

We have 2 TB of RAM on the machine I have been using, so theoretically memory shouldn’t be an issue at all.

Seeing the sheer amount of memory Oases was trying to allocate, I immediately thought something must have gone wrong during installation, or that I made a mistake while using it, especially since other assemblers only required between 100 GB and 500 GB of memory for the same dataset.

Do you think this kind of problem could be helped with read normalisation?

Thanks in advance for any help / suggestions! Cheers, Dario


SchulzLab commented 7 years ago

Dear Dario, I was wondering if you had some unusual flags set during compilation. For example, the LONGSEQUENCES and BIGASSEMBLY flags should be turned off, but that is the case by default. Also, when you set MAXKMERLENGTH for compilation, only use the maximum k-mer value that you actually use (here 35), or close to it, because Velvet/Oases may otherwise use a very space-consuming representation of the k-mers.
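For reference, a build with a tight k-mer cap would look something like this (assuming the standard Velvet/Oases Makefiles; directory paths are illustrative):

    # compile Velvet, then Oases against it, with MAXKMERLENGTH matching the largest k used
    cd velvet && make 'CATEGORIES=2' 'MAXKMERLENGTH=35' 'OPENMP=1'
    cd ../oases && make 'VELVET_DIR=../velvet' 'CATEGORIES=2' 'MAXKMERLENGTH=35' 'OPENMP=1'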

Other than that, I am also very puzzled that Oases would take that much memory; I have never seen this in practice, in particular not with such a relatively small read number (~41 million). Still, Oases is very memory-hungry in comparison to many other assemblers, for example SOAPdenovo-Trans. If the aforementioned approaches do not fix the issue, you should use read normalization with a threshold that does not discard most of the data.

cladobranch commented 7 years ago

Dear Marcel,

we have re-compiled Oases with the following settings:

    CATEGORIES = 2
    MAXKMERLENGTH = 35
    OPENMP

Unfortunately we are still facing the same problem. The error message I am getting reads like this:

velvetg: Can't calloc 281474976710656 void*s totalling 2251799813685248 bytes: Cannot allocate memory
Traceback (most recent call last):
  File "/var/data/tools/oases/0.2.09/scripts/oases_pipeline.py", line 130, in <module>
    main()
  File "/var/data/tools/oases/0.2.09/scripts/oases_pipeline.py", line 124, in main
    singleKAssemblies(options)
  File "/var/data/tools/oases/0.2.09/scripts/oases_pipeline.py", line 52, in singleKAssemblies
    assert p.returncode == 0, "Velvetg failed at k = %i\n%s" % (k, output[0])
AssertionError: Velvetg failed at k = 21
[0.000000] Reading roadmap file /var/data/dkarmeinski/Transcriptomes/Assemblies/Oases/01_Dendronotus_orientalis_21/Roadmaps
[131.941844] 41896868 roadmaps read

Is it possible that the program requires a certain version of C++ and/or Python?

Cheers, Dario


SchulzLab commented 7 years ago

Hi Dario, I don't think it has to do with the C version. I would suggest using read normalization.

Regards, Marcel

minor7b5 commented 7 years ago

Dear Marcel,

I've also hit similar issues (currently and in the past). I agree the issue is most likely due to the data not being normalised, since everything works without problems on smaller datasets, so thanks for recommending some tools. Dario, did you find this worked for you?

May I ask if either of you have any recommendations for the options to apply to ORNA?

Many thanks, Ali

MarcelS commented 7 years ago

Dear Ali,

the way we use ORNA is to set its k-mer parameter to the smallest k-mer size used in the assembly (because ORNA preserves k-mer connectivity in the de Bruijn graph). Then use a threshold value of around 1.3, as this normally leads to significant reductions (50-70%, depending on dataset complexity) and thus eases memory and runtime problems. If one has a very large dataset, say 600 million reads or so, it may be worth trying more stringent thresholds for ORNA to reduce further.
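As a concrete sketch for a k = 21..35 assembly (again, the ORNA flag names are from memory; the pipeline call follows the pattern used earlier in this thread):

    # normalize with k-mer = smallest assembly k and a threshold of 1.3 (flag names are assumptions)
    ORNA -input reads.fa -output reads_norm -kmer 21 -base 1.3
    # then assemble the normalized, interleaved reads
    python oases_pipeline.py -m 21 -M 35 -o out -d " -fasta -shortPaired reads_norm.fa "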

Hope this helps,

Marcel



minor7b5 commented 6 years ago

Thanks for the advice - I have used ORNA and successfully run some velvet/oases assemblies.

ksil91 commented 6 years ago

Hello,

I came across this thread because I have the same issue with my 100 bp PE RNA-Seq data, with velvet requiring insanely high RAM for some k-mer values (35, 45, 55) but not for higher values. I ran ORNA as suggested on my paired-end reads, but it outputs a single fasta file with both right and left reads. Can velvet accept this file format, or do I have to split the fasta file into 2 files?

Also, any idea why velvet would work for lower k-mer values but not higher ones? I should also say that the data are pooled from multiple individuals.

SchulzLab commented 6 years ago

Hi Katherine, velvet accepts this interleaved paired-end format, as mentioned in the documentation (see here https://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf). It should work fine; you don't need to split it up into 2 files.
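For example (filenames illustrative):

    # with -shortPaired, velveth reads interleaved pairs from a single file
    velveth out_dir_21 21 -fasta -shortPaired reads_norm.fa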

Hope that helps, Marcel

SchulzLab commented 6 years ago

Regarding the question of why higher k-mer values take too much RAM: the higher k, the larger the number of unique k-mers, which form unique nodes in the graph. At higher k-mer values, presumably too many nodes and their connecting data structures are created, blowing up the memory.
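One way to see this on your own data is to count distinct k-mers for several values of k, for example with Jellyfish (assuming it is installed; filenames illustrative):

    # count canonical k-mers and report how many distinct ones exist per k
    for k in 21 35 45 55; do
      jellyfish count -m $k -s 2G -C -o k$k.jf reads.fa
      jellyfish stats k$k.jf
    done

The distinct k-mer count reported by jellyfish stats shows how the number of unique k-mers changes with k, and that count tracks the number of nodes the graph has to hold.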