FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Bismark_genome_preparation fails to make index for large genome (22GB) #368

Closed embertucci closed 4 years ago

embertucci commented 4 years ago

I am currently trying to use the Bismark_genome_preparation script to build a genome index for an RRBS experiment. I have done this successfully in the past on a small genome (750 MB), but the same script is failing when I use it on a much larger genome (22 GB). I have re-run the script several times with various amounts of walltime and memory requested, but frequently get incomplete output files with no indication from the standard output/error that Bismark has malfunctioned or run out of memory. In each run I've done, different files seem to be missing or incomplete. For example, in the last run I requested 400 GB of memory and 100 hours of walltime; Bismark 'completed' after ~24 hours, but the file 'BS_CT.2.bt2l' was completely empty and the 'BS_CT.rev.1.bt2l' and 'BS_CT.rev.2.bt2l' files were non-existent. In previous iterations, 'BS_GA.2.bt2l' was empty and the 'rev' files were missing from the GA conversion folder. It seems like Bismark is quitting at various steps in the process, but I cannot figure out why.
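As a quick way to catch this failure mode before launching an alignment, here is a small sanity-check sketch (not part of Bismark; the function name and the mock-directory demo are my own) that verifies each conversion folder contains all six Bowtie2 large-index files and that none of them are zero bytes:

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: after bismark_genome_preparation finishes,
# confirm that every expected .bt2l file exists and is non-empty.
set -u

check_bismark_index() {
    local dir="$1" missing=0
    for conv in CT GA; do
        for part in 1 2 3 4 rev.1 rev.2; do
            local f="$dir/${conv}_conversion/BS_${conv}.${part}.bt2l"
            if [ ! -s "$f" ]; then   # -s: exists AND has size > 0
                echo "missing or empty: $f"
                missing=1
            fi
        done
    done
    return "$missing"
}

# Demo on a mock directory reproducing the failure described above
# (an empty BS_CT.2.bt2l); real usage: check_bismark_index Bisulfite_Genome
mock=$(mktemp -d)
for conv in CT GA; do
    mkdir -p "$mock/${conv}_conversion"
    for part in 1 2 3 4 rev.1 rev.2; do
        echo data > "$mock/${conv}_conversion/BS_${conv}.${part}.bt2l"
    done
done
: > "$mock/CT_conversion/BS_CT.2.bt2l"   # truncate one file to zero bytes

if check_bismark_index "$mock"; then
    RESULT="complete"
else
    RESULT="incomplete"
fi
echo "index $RESULT"
rm -rf "$mock"
```

Pointing `check_bismark_index` at the real `Bisulfite_Genome` directory would flag exactly the empty `BS_CT.2.bt2l` and missing `rev` files described above.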

I'm re-running with even more memory/RAM requested but curious if you have any insight into what might be going wrong? If it is helpful, I am working with the Pinus taeda genome (Ptaeda2.0) Thank you in advance for your help!

FelixKrueger commented 4 years ago

Hi Emily,

This is certainly one big genome... I have in the past added an option --large_index for when the auto-detection is not working as expected (even though it seems to work for you). We got it to work with the Axolotl genome in the end though, please see here: https://github.com/FelixKrueger/Bismark/issues/251

For the Axolotl, with --parallel 4 (8 cores total, and ~183GB RAM usage) the wallclock time was ~18-20 hours. I can run a test here (I am currently downloading the genome with wget from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/404/065/GCA_000404065.3_Ptaeda2.0/GCA_000404065.3_Ptaeda2.0_genomic.fna.gz, is that the one you want?). I'll report back to you once I know more (I can probably also make an indexed version of it available for you assuming it doesn't fail....)

embertucci commented 4 years ago

Hi Felix,

Thank you so much for your quick reply. Yes, that is the exact version of the loblolly genome I am working with. Thanks for linking the similar issue with the axolotl genome - my newest submission includes the --parallel 4 option, but given the resources I requested, I probably won't know if this will have worked for ~24-48 hours. I'll update you when that script completes. I appreciate your help with this issue and looking forward to the outcome of your test.

FelixKrueger commented 4 years ago

OK. I'll race you then! :P

FelixKrueger commented 4 years ago

A run with 200GB of memory failed after 50 minutes. The run with 300GB is still running. I just found this on Wikipedia:

Its genome, with 22 billion base pairs, is seven times larger than that of humans.[9][10] As of 2018, assembly of the axolotl genome (32Gb) displaced loblolly pine as the largest assembled genome.[11]

Given that we managed to get the Axolotl genome working, I am now positive that it will also work for the loblolly pine!

FelixKrueger commented 4 years ago

Hi Emily,

The indexing using --parallel 4 has now completed:

Run time 17:43:44, COMPLETED, ExitCode 0

The files seem to be all there:

GA_conversion:
total 81081908
-rw-rw-r-- 1 fkrueger bioinf 22433608920 Sep  1 17:20 genome_mfa.GA_conversion.fa
-rw-rw-r-- 1 fkrueger bioinf  5132911405 Sep  1 17:24 BS_GA.4.bt2l
-rw-rw-r-- 1 fkrueger bioinf    46310715 Sep  1 17:24 BS_GA.3.bt2l
-rw-rw-r-- 1 fkrueger bioinf 10265822820 Sep  2 01:32 BS_GA.2.bt2l
-rw-rw-r-- 1 fkrueger bioinf  6982787841 Sep  2 01:32 BS_GA.1.bt2l
-rw-rw-r-- 1 fkrueger bioinf 10265822820 Sep  2 10:27 BS_GA.rev.2.bt2l
-rw-rw-r-- 1 fkrueger bioinf  6982787841 Sep  2 10:27 BS_GA.rev.1.bt2l

CT_conversion:
total 81082436
-rw-rw-r-- 1 fkrueger bioinf 22433608920 Sep  1 17:20 genome_mfa.CT_conversion.fa
-rw-rw-r-- 1 fkrueger bioinf  5132911405 Sep  1 17:24 BS_CT.4.bt2l
-rw-rw-r-- 1 fkrueger bioinf    46310715 Sep  1 17:24 BS_CT.3.bt2l
-rw-rw-r-- 1 fkrueger bioinf  6982787841 Sep  2 02:13 BS_CT.1.bt2l
-rw-rw-r-- 1 fkrueger bioinf 10265822820 Sep  2 02:13 BS_CT.2.bt2l
-rw-rw-r-- 1 fkrueger bioinf 10265822820 Sep  2 10:44 BS_CT.rev.2.bt2l
-rw-rw-r-- 1 fkrueger bioinf  6982787841 Sep  2 10:44 BS_CT.rev.1.bt2l

So the winning command was:

ssub --email --mem=300G -c8 bismark_genome_preparation --large --parallel 4 --verbose .
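For anyone reading along: `ssub` is a site-local submission wrapper, so here is a rough equivalent as a plain SLURM batch script. The time limit, mail directive, and the assumption that Bismark and Bowtie2 are already on the PATH are mine, not from the run above (Bismark's help spells the option `--large-index`; the abbreviated `--large` was accepted here):

```shell
#!/usr/bin/env bash
#SBATCH --mem=300G
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --mail-type=END,FAIL

# Run from the directory containing the genome FASTA file(s);
# --parallel 4 launches the C->T and G->A indexing in parallel,
# using 2 threads each for the top and bottom strand indexes.
bismark_genome_preparation --large-index --parallel 4 --verbose .
```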

Interestingly, without specifying --parallel 4 (so using only 2 instead of 8 cores), the command failed with both 200 and 300 GB of RAM, but seems to be still running with 500 GB of RAM. I would expect the single-core process to take at least another 1-2 days...

embertucci commented 4 years ago

Hi Felix,

Awesome! It's interesting that the single core command fails when given less than 500 GB of RAM (I had requested up to 400 GB and had the same failure.) I've got the new command lined up in the queue and am looking forward to replicating this. Thank you so much for your help - I really appreciate it!

Best, Emily

FelixKrueger commented 4 years ago

Just a very brief update, the single-core (well, 2 in fact), also finished:

ssub --email --mem=500G -c2 bismark_genome_preparation --large --verbose .

Run time 1-03:39:25, COMPLETED, ExitCode 0

Interestingly, it didn't take all that much longer...

embertucci commented 4 years ago

My script finally completed as well, with a run time of about 17 hours (500 GB, single-core). Thanks again so much for your help!

FelixKrueger commented 4 years ago

Excellent, I'm glad it worked. Next time, remember to pick an organism with a tiny genome :P