FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Extremely slow bismark bisulfite mapping using bowtie2 and --score_min #106

Closed · sansense closed this issue 7 years ago

sansense commented 7 years ago

I am trying to map paired-end reads from an amplicon sequencing experiment using Bismark. The command line is:

bismark --un --ambiguous --maxins 1000 --non_directional --score_min L,-0.6,-1 bismark-genome-folder -1 R1.fastq.gz -2 R2.fastq.gz

This is running extremely slowly, at around 50,000 reads/hour/thread (an estimate from the BAM file). I've checked the mapping on three different machines with different specs, using 1) a single thread, 2) 4 threads and 3) 10 threads. The resultant speed I'm getting is similar in all cases.
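For the estimate I simply counted the reads written to the BAM so far at two time points and divided the difference by the elapsed time, along these lines (the file name here is just a placeholder):

samtools view -c output_pe.bam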

I had run the same mapping in single-end mode in the past (for R1 and R2 separately) and it ran at a reasonable speed. I don't remember the exact speed, but it finished within 1-2 days, whereas I'm still waiting for the paired-end results after 3 days (single thread) or 1 day (10 threads).

Next, I did a quick check of the timing difference using the first 1000 reads of the same sample as above:

single-end

time bismark --un --ambiguous --non_directional --score_min L,-0.6,-1 bismark-genome-folder  1kr1.fq

real    0m7.883s
user    0m2.523s
sys 0m0.101s

paired-end

time bismark --un --ambiguous --non_directional --score_min L,-0.6,-1  bismark-genome-folder  -1 1kr1.fq -2 1kr2.fq

real    1m25.260s
user    0m8.001s
sys 0m0.141s

This suggests that paired-end mapping runs at least 10 times slower! Has anyone faced similar issues? How can I make the mapping faster?

Finer details: my reference sequences (~250 in total) are almost identical (apart from a 30-40 bp stretch that differs in the middle of each sequence) and are around 800 bp long each. The reads are 150 bp and untrimmed.

PS: Cross-posted on Biostars: https://www.biostars.org/p/252276/

FelixKrueger commented 7 years ago

Hi sansense,

I can see several points that could potentially speed this up dramatically; there is no reason the mapping should take this long. Here are my suggestions, in no particular order:

Recommendation:

If I had to suggest a workflow, it would be this:

  1. Trim the reads using Trim Galore; the command is simply: trim_galore --paired R1.fastq.gz R2.fastq.gz
  2. Run Bismark with the following command: bismark --non_directional bismark-genome-folder -1 R1_val_1.fq.gz -2 R2_val_2.fq.gz

I would assume that this command would align many million reads/hour, and on top of that I would trust the results. Depending on the way your alignments were performed, you should probably exclude some of the alignment strands from the results later on (often only the OT and CTOT strands produce the hits you are after, but you can see this from the mapping report). If you still fancy a (near linear) speed increase you could use --parallel, provided you have enough system resources (in your case this would mainly be CPU dependent, because your reference genome is so small that RAM shouldn't be an issue).
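Put together, a complete run might look like this (the file names follow Trim Galore's _val_1/_val_2 naming convention, and the core count of 4 is just an example):

trim_galore --paired R1.fastq.gz R2.fastq.gz
bismark --non_directional --parallel 4 bismark-genome-folder -1 R1_val_1.fq.gz -2 R2_val_2.fq.gz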

Let me know how you get on. Best, Felix

sansense commented 7 years ago

Hi Felix, First of all, thanks for all your valuable comments, suggestions and feedback. Indeed, the culprit is --score_min L,-0.6,-1, as I suspected (that's why it is in the subject line of the question). Keeping all other options intact and removing just --score_min brings everything back in order:

time bismark --un --ambiguous --non_directional   bismark-genome-folder -1 1kr1.fq -2 1kr2.fq

real    0m5.474s
user    0m3.863s
sys 0m0.100s

My question is: why does the --score_min parameter slow the mapping down by about an order of magnitude in paired-end mode but not in single-end mode?

The parameters I chose were due to historical reasons. Because of a bad Illumina MiSeq run, we suspected poor sequencing quality for these amplicons in the past. When I tried to map those reads, the mapping percentage was awful (maybe 10% in paired-end mode and 20-30% in single-end mode). We were trying to get some preliminary information out of these sequences anyway (even if it was not very precise, just to see some trends in our data), so I relaxed the mapping as much as possible and used single-end mode to get a decent mapping percentage. I have some R scripts which take the methylation extractor output for downstream analysis. Then we ran the same samples on a HiSeq, and I was trying to compare the data obtained from the MiSeq and the HiSeq, keeping the same parameters for mapping and analysis. That correlation was good (~0.9). At this point, I thought I would recheck what happens if I map the HiSeq data in paired-end mode: does it correlate with HiSeq single-end mode in the downstream analysis? And this got me to this whole speed problem. BTW, the only good news is that the single-end and partial paired-end analyses (using the 20% of the data mapped so far) appear to correlate well.

If you run FastQC on the data and the Illumina adapter (red line) shows up in the adapter content plot, it is definitely a good idea to trim

Not really. FastQC is very clean as far as adapter contamination goes:

[screenshot: FastQC adapter content plot]

And the historical reason for not trimming: when I started using Bismark (it was my first experience with it), I was in a hurry to process the data, which is why I missed the recommendation to trim with Trim Galore. Later I did see it (when I got very poor mapping) and ran it on a small set of data to check whether it improved the mapping. I didn't get any significant improvement, so I continued mapping untrimmed reads (I was also afraid that trimmed reads of unequal lengths might cause more problems in the downstream analysis, as this is a rather involved project).

A simple run through Trim Galore with the command trim_galore --paired R1.fastq.gz R2.fastq.gz will probably result in an amazing speed increase by itself. Lots of mismatches or adapter contamination will cause Bowtie 2 to try much harder than is needed to produce alignments.

I don't remember considering speed as a factor for using Trim Galore; for me, the more important point at the time was the mapping percentage. BTW, I am now wondering whether you have any empirical data showing drastic speedups from Trim Galore?

using -X 1000 is normally unnecessary because most bisulfite reads are much shorter than this (typically peaking around 150bp).

This again was a hack, as Bismark was complaining about something related to the insert size and was discarding reads (I don't remember the exact error message now). I'll check the insert lengths in the BAM file later, as per your suggestion.
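Something like this should do for a quick look at the insert sizes (just a sketch; column 9 of the SAM output is the TLEN field, and the BAM name is a placeholder):

samtools view mapping_pe.bam | awk '$9 > 0 {print $9}' | sort -n | uniq -c | tail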

You said you tried using more threads for the alignments, did you use -p for this or --parallel (the former --multicore)?

I used --multicore, and I was not aware of these issues with the multithreading parameters. BTW, what is your recommendation for optimally combining --parallel and -p?

I am on version v0.16.0 and I see that a new version is out on the web. Are there major changes / performance improvements?

Thanks again for your patience and help! Best, Santosh

FelixKrueger commented 7 years ago

Hi Santosh,

It's good to hear that you are mostly back on track. I suppose it is simply much more computationally intensive to take care of a paired-end alignment, especially when the reference sequences are repetitive and you are allowing a ton of mismatches. Why this is an order of magnitude slower is probably a question for the Bowtie 2 people, to be honest.
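To put rough numbers on it (assuming Bowtie 2's default penalties, where a mismatch costs up to 6 points): with --score_min L,-0.6,-1 the minimum accepted alignment score for a 150 bp read is -0.6 + (-1 × 150) ≈ -151, which tolerates something in the region of 25 mismatches, whereas Bismark's default of L,0,-0.2 gives 0 + (-0.2 × 150) = -30, i.e. roughly 5 mismatches. In paired-end mode that hugely enlarged search space applies to both mates, and valid pairings have to be found on top of that.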

I agree that the sequences don't appear very adapter-contaminated; maybe this is owed to the way you constructed the libraries in the first place. Generally, as soon as you start adding wrong sequence to the ends of reads, in the form of read-through adapter contamination or poor-quality base calls, the mapping will take longer, because the aligner then has to start looking into mismatches, gaps etc. in order to align a read (pair). It has been a long time since we ran any alignments without prior trimming, so I can't really give you any empirical examples; I just remember from the review we wrote in 2012 that aligning untrimmed sequences added a lot of extra time (while still producing low mapping efficiencies). I guess it wouldn't hurt to run a quick test with, say, a million reads with and without trimming to see what difference it makes in your setting.

Regarding the multi-threading: we personally mostly run alignments without parallelisation, but this is mainly because we often run dozens of samples at the same time. We've heard that increasing -p beyond 4 does not result in any further speed increase, while --parallel seems to scale more or less linearly. So yes, I would probably just use the latter and forget about -p, to be honest.

There have been several updates since version 0.16.0 (see here: https://github.com/FelixKrueger/Bismark/blob/master/CHANGELOG.md), so updating to the latest development version is recommended. I am also going to make a new release v0.17.1 in the next few days.

All the best, Felix

sansense commented 7 years ago

Dear Felix, I was able to solve the paired-end mapping problem using your advice (new version 0.18_dev, trimming, and removing --score_min). However, I am now facing another problem: a very slow methylation extractor in paired-end mode. It has been running for the last 6 hours and is still not finished. Is this usual? samtools view -c gives about 26 million reads in the BAM (150 bp untrimmed read length). I ran it in two ways:

1. ~/utils/Bismark/bismark_methylation_extractor -p --comprehensive --merge_non_CpG --CX --cytosine_report --include_overlap --multicore 16 --samtools_path /software/UHTS/Analysis/samtools/1.3/bin -o s4_trimmed_bismarkNew --genome_folder bismark-index_primerProductsBisABCD_hiseq_NNNNN s4_trimmed_bismarkNew/*.bam > meth.log 2> meth.err

This seems to run forever (>1 day), and the culprit seems to be the sort. But I have the context files generated by this command, so I used them to run bismark2bedGraph with the --ample_memory option, like this:

2. ~/utils/Bismark/bismark2bedGraph -o s4-cytocine --ample_memory --dir cytosine_reports --CX_context s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt s4_trimmed_bismarkNew/Non_CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt > bismark2bedGraph.log 2> bismark2bedGraph.err

This has been running for almost 6 hours. It has not reached the sorting stage yet, as the last line of the log says "Now writing methylation information for file >>Non_CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt<< to individual files for each chromosome"

In single-end mode, I was able to extract the methylation status in about 2 hours for each read set (R1 and R2).

Thanks and Regards, Santosh

FelixKrueger commented 7 years ago

Hi Santosh, Generally I would expect anything to do with non-CG information to take a very long time, but since your reference sequences are only very short I would expect things to go very fast indeed. So maybe something else is not working as expected...

Just as a general comment, you normally don't want to use --include_overlap, ever. --multicore 16 will speed up the extraction process, but has no influence on the sorting step.

I would take it step by step, and first of all lose the --ample_memory option in the bismark2bedGraph command. Don't redirect all the logs; just watch what is happening on screen. It should take no more than a few minutes until the sorting starts. Another option would be to try just the CpG context file first; this should also be a lot quicker.
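In other words, something along these lines (using the same input file as in your command above):

bismark2bedGraph -o s4_CpG --dir cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt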

Let me know how you get on.

Cheers, Felix

sansense commented 7 years ago

Hi Felix, Thanks again for your help, useful as always!

I am running some tests based on your advice (removing the --include_overlap and --ample_memory options). One thing I am not really clear about is the pair of options --buffer_size and --ample_memory. They look like they do the same thing, but in a mutually exclusive way. Also, --ample_memory comes with the caveat:

Due to overheads in creating and looping through these arrays it seems that it will actually be slower for small files (few million alignments), and we are currently testing at which point it is advisable to use this option.

My question is: what are the use cases for --buffer_size and --ample_memory? Can these options be combined? Do their use cases depend on the amount of memory available relative to the genome size? If you have a good amount of memory to spare (~70-100 GB), which option would you prefer?

FelixKrueger commented 7 years ago

By default, bismark2bedGraph uses the UNIX sort command. --buffer_size specifies the amount of memory to be used by sort; in addition, sort writes temporary files to disk (into the output directory) if the memory allowed for sorting is exhausted (the default is 2G, I believe).

--ample_memory on the other hand uses the Perl internal sort function, which means that the sorting will happen in memory only. Since Perl arrays are not the most memory efficient data structures this might require quite a bit of RAM, but once it is in memory the process itself should be fairly quick again.

With the tiny references you have it shouldn't make much of a difference, to be honest, but it almost sounds like there is something fishy going on with the --ample_memory option... Unless you have a specific reason for selecting particular options, I would always go with the defaults.
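If you do have plenty of memory to spare, I would rather give the UNIX sort more of it than switch to --ample_memory, e.g. (the file names here are placeholders):

bismark2bedGraph --buffer_size 10G -o sample.bedGraph --dir output_dir CpG_context_sample.txt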

sansense commented 7 years ago

Dear Felix, thanks again for the help and feedback. Here are some updates regarding my tests (extraction only on CpGs).

Test1: using --ample_memory

time ~/utils/Bismark/bismark2bedGraph -o s4-cytocine_ampMem --ample_memory --dir tmp_cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt > ampMem.log 2> ampMem.err

Test2: using --buffer_size 50%

time ~/utils/Bismark/bismark2bedGraph -o s4-cytocine_buff50 --buffer_size 50% --dir tmp_cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt > buff50.log 2> buff50.err

Test1 finished correctly, while Test2 exited with these errors:

Sorting input file CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chremptyVector.methXtractor.temp by positions (using -S of 50%) The temporary inputfile CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chremptyVector.methXtractor.temp could not be deleted No such file or directory

Sorting input file CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chrFoxD3.wt_A.methXtractor.temp by positions (using -S of 50%) sort: open failed: CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chrFoxD3.wt_A.methXtractor.temp: No such file or directory Died at /home/sanand/utils/Bismark/bismark2bedGraph line 486.

The runtimes were as follows:

Test1
real    675m16.639s
user    49m15.804s
sys     2m42.680s

Test2
real    677m33.749s
user    150m11.880s
sys     3m1.833s

So my conclusion is that the extraction process is extremely slow either way. I think Test2 exited because of a conflict in the temp files created by Test1 and Test2, as they were being created in the same directory; otherwise Test2 was working fine up to the sorting stage. This also makes me conclude that Test2 was lagging well behind Test1, as it was still at the sorting stage when Test1 finished the job.

I'll try to run these tests again in a cleaner way (in two separate directories). But as it stands, this seems extremely slow. What is your comment/feedback on this? Note: the CpG_context file contains 180,160,939 lines in total (wc -l output).

Some feedback for a future release (let me know if you want these moved to an independent thread):

The output options are inconsistent between the tools: -o means the output directory in the methylation extractor but the output file name in bismark2bedGraph:

meth_extractor: -o/--output DIR : Allows specification of a different output directory (absolute or relative path). If not specified explicitly, the output will be written to the current directory.

bismark2bedGraph: -o/--output : Name of the output file, mandatory. --dir : Output directory. Output is written to the current directory if not specified explicitly.

sansense commented 7 years ago

Updates: I ran three cleaner tests in separate directories:

Test1: using --ample_memory
Test2: using --buffer_size 50%
Test3: using both --ample_memory and --buffer_size 50%

Test1 and Test3 finished in almost equal times. I was expecting this, as turning on --ample_memory means --buffer_size is not used anyway (no UNIX sorting), but I just wanted to be sure, hence this test. Comparing the Test1 output files from this run vs. the last one, they are exactly the same. So I assume that the previous run was also fine, although it took twice as long as this time!

Test2 is still running....

Test1
real    325m4.278s
user    45m59.528s
sys     1m31.593s

Test3
real    325m12.389s
user    45m48.198s
sys     1m31.459s

FelixKrueger commented 7 years ago

Hi sansense,

Is there any update on the --buffer_size 50% option? I was thinking maybe it has to do with the % value you specified? I always use --buffer_size 10G over here.

Thanks for your general comments above. I know that some of them could be helpful for certain testing scenarios (e.g. a random unique portion or time stamp in temp file names), but they won't have any immediate effect on actual day-to-day alignment jobs. I will keep them in mind though if I have a bit of spare time available.

Regarding testing the parameters before the run starts: I have started to make an effort to test most things before the run actually commences to avoid wasting time, so can you remember what it actually was that caused this? I'd be happy to add ahead-of-run checks if that is possible at all. Regarding starting a run from the already converted files: there is currently no option to do this, and in the vast majority of cases you will never need to go back to the mapped file a second time. So while this seems like a useful feature, I would argue that it also falls into the category of nice-to-haves that would cost me quite some time to implement (especially considering all the different alignment modes, with or without the --parallel option...) and would probably only ever be used in a handful of special testing cases...

And lastly, regarding the harmonisation of options: while I have already tried to harmonise the options as much as possible, I realise that there are still some cases (e.g. the output directory) that are at odds. As you can probably imagine, this has mostly historic reasons, and if I were to design a suite of tools from scratch again I would almost certainly start out with a unified set of options/commands. We have at least tried to keep the options between Bismark and the methylation extractor the same, and under normal circumstances there is no need for users to run bismark2bedGraph or coverage2cytosine individually, as they are already integrated into the methylation extractor, so again this would only ever affect a very small number of people...

sansense commented 7 years ago

Hi Felix, the time for --buffer_size 50% was:

real    427m6.137s
user    152m55.952s

FelixKrueger commented 7 years ago

So the Linux sort command takes ~1h 30min longer for the same data set. Is the output equivalent?

sansense commented 7 years ago

Unix cmp says that they are exactly the same.

FelixKrueger commented 7 years ago

That's excellent, so it's all working then. Shall we close the issue?

sansense commented 7 years ago

My question was about the extraction time, which seems to be pretty slow. A full extraction (CpG + non-CpG) will take almost half a day. Is that usual?

FelixKrueger commented 7 years ago

Technically, the extraction process (which comes first) always extracts both CpG and non-CG context. This process performs many billions of operations, but it can be sped up using --parallel. The sorting process, which comes afterwards, can take a while, since you need to keep track of tens of millions (CpG) or even billions (all Cs) of C positions at the same time. A couple of hours, or up to a day, is quite common indeed. This could probably be optimised by clever parallel sorting, but given that you only run this process once and then never touch it again, we think it's worth waiting a couple of extra hours... (especially since getting from sample prep to alignment most likely took several weeks or months already...)
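For the extraction step itself, something along these lines should scale reasonably well (a sketch; the BAM name and core count are placeholders, and on versions around v0.18 the option is spelled --multicore):

bismark_methylation_extractor -p --comprehensive --multicore 8 sample_pe.bam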

Nitin123-4 commented 5 months ago

Hi Team,

I am running Bismark with the command below, and I can see it's really slow.

bismark --bowtie2 -N 1 --parallel 4 $RESOURCES2/HG38/ -1 $Read1 -2 $Read2 --output_dir $PWD --temp_dir $PWD/$SAMPLEID"_TEMP" --prefix $SAMPLEID

I have 218,289,382 total reads, i.e. 32.96 Gb of data. I did pre-processing using Trimmomatic, and the filtered reads are used here. It took ~22 h to complete. Can you please help with this?

Bismark Version: v0.24.0