Hi sansense,
I can see several points that can potentially speed this up dramatically; there is no reason the mapping should take very long. Here are my suggestions, in no particular order:
Trimming: A simple run through Trim Galore with the command
trim_galore --paired R1.fastq.gz R2.fastq.gz
will probably result in an amazing speed increase by itself. Lots of mismatches or adapter contamination will cause Bowtie 2 to try much harder than is needed to produce alignments. Apart from that, you might see a lot of incorrect methylation calls and a much reduced mapping efficiency if your reads were adapter contaminated. If you run FastQC on the data and see the Illumina adapter (red line) show up in the adapter content plot, it is definitely a good idea to trim (and I would expect 150bp long sequences to be heavily affected).
-X 1000: Using -X 1000 is normally unnecessary because most bisulfite reads are much shorter than this (typically peaking around 150bp). To check the insert distribution you could for example import your BAM file into SeqMonk and look at the Read Length histogram. If all reads are much shorter than 500 (the default) you don't need to change it. Allowing much longer inserts will also make it take considerably longer to find good alignments.
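(If SeqMonk isn't handy, something along these lines would also give a quick look at the insert sizes straight from the BAM file; just a sketch, the BAM name is a placeholder, and TLEN is column 9 of the SAM format:)
samtools view -f 0x2 your_sample_pe.bam | awk '$9 > 0 {print $9}' | sort -n | uniq -c | less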
--score_min L,-0.6,-1: This is probably the second biggest problem (the first one being the use of untrimmed reads). Using relaxed parameters like this will allow penalties of 150 per read, or in other words 50(!) non-bisulfite mismatches per read pair, or a comparable amount of gaps, and this doesn't even include Cs or Ts. In other words, because your reference sequences seem to be very similar apart from the middle sequences, the alignments are trying super hard to find any good hit amongst many ambiguous mappings. Undoubtedly this will take a very long time, and I am not even sure I would trust the results.
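(To make that penalty budget concrete, here is a rough back-of-the-envelope check; my own arithmetic, assuming 150 bp reads and Bowtie 2's worst-case mismatch penalty of 6:)
awk 'BEGIN { len = 150; min = -0.6 - len; printf "min score %.1f, ~%d mismatches/read, ~%d per pair\n", min, int(-min/6), 2*int(-min/6) }'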
Multi-threading: You said you tried using more threads for the alignments, did you use -p for this or --parallel (the former --multicore)? I would assume that increasing -p has hardly any effect, but setting --parallel 2 should already halve the necessary time.
If I had to suggest a workflow it would be this:
trim_galore --paired R1.fastq.gz R2.fastq.gz
bismark --non_directional bismark-genome-folder -1 R1_val_1.fq.gz -2 R2_val_2.fq.gz
I would assume that this command would align many million reads/hour, and on top of that I would trust the results. Depending on the way your alignments were performed you should later on probably exclude some of the alignment strands from the results (often only the OT and CTOT strands produce the hits you are after, but you can see this from the mapping report). If you still fancy a (near linear) speed increase you could use --parallel if you have enough system resources (in your case this would be mainly CPU dependent because your reference genome should be so small that RAM isn't an issue).
Let me know how you get on. Best, Felix
Hi Felix,
First of all, thanks for all your valuable comments, suggestions and feedback. Indeed, the culprit is --score_min L,-0.6,-1, as I was expecting too (that's why it is in the subject line of the question). Keeping all other options intact and removing just --score_min brings everything back in the right order:
time bismark --un --ambiguous --non_directional bismark-genome-folder -1 1kr1.fq -2 1kr2.fq
real 0m5.474s
user 0m3.863s
sys 0m0.100s
My question is: why does the --score_min parameter affect the speed of mapping by about an order of magnitude only in pe-mode and not in se-mode?
The parameters that I chose were due to some historical reasons. Due to a bad Illumina MiSeq run, we suspected poor sequencing for these amplicons in the past. When I tried to map them, the mapping % was awful (maybe 10% in pe-mode and 20-30% in se-mode). We were trying to get some preliminary info from these sequences anyway (even if it was not very precise, just to get some idea/trend about our data). So I relaxed the mapping as much as possible and used se-mode to get a decent percentage of mapping. I have some R scripts which take this methylation extractor data for downstream analysis. Then we ran the same samples on HiSeq, and I was just trying to compare the data obtained from MiSeq and HiSeq, keeping the same parameters for mapping and analysis. That correlation was good (~0.9). At this point, I thought to recheck what happens if I map the HiSeq data in pe-mode: does it correlate with HiSeq se-mode in the downstream analysis? And this got me to all this speed problem. BTW, the only good news is that the se-mode and partial pe-mode analyses (using the 20% of the data mapped so far) appear to correlate well.
If you run FastQC on the data and see the Illumina adapter (red line) show up in the adapter content plot, it is definitely a good idea to trim
Not really. FastQC is very clean as far as adapter contamination goes.
And the historical reason for not using trimming: when I started using Bismark (and it was my first experience), I was in a hurry to process the data, due to which I missed the recommendation to trim with Trim Galore. Later I did see it (when I got very poor mapping) and ran it on a small set of data as a check to see if it improves mapping. I didn't get any significant improvement in mapping, and thus I continued with untrimmed mapping (I was also afraid that trimmed reads of unequal lengths might cause more problems for me in the downstream analysis, as this is a bit of an involved project).
A simple run through Trim Galore with the command trim_galore --paired R1.fastq.gz R2.fastq.gz will probably result in an amazing speed increase by itself. Lots of mismatches or adapter contamination will cause Bowtie 2 to try much harder than is needed to produce alignments.
I don't remember considering speed as a factor for using trim_galore. For me, the more important point at that time was mapping %. BTW, I am now wondering if you have some empirical data showing drastic speedups using trim_galore?
using -X 1000 is normally unnecessary because most bisulfite reads are much shorter than this (typically peaking around 150bp).
This again was a hack, as Bismark was complaining about something related to insert size and was discarding reads (I don't remember the exact error message now). I'll check the insert lengths from the BAM file later, as per your suggestion.
You said you tried using more threads for the alignments, did you use -p for this or --parallel (the former --multicore)?
I used --multicore, and I was not aware of these issues with the multi-threading parameters. BTW, what is your recommendation for optimally combining --parallel and -p?
I am on version v0.16.0 and I see that a new version is out on the web. Are there major changes / performance improvements?
Thanks again for your patience and help! Best, Santosh
Hi Santosh,
It's good to hear that you are mostly on track again. I suppose it is just way more computationally intensive to take care of a paired-end alignment, especially when the reference sequences are repetitive and you are allowing a ton of mismatches. Why this is an order of magnitude slower is probably a question for the Bowtie2 people to be honest.
I agree that the sequences don't appear very adapter contaminated; maybe this is owed to the way you actually constructed the libraries in the first place. Generally, as soon as you start adding wrong sequence to the end of reads, in the form of read-through adapter contamination or poor quality base calls, the mapping will take longer, because you will then have to start looking into mismatches, gaps etc. in order to align a read (pair). It has been a long time since we ran any alignments without prior trimming, so I can't really give you any empirical examples; just from the review we wrote in 2012 I remember that aligning untrimmed sequences added a lot of extra time (while still producing low mapping efficiencies). I guess it wouldn't hurt to run a quick test with, say, a million or so reads with and without trimming to see what difference it makes in your setting.
Regarding the multi-threading: we personally mostly run alignments without parallelisation, but this is mainly because we often run dozens of samples at the same time. We've heard that increasing -p beyond 4 does not result in any further speed increase, while --parallel seems to behave more or less linearly. So yea, I would probably just use the latter and forget about -p, to be honest.
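Purely as an illustration of that (as far as I understand it, --parallel N splits the input and runs N Bismark instances whose outputs are merged at the end, so core and RAM usage scale with N), a run on a multi-core node could simply look like this, re-using the file names from the workflow sketch above:
bismark --non_directional --parallel 4 bismark-genome-folder -1 R1_val_1.fq.gz -2 R2_val_2.fq.gz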
There have been several updates since version 0.16.0 (see here: https://github.com/FelixKrueger/Bismark/blob/master/CHANGELOG.md), so updating to the latest development version is recommended. I am also going to make a new release v0.17.1 in the next few days.
All the best, Felix
Dear Felix, I was able to solve the pe-mapping problem using your advice (new version 0.18_dev, trimming and removing --score_min). However, I am now facing another problem: a very slow methylation extractor in pe-mode. It has been running for the last 6 hours and is still not finished. Is this usual? samtools view -c gives about 26 million reads in the BAM (150bp untrimmed read length). I ran it in two ways:
1. ~/utils/Bismark/bismark_methylation_extractor -p --comprehensive --merge_non_CpG --CX --cytosine_report --include_overlap --multicore 16 --samtools_path /software/UHTS/Analysis/samtools/1.3/bin -o s4_trimmed_bismarkNew --genome_folder bismark-index_primerProductsBisABCD_hiseq_NNNNN s4_trimmed_bismarkNew/*.bam > meth.log 2> meth.err
This seems to run forever (>1 day) and the culprit seems to be the sort. But I have the context files generated by this command, so I used them to run bismark2bedGraph with the --ample_memory option, like this:
2.
~/utils/Bismark/bismark2bedGraph -o s4-cytocine --ample_memory --dir cytosine_reports --CX_context s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt s4_trimmed_bismarkNew/Non_CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt > bismark2bedGraph.log 2> bismark2bedGraph.err
This has been running for almost 6 hours. It has not reached the sorting stage yet, as the last line of the log says "Now writing methylation information for file >>Non_CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt<< to individual files for each chromosome".
In single-end mode, I was able to extract the methylation status in about 2 hours for each read file (R1 and R2).
Thanks and Regards, Santosh
Hi Santosh, Generally I would expect anything to do with non-CG information to take a very long time, but since your reference sequences are only very short I would expect things to go very fast indeed. So maybe something else is not working as expected then...
Just as a general comment: you normally don't want to use --include_overlap, ever. --multicore 16 will speed up the extraction process, but has no influence on the sorting step.
I would probably take it step by step, and first of all lose the --ample_memory option in the bismark2bedGraph command. Just don't redirect all the log files but watch what is happening on screen; it shouldn't take more than a few minutes tops until the sorting starts. Another option would be to just try the CpG context files first, this should also be a lot quicker.
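In other words, a stripped-down test along these lines, using the paths from your commands above minus --ample_memory and restricted to the CpG context file (the output name is just an example):
~/utils/Bismark/bismark2bedGraph -o s4-cytosine_test --dir cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt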
Let me know how you get on.
Cheers, Felix
Hi Felix, Thanks again for your help, useful as always!
I am running some tests based on your advice (removing the --include_overlap and --ample_memory options). One thing I am not really clear about is the options --buffer_size and --ample_memory. They look like they do the same thing, but in a mutually exclusive way. Also, --ample_memory comes with the caveat:
Due to overheads in creating and looping through these arrays it seems that it will actually be slower for small files (few million alignments), and we are currently testing at which point it is advisable to use this option.
My question is: What are the use cases of --buffer_size and --ample_memory? Can these options be combined? Do the use cases depend upon the amount of memory available relative to the genome size? If you have a good amount of memory to spare (~70-100GB), which option would you prefer?
By default, bismark2bedGraph uses the UNIX sort command. --buffer_size specifies the amount of memory to be used by sort, in addition to writing out temporary files to disk (to the output directory) if the memory allowed for sorting is exhausted (the default is 2G, I believe).
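For illustration, the kind of call this boils down to is roughly the following (a generic GNU sort sketch, not the exact command bismark2bedGraph issues; the sort keys and file name are placeholders):
sort -S 2G -T cytosine_reports -k3,3 -k4,4n CpG_context_sample.methXtractor.temp > CpG_context_sample.sorted.txt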
--ample_memory on the other hand uses the Perl internal sort function, which means that the sorting will happen in memory only. Since Perl arrays are not the most memory-efficient data structures this might require quite a bit of RAM, but once it is in memory the process itself should be fairly quick again.
With the tiny references you have it shouldn't make much of a difference to be honest, but it almost sounds like there is something fishy going on with the --ample_memory option... Unless you have a specific reason for selecting specific options I would always go for the defaults.
Dear Felix, thanks again for the help and feedback. Here are some updates regarding my tests (extraction only on CpGs).
Test1: using --ample_memory
time ~/utils/Bismark/bismark2bedGraph -o s4-cytocine_ampMem --ample_memory --dir tmp_cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt > ampMem.log 2> ampMem.err
Test2: using --buffer_size 50%
time ~/utils/Bismark/bismark2bedGraph -o s4-cytocine_buff50 --buffer_size 50% --dir tmp_cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt > buff50.log 2> buff50.err
Test1 finished correctly, while Test2 exited with these errors:
Sorting input file CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chremptyVector.methXtractor.temp by positions (using -S of 50%) The temporary inputfile CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chremptyVector.methXtractor.temp could not be deleted No such file or directory
Sorting input file CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chrFoxD3.wt_A.methXtractor.temp by positions (using -S of 50%) sort: open failed: CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt.chrFoxD3.wt_A.methXtractor.temp: No such file or directory Died at /home/sanand/utils/Bismark/bismark2bedGraph line 486.
The runtimes were as follows:
real 675m16.639s
user 49m15.804s
sys 2m42.680s
real 677m33.749s
user 150m11.880s
sys 3m1.833s
So my conclusion is that the extraction process is extremely slow either way. I think Test2 exited because there was some conflict between the temp files created by Test1 and Test2, as they were being created in the same directory; otherwise Test2 was working fine up to the sorting stage. Also, this makes me conclude that Test2 was lagging far behind Test1, as it was only at the sorting stage when Test1 finished the job.
I'll try to run these tests again in a cleaner way (in two separate directories). But as such, it seems extremely slow. What is your comment/feedback regarding this? Note: the CpG_context file contains 180,160,939 lines in total (wc -l output).
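For the cleaner re-run I am planning to keep each test's temporary files in its own directory via --dir, roughly like this (the directory names are made up):
~/utils/Bismark/bismark2bedGraph -o s4-cytocine_ampMem --ample_memory --dir test1_ample_memory s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt
~/utils/Bismark/bismark2bedGraph -o s4-cytocine_buff50 --buffer_size 50% --dir test2_buffer_size s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt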
Some feedback for future releases (let me know if you want these moved to an independent thread):
It would be useful if any temp file generated by Bismark were also marked with some random number/string for uniqueness. This would help avoid conflicts while running concurrent tests on the same data. I understand that there is an option -o to send output to a different directory, but I didn't think of using it as the output files of my two tests were named differently.
Timestamps in the various logs: I'd suggest appending a timestamp to all the important portions of the logs. This would help in knowing what happens when, and also the overall time of the process. Also, I don't understand why the logs are sent to the STDERR channel; is this a design choice?
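(As a workaround in the meantime, I suppose timestamps could be added externally, e.g. by piping everything through the moreutils ts utility, assuming it is installed; just a sketch:)
~/utils/Bismark/bismark2bedGraph -o s4-cytocine --dir cytosine_reports s4_trimmed_bismarkNew/CpG_context_S4_L1_R1_001_B1WLr2YqTKWu_val_1_bismark_bt2_pe.txt 2>&1 | ts '[%Y-%m-%d %H:%M:%S]' | tee bismark2bedGraph.timestamped.log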
Parsing and checking the parameters before the actual run: While working with v0.16, I once gave a wrong parameter to Bismark. The process went ahead with converting C->T etc. of the reads; only later did it exit with the error that there was an incorrect parameter. But this cost a lot of time doing the conversion. BTW, is there a way to save the converted reads for the future, and start the process from there directly? This would help in faster debugging of any downstream errors.
Harmonize the option names across the various Bismark tools, especially for output file/directory names. Currently, there are similar, but not exactly the same, option names across tools. For example:
Bismark -o/--output_dir <dir> : Write all output files into this directory. By default the output files will be written into the same folder as the input file. If the specified folder does not exist, Bismark will attempt to create it first. The path to the output folder can be either relative or absolute.
meth_extractor -o/--output DIR : Allows specification of a different output directory (absolute or relative path). If not specified explicitly, the output will be written to the current directory.
bismark2bedGraph -o/--output : Name of the output file, mandatory. --dir : Output directory. Output is written to the current directory if not specified explicitly.
Updates: I ran three cleaner tests in separate directories.
Test1: using --ample_memory
Test2: using --buffer_size 50%
Test3: using both --ample_memory and --buffer_size 50%
Test1 and Test3 finished in almost equal times. I was expecting this, as turning on --ample_memory will not use --buffer_size anyway (no Unix sorting), but I just wanted to be sure, hence this test. Comparing the files of Test1 from this run vs. the previous one, they are exactly the same. So I assume that it also ran fine last time, although it took twice as long in the previous instance as it did now!
Test2 is still running....
Test1
real 325m4.278s
user 45m59.528s
sys 1m31.593s
Test3
real 325m12.389s
user 45m48.198s
sys 1m31.459s
Hi sansense,
Is there any update for the --buffer_size 50% option? I was thinking maybe it has to do with the % value you specified? I always use --buffer_size 10G over here.
Thanks for your general comments above. I know that some of them could be helpful for certain testing scenarios (e.g. the random unique portion of temp filenames, or the timestamps), but they won't have any immediate effect for actual day-to-day alignment jobs. I will keep them in mind though if I have a bit of spare time available.
Regarding testing the parameters before the run starts: I have started to make an effort to test most things before the run actually commences to avoid wasting time, so can you remember what it actually was that caused this? I'd be happy to add ahead-of-run checking if that is possible at all. Regarding starting a run from the already converted files: there is currently no option to do this, and in the vast majority of cases you will never need to go back to the mapped file a second time. So while this seems like a useful feature, I would argue that it also falls into the category of nice things to have which would cost me quite some time to implement (especially considering all the different alignment modes with or without the --parallel option...), but eventually it would probably only ever be used in a handful of special testing cases...
And lastly, regarding the harmonisation of options: while I have already tried to harmonise the options as much as possible, I realise that there are still some cases (e.g. the output directory) that are at odds. As you can probably imagine this has mostly historic reasons, and if I were to design a suite of tools from scratch again I would almost certainly think of a unified set of options/commands to start with. We have tried to keep the options between Bismark and the methylation extractor the same at least, and under normal circumstances there is no need for users to run bismark2bedGraph or coverage2cytosine individually, as they are integrated into the methylation extractor already, so again it would only ever affect a very small number of people...
Hi Felix, the time for --buffer_size 50% was:
real 427m6.137s
user 152m55.952s
So the Linux sort command takes ~1h 30min longer for the same data set.
Is the output equivalent?
Unix cmp says that they are exactly the same.
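(A comparison along these lines would do it; the output file names here are only illustrative of the -o prefixes above, and for gzipped output it is safer to compare the decompressed streams:)
zcmp test1_ample_memory/s4-cytocine_ampMem.gz test2_buffer_size/s4-cytocine_buff50.gz && echo identical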
That's excellent, so it's all working then. Shall we close the issue?
My question was about the extraction time, which seems to be pretty slow. So a full extraction (CpG + non-CpG) will take almost half a day. Is that usual?
Technically, the extraction process (which comes first) always extracts CpG and non-CG context. This process does many billions of operations, but can be sped up using --parallel.
The sorting process which comes afterwards can take a while since you need to keep track of tens of millions (CpG) or even billions (all Cs) of C positions at the same time. A couple of hours or up to a day are quite common indeed. This could probably be optimised by clever parallel sorting, but given that you only do this process once and then never touch it again we think it's worth waiting 2 hours extra... (especially since sample prep to alignment most likely took several weeks or months already...)
Hi Team,
I am running Bismark with the command below. I can see it's really slow.
bismark --bowtie2 -N 1 --parallel 4 $RESOURCES2/HG38/ -1 $Read1 -2 $Read2 --output_dir $PWD --temp_dir $PWD/$SAMPLEID"_TEMP" --prefix $SAMPLEID
I have 218,289,382 reads in total, i.e. 32.96 Gb of data. I did pre-processing using Trimmomatic, and the filtered reads are used for this. It took ~22 h to complete. Can you please help with this?
Bismark Version: v0.24.0
I am trying to map paired-end reads from an amplicon sequencing experiment using Bismark. The command line is:
This is running extremely slowly, at around 50,000 reads / hour / thread (an estimate from the BAM file). I've checked the mapping on three different machines with different specs using: 1) a single thread, 2) 4 threads and 3) 10 threads. The resulting speed is similar in all cases.
I had run the same mapping in single-end mode in the past (for R1 and R2 separately) and it ran at a reasonable speed. I don't remember the exact speed, but at least it finished in 1-2 days, whereas I'm still waiting for the results of this mapping after 3 days (single thread) or 1 day (10 threads).
Next, I did a quick check to see the difference in time by using the first 1000 reads of the same sample (as above):
single-end
paired-end
This suggests that paired-end mapping is running at least 10 times slower! Has anyone faced similar issues? How can I make the mapping faster?
Finer details: My reference sequences (~250 in total) are almost identical (apart from a difference of around 30-40 bp in the middle of the sequences), each around 800bp long. Reads are 150bp and untrimmed.
PS: Cross posted on Biostars https://www.biostars.org/p/252276/