FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Speed up merge_individual_BAM_files #707

Open MilosCRF opened 2 days ago

MilosCRF commented 2 days ago

Hi Felix,

I’m currently running Bismark on AWS using 192 CPUs. Everything is running at lightning speed except for merging the temporary BAM files. It seems that the merging process is not utilizing multiple cores with the current bismark settings.

Do you know of any way to speed up this step? Perhaps by adding -@ $num_threads to the samtools command that is likely handling the merging?

FelixKrueger commented 1 day ago

samtools cat does have a -@ flag, but I am not sure whether this could at any point break the order of Read1/Read2 following each other directly (which would break downstream processing). Would you be able to find the command that merges the BAM files and try adding increasing values of -@?
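
One way to test the ordering concern is to scan the merged output as SAM text and confirm that records still arrive in Read1/Read2 pairs. The following is only a minimal sketch: the function name mates_are_adjacent is hypothetical, and it assumes that both mates of a pair carry the same read name (QNAME), optionally with a trailing /1 or /2. That assumption would need to be checked against the read names Bismark actually writes.

```python
def mates_are_adjacent(sam_lines):
    """Return True if alignment records come in consecutive same-name pairs.

    Assumption (not confirmed in this thread): both mates share a QNAME,
    optionally suffixed with /1 or /2.
    """
    names = []
    for line in sam_lines:
        # Header lines start with '@'; a QNAME cannot start with '@'
        # according to the SAM specification.
        if line.startswith("@") or not line.strip():
            continue
        qname = line.split("\t", 1)[0]
        if qname.endswith(("/1", "/2")):
            qname = qname[:-2]
        names.append(qname)

    # An odd number of records means at least one read lost its mate.
    if len(names) % 2 != 0:
        return False

    # Records 0+1, 2+3, 4+5, ... must each belong to the same read pair.
    return all(names[i] == names[i + 1] for i in range(0, len(names), 2))
```

The function takes any iterable of SAM lines, so it could be fed the text output of samtools view -h on the merged file, once for each -@ value being tested.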

MilosCRF commented 1 day ago

Thank you, Felix. Yes, I can definitely test it. However, I couldn't find samtools cat in Bismark's main script. Isn't the merge accomplished by samtools view at line 1427?

open (OUT,"| $samtools_path view -bSh 2>/dev/null - > ${output_dir}${merged_name}") or die "Failed to write to $merged_name: $!\n";

FelixKrueger commented 12 hours ago

I think you are right, I must have confused this with something else (e.g. deduplication of multiple files).

Could you try a few different thread counts, e.g. 2, 4, 8? We only have to be mindful of how this would affect resource allocation for e.g. nf-core workflows, but that would be a downstream problem.
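
A rough way to compare thread counts outside of Bismark is to time the same samtools view call on a SAM file with different -@ values. This is a sketch under stated assumptions, not a change to Bismark itself: the helper names and file paths are hypothetical, and the flags mirror the -bSh call quoted above. Note that -@ in samtools view sets the number of additional threads, and that any gain may be limited if the process writing into the pipe is the real bottleneck.

```python
import subprocess
import time


def build_view_command(samtools_path, threads):
    """Build a 'samtools view' call that reads SAM from stdin and writes BAM."""
    cmd = [samtools_path, "view", "-bSh"]
    if threads > 0:
        # -@ is the number of *additional* threads used by samtools view.
        cmd += ["-@", str(threads)]
    cmd.append("-")  # read from standard input
    return cmd


def time_conversion(samtools_path, sam_path, bam_path, threads):
    """Return wall-clock seconds for one SAM-to-BAM conversion (hypothetical helper)."""
    with open(sam_path, "rb") as src, open(bam_path, "wb") as dst:
        start = time.perf_counter()
        subprocess.run(
            build_view_command(samtools_path, threads),
            stdin=src,
            stdout=dst,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        return time.perf_counter() - start


def compare_thread_counts(samtools_path, sam_path, counts=(0, 2, 4, 8)):
    """Time the conversion once per thread count; paths are placeholders."""
    return {n: time_conversion(samtools_path, sam_path, f"test_{n}.bam", n)
            for n in counts}
```

Calling compare_thread_counts("samtools", "example.sam") would return a dictionary mapping each thread count to its elapsed time, which gives a first impression of whether the compression step scales before anything is changed in the main script.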