BIMSBbioinfo / pigx_bsseq

bisulfite sequencing pipeline from fastq to methylation reports
https://bioinformatics.mdc-berlin.de/pigx/
GNU General Public License v3.0

Include thread settings for bowtie2 instances #147

Closed: alexg9010 closed this issue 1 year ago

alexg9010 commented 5 years ago

I recently ran the pipeline on a large dataset with 28 samples, and during this run I made some observations:

When using N multiple cores (cores set in the settings file, --multicore set in the Bismark call), we are producing parallel instances of Bismark: Bismark takes the original (gzipped) fastq file, splits it into N uncompressed fastq subsets and then runs the Bismark workflow on each subset, where 2/4 instances (directional/non-directional mode) of Bowtie2 are run, each with 1 thread. This way we might be using fewer resources (cores/memory) per Bowtie2 instance, but we are producing a lot of IO on the disk, which in general is slower than keeping data in memory.

See multicore setting in Bismark: https://github.com/FelixKrueger/Bismark/blob/master/docs/options/alignment.md#alignment

What we could do instead, or in addition, is to adjust the number of threads per Bowtie2 instance via the -p argument to Bismark and decrease the number of parallel Bismark instances (--multicore). This would increase the resource usage per instance, but should decrease the IO load (a sketch of the two calls follows below the link).

https://github.com/FelixKrueger/Bismark/blob/master/docs/options/alignment.md#parallelization-options
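
To make the comparison concrete, here is a minimal sketch of the two invocation styles; the genome and fastq paths are placeholders, and the actual call assembled by the pipeline may include further options:

# current behaviour: 3 parallel Bismark instances, each running bowtie2 with a single thread
bismark --multicore 3 -N 0 -L 20 -o bismark_out/ /path/to/genome_dir -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz

# proposed: a single Bismark instance, with 8 threads per bowtie2 instance
bismark -p 8 -N 0 -L 20 -o bismark_out/ /path/to/genome_dir -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz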

alexg9010 commented 5 years ago

Here is some data collected from the recent run with Bismark --multicore 3 (it took ~4 days).

input:

temp files during mapping:

--> up to ~660 GB per sample

Considering we would only run one instance of Bismark instead of three, we could probably reduce the temporary files to ~220 GB per sample (roughly 660 GB / 3). I am not sure how this would affect the runtime, but since we would have to write to the disk fewer times, I would not expect a large slowdown.
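
As a rough way to track how large the temporary files actually get, one could poll the size of the Bismark working directory while a sample is mapping; the path below is a placeholder for wherever the pipeline writes the Bismark temp files:

# print a timestamp and the current size of the Bismark working directory every 10 minutes
while true; do date; du -sh /path/to/bismark_workdir; sleep 600; done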

alexg9010 commented 5 years ago

I will test my idea by setting the args in the settings file and then report whether this helps.

alexg9010 commented 5 years ago

RP6_bismark_pe_mapping.multicore3.log RP6_bismark_pe_mapping.parallel8.log

--multicore 3
=============

bismark instances:  3
bismark threads:    3 * 1 = 3
bowtie instances:   3 * 2 = 6
bowtie threads:     6 * 1 = 6
samtools threads:   6 * 1 = 6
gzip threads:       6 * 1 = 6

total threads:      3 + 6 + 6 + 6 = 21
runtime:            4d 11h 32m 19s

approx. resources:
------------------
hulk memory usage during the run (28 samples plus other users): ~700 GB
disk: ~14 TB/day, i.e. 14/24 ≈ 0.58 TB ≈ 600 GB per hour

-p 8
=============

bismark instances:  1
bismark threads:    1
bowtie instances:   1 * 2 = 2
bowtie threads:     2 * 8 = 16
samtools threads:   2 * 1 = 2
gzip threads:       2 * 1 = 2

total threads:      1 + 16 + 2 + 2 = 21
runtime:            1d 6h 2m 42s

approx. resources:
------------------
hulk memory usage during the run (1 sample plus other users): ~200 GB
disk: ~14 TB/day, i.e. 14/24 ≈ 0.58 TB ≈ 600 GB per hour

fortune9 commented 1 year ago

Any update on this issue? I also ran into slow Bismark runs on a single sample, and each bowtie2 instance can use 30 GB of memory. I need some clues to fix this.

alexg9010 commented 1 year ago

Hi @fortune9,

As you can see from my previous comment, in our case using fewer parallel instances (--multicore 3 -> 1) and instead dedicating more threads to bowtie2 (-p 1 -> 8) led to a ~3.5x increase in processing speed without a major increase in memory usage.

In case you are using the pigx-bsseq pipeline, this would correspond to changing the bismark arguments in the tools section from the default (--multicore 3 -N 0 -L 20):

tools:
  bismark:
    args: " -N 0 -L 20 "
    cores: 3

to this (--multicore 1 -p 8 -N 0 -L 20):

tools:
  bismark:
    args: " -p 8 -N 0 -L 20 "
    cores: 1
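
As a quick sanity check that the new settings take effect, you could count the bowtie2 threads on the node while a sample is mapping; this assumes you are on the same host and that no other bowtie2 jobs are running there:

# list all threads, keep the bowtie2 ones, and count them
# expected: ~16 threads with -p 8 (2 instances * 8 threads), vs ~6 with --multicore 3
ps -eLf | grep [b]owtie2 | wc -l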