FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

bismark aligning comma-separated list of fastq files stops after first sample finished #637

Closed chuddy-ibk closed 7 months ago

chuddy-ibk commented 7 months ago

Dear colleagues,

I am re-running a WGBS pipeline to see how well it can be replicated with the partial code I have.

I am not stuck, but I want the code to be more efficient, so that it continues with the next sample (pair) on its own instead of waiting for me to start it after each sample has been aligned. So I use the following script:

    echo "Bismark aligning"
    input_files_1=""
    input_files_2=""
    for file in fastq/trim/*_R1_001_val_1.fq.gz; do
        input_files_1+="${file},"
    done
    for file in fastq/trim/*_R2_001_val_2.fq.gz; do
        input_files_2+="${file},"
    done
    input_files_1=${input_files_1%,}  # Remove the trailing comma
    input_files_2=${input_files_2%,}

    bismark --genome ~/bioinformatics/ref_genomes/mouse_38/genome \
        -1 "${input_files_1}" -2 "${input_files_2}" \
        -o BAM/prededuplicate/ --temp_dir BAM/ \
        --parallel 3 -q --score_min L,0,-0.2 --maxins 500

The input_files_1 variable then holds the following file names (comma-separated, as requested in bismark --help):

    fastq/trim/Ctrl-1_R1_001_val_1.fq.gz,fastq/trim/Ctrl-2_R1_001_val_1.fq.gz,fastq/trim/F1-1_R1_001_val_1.fq.gz,fastq/trim/F1-2_R1_001_val_1.fq.gz
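As a side note, the comma-separated lists can also be built from a bash array, which avoids having to strip a trailing comma. This is only a minimal sketch: the helper names `join_by_comma` and `comma_list` are made up for illustration and are not part of Bismark.

```shell
#!/usr/bin/env bash

# Join all arguments with commas (no trailing comma to strip afterwards).
join_by_comma() {
    local IFS=,
    printf '%s\n' "$*"
}

# Print a comma-separated list of the files in directory $1 whose names end in $2.
# Returns non-zero if nothing matches, so an empty list is never passed on.
comma_list() {
    local dir=$1 suffix=$2 f
    local files=()
    for f in "$dir"/*"$suffix"; do
        if [ -e "$f" ]; then
            files+=("$f")
        fi
    done
    if [ "${#files[@]}" -eq 0 ]; then
        echo "no files matching *$suffix in $dir" >&2
        return 1
    fi
    join_by_comma "${files[@]}"
}

# Possible usage with the file names from this thread:
#   input_files_1=$(comma_list fastq/trim _R1_001_val_1.fq.gz) || exit 1
#   input_files_2=$(comma_list fastq/trim _R2_001_val_2.fq.gz) || exit 1
```

Both lists come from the same sorted glob, so R1 and R2 files line up as long as every sample has both mates.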

According to the first lines of output after starting the alignment, everything seems to be fine, as all FASTQ files were detected:

    Input files to be analysed (in current folder '/home/chuddy/bioinformatics/lamarck-project'):
    fastq/trim/Ctrl-1_R1_001_val_1.fq.gz
    fastq/trim/Ctrl-1_R2_001_val_2.fq.gz
    fastq/trim/Ctrl-2_R1_001_val_1.fq.gz
    fastq/trim/Ctrl-2_R2_001_val_2.fq.gz
    fastq/trim/F1-1_R1_001_val_1.fq.gz
    fastq/trim/F1-1_R2_001_val_2.fq.gz
    fastq/trim/F1-2_R1_001_val_1.fq.gz
    fastq/trim/F1-2_R2_001_val_2.fq.gz
    Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)

After 887 minutes of running time, I received a BAM file for the first sample only. It looked okay, also with regard to the detection of C in CpG context, etc.

What did I do wrong? Normally, the alignment of the second sample should start immediately after the first has finished. Also, since 887 minutes is a long time, I wonder how I can speed things up. I have difficulty estimating what my mobile workstation is capable of. I used --parallel 3 to be on the safe side, although I have 24 CPUs and approx. 62 GB of memory. I am working with the mouse genome (mm10, from Ensembl).

Bismark version: v0.24.1
Bowtie 2 version: 2.5.1

If anything else is needed to help me, please tell me and I will happily provide it.

Best,
Tom

FelixKrueger commented 7 months ago

Hi Tom,

Updating to v0.24.2 (https://github.com/FelixKrueger/Bismark/releases/tag/v0.24.2) should fix the issue that the run stops after the first set of files (there was an exit 0 in the wrong scope...).
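For anyone who cannot update right away, one possible workaround is to call Bismark once per read pair from a shell loop instead of passing comma-separated lists. The following is only a sketch, assuming the file naming used in this thread. The function name `align_each_pair` and its optional fourth argument (a stand-in command for dry runs) are made up for illustration and are not part of Bismark.

```shell
#!/usr/bin/env bash

# Run one alignment per R1/R2 pair found in a directory.
#   $1  directory with the trimmed FASTQ files
#   $2  genome folder
#   $3  output directory
#   $4  command to run (defaults to bismark; pass "echo" for a dry run)
align_each_pair() {
    local trim_dir=$1 genome=$2 out_dir=$3 aligner=${4:-bismark}
    local r1 r2
    for r1 in "$trim_dir"/*_R1_001_val_1.fq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        # Derive the mate's name by swapping the R1 suffix for the R2 suffix.
        r2="${r1%_R1_001_val_1.fq.gz}_R2_001_val_2.fq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            return 1
        fi
        # Same options as in the original command, but one pair per call.
        "$aligner" --genome "$genome" -1 "$r1" -2 "$r2" \
            -o "$out_dir" --temp_dir BAM/ \
            --parallel 3 -q --score_min L,0,-0.2 --maxins 500 || return 1
    done
}

# Dry run that only prints the commands:
#   align_each_pair fastq/trim ~/bioinformatics/ref_genomes/mouse_38/genome BAM/prededuplicate/ echo
```

The loop stops at the first failed alignment or missing mate, so a problem with one sample does not go unnoticed while the remaining samples run.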

I suppose you might get away with --parallel 4 on that machine, but I would monitor closely whether some of the alignment threads run OOM. Good luck!
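To make the trade-off behind a --parallel value a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes that each parallel instance needs a fixed amount of memory and a fixed number of cores. Both per-instance figures below are placeholders rather than measured or documented values, and the function name `suggest_parallel` is made up for illustration.

```shell
#!/usr/bin/env bash

# Print the largest instance count that fits both the memory and the core budget.
# All four inputs are whole numbers supplied by the caller; nothing is measured here.
suggest_parallel() {
    local total_mem_gb=$1 mem_per_instance_gb=$2
    local total_cores=$3 cores_per_instance=$4
    local by_mem=$(( total_mem_gb / mem_per_instance_gb ))
    local by_cores=$(( total_cores / cores_per_instance ))
    if [ "$by_mem" -lt "$by_cores" ]; then
        echo "$by_mem"
    else
        echo "$by_cores"
    fi
}

# 62 GB and 24 CPUs come from the post above; 14 GB and 5 cores per instance
# are placeholder assumptions, to be replaced with figures seen on a test run.
suggest_parallel 62 14 24 5   # prints 4 with these placeholder inputs
```

The result is only as good as the per-instance figures, so watching actual memory use during a short test run (for example with top or free) remains the safer guide.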