Open jnmaloof opened 2 months ago
And even worse, if I change the number of threads to 1 I get very different results:
~/Assignments/assignment-09-jnmaloof/input/Brapa_fastq$ for i in $(seq 1 10); do echo "run $i"; fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --thread 1 --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)" ; done
run 1
total bases: 95838782
reads failed due to too short: 18579
run 2
total bases: 95252057
reads failed due to too short: 21209
run 3
total bases: 95333601
reads failed due to too short: 20843
run 4
total bases: 95233893
reads failed due to too short: 21348
run 5
total bases: 95322517
reads failed due to too short: 20845
run 6
total bases: 95282512
reads failed due to too short: 21062
run 7
total bases: 95392059
reads failed due to too short: 20542
run 8
total bases: 95425813
reads failed due to too short: 20426
run 9
total bases: 95422163
reads failed due to too short: 20373
run 10
total bases: 95362468
reads failed due to too short: 20601
Examining thread dependency a bit more:
for t in {1..8}
do
  for i in {1..2}
  do
    echo "threads $t rep $i"
    fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --thread $t --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)"
  done
done
threads 1 rep 1
total bases: 95469777
reads failed due to too short: 20233
threads 1 rep 2
total bases: 95242300
reads failed due to too short: 21336
threads 2 rep 1
total bases: 97828605
reads failed due to too short: 9742
threads 2 rep 2
total bases: 97857064
reads failed due to too short: 9565
threads 3 rep 1
total bases: 99872436
reads failed due to too short: 581
threads 3 rep 2
total bases: 99899052
reads failed due to too short: 451
threads 4 rep 1
total bases: 99939418
reads failed due to too short: 284
threads 4 rep 2
total bases: 99930726
reads failed due to too short: 318
threads 5 rep 1
total bases: 99926647
reads failed due to too short: 338
threads 5 rep 2
total bases: 99926602
reads failed due to too short: 345
threads 6 rep 1
total bases: 99923806
reads failed due to too short: 346
threads 6 rep 2
total bases: 99917299
reads failed due to too short: 384
threads 7 rep 1
total bases: 99894695
reads failed due to too short: 490
threads 7 rep 2
total bases: 99904110
reads failed due to too short: 441
threads 8 rep 1
total bases: 99888522
reads failed due to too short: 510
threads 8 rep 2
total bases: 99892127
reads failed due to too short: 514
If I revert to version 0.20.1, the results are reproducible (and differ from every result above: 42341 reads are removed for being too short). That is on par with what Trimmomatic returns and is probably the correct result.
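For anyone who wants to script the repeated-run check rather than eyeball the numbers, here is a minimal sketch. The function names `too_short_count` and `all_identical` are made up for illustration; the only thing taken from fastp is the "reads failed due to too short" message text shown in the output above.

```shell
# Hypothetical helpers (not part of fastp) for checking run-to-run reproducibility.

# Print the number from a "reads failed due to too short: N" line read on stdin.
too_short_count() {
    sed -n 's/.*reads failed due to too short: \([0-9][0-9]*\).*/\1/p'
}

# Succeed only if every line on stdin equals the first line.
all_identical() {
    awk 'NR == 1 { first = $0 } $0 != first { exit 1 }'
}

# Possible usage, reusing the command from this thread:
#   for i in $(seq 1 10); do
#     fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz \
#       --length_required 50 -q 20 -u 70 --phred64 --thread 1 --cut_right \
#       --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null \
#       | too_short_count
#   done | all_identical && echo "reproducible" || echo "NOT reproducible"
```

This only compares one summary statistic, so identical counts do not prove that the trimmed reads themselves are identical.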
I stumbled upon the same problem with version 0.23.4, but found that the results are reproducible when I run using 1 thread. It's interesting that version 0.23.0 claims to have fixed the reproducibility problem...
Could you please give me a piece of sample data, along with the command?
My test dataset is too large to share, but here is the exact fastp call I used:
fastp -i SRR13921546_sub_1.fastq.gz -I SRR13921546_sub_2.fastq.gz -o SRR13921546_filter_1.fastq.gz -O SRR13921546_filter_2.fastq.gz -j SRR13921546_filter.json -w 1 --dedup
I'm running in a modified version of this Docker container, which is based on Ubuntu: "mambaorg/micromamba:1.5.8-jammy"
fastp was installed with micromamba:
RUN micromamba create -q -y -c conda-forge -c bioconda -n fastp fastp=0.23.4 && micromamba clean --all -y
I used diff to compare the .json file from multiple runs. Much of it is identical, but not entirely. I think I tried removing the dedup and that did not solve it. Only setting to 1 thread fixed it.
Honestly, fastp is so fast that 1 thread is still usable. Love the program and thanks for following up!
EDIT: If you really want to recreate my test set, you can download the sequencing from SRR13921546 and then take the first million reads.
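Besides diffing the .json reports, it may help to compare the filtered FASTQ files themselves. A rough sketch follows, assuming plain 4-line FASTQ records and standard gzip, paste, sort and sha256sum tools; the function names are hypothetical. It hashes the records after sorting them, so two runs that emit the same reads in a different order still count as equal, while a real difference in trimming or filtering does not.

```shell
# Hypothetical helpers (not part of fastp) for comparing outputs of two runs.

# Print a hash of the records in a gzipped FASTQ file, independent of record order.
# Assumes each record is exactly 4 lines (no wrapped sequence lines).
fastq_content_hash() {
    gzip -dc "$1" | paste - - - - | LC_ALL=C sort | sha256sum | cut -d' ' -f1
}

# Succeed if two gzipped FASTQ files contain the same set of records.
same_fastq_content() {
    [ "$(fastq_content_hash "$1")" = "$(fastq_content_hash "$2")" ]
}

# Possible usage (file names are placeholders for the outputs of two runs):
#   same_fastq_content run1.fastq.gz run2.fastq.gz && echo "same reads" || echo "different reads"
```

Comparing decompressed content rather than the .gz files avoids false alarms from compression differences that do not change the reads.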
You can download my sample dataset here:
https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/GH.lane67.fastq.gz
My commands are in my earlier posts in this thread.
Version 0.23.4
I get a different number of reads and bases after filtering every time I run fastp