OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

Not reproducible #562

Open jnmaloof opened 2 months ago

jnmaloof commented 2 months ago

Version 0.23.4

I get a different number of reads and bases after filtering every time I run fastp:

(base) exouser@julin-2:$ for i in $(seq 1 10); do echo "run $i";  fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)" ; done
run 1
total bases: 99875000
reads failed due to too short: 593
run 2
total bases: 99844193
reads failed due to too short: 705
run 3
total bases: 99868071
reads failed due to too short: 595
run 4
total bases: 99864238
reads failed due to too short: 608
run 5
total bases: 99862997
reads failed due to too short: 625
run 6
total bases: 99900970
reads failed due to too short: 451
run 7
total bases: 99894738
reads failed due to too short: 454
run 8
total bases: 99879032
reads failed due to too short: 538
run 9
total bases: 99844285
reads failed due to too short: 711
run 10
total bases: 99875189
reads failed due to too short: 575
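
To make the run-to-run variation easier to read, the grep output can be condensed into one line. The following is a minimal sketch, not part of fastp: `summarize_too_short` is a hypothetical helper name, and it assumes the log lines have exactly the form shown above ("reads failed due to too short: N"). For the ten runs above, the count ranges from 451 to 711.

```shell
# Hypothetical helper: summarize the "too short" counts from repeated runs.
# Reads the filtered log lines on stdin and prints the number of runs,
# the minimum, the maximum, and their difference.
summarize_too_short() {
    awk -F': ' '
        /reads failed due to too short/ {
            n = $2 + 0                      # numeric value after the colon
            if (count == 0 || n < min) min = n
            if (count == 0 || n > max) max = n
            count++
        }
        END {
            if (count > 0)
                printf "runs=%d min=%d max=%d spread=%d\n", count, min, max, max - min
        }
    '
}
```

Piping the output of the loop above into `summarize_too_short` gives a single summary line. A spread of 0 would indicate that this particular counter is stable across runs.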

jnmaloof commented 2 months ago

And even worse, if I change the number of threads to 1 I get very different results:

~/Assignments/assignment-09-jnmaloof/input/Brapa_fastq$ for i in $(seq 1 10); do echo "run $i";  fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --thread 1 --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)" ; done
run 1
total bases: 95838782
reads failed due to too short: 18579
run 2
total bases: 95252057
reads failed due to too short: 21209
run 3
total bases: 95333601
reads failed due to too short: 20843
run 4
total bases: 95233893
reads failed due to too short: 21348
run 5
total bases: 95322517
reads failed due to too short: 20845
run 6
total bases: 95282512
reads failed due to too short: 21062
run 7
total bases: 95392059
reads failed due to too short: 20542
run 8
total bases: 95425813
reads failed due to too short: 20426
run 9
total bases: 95422163
reads failed due to too short: 20373
run 10
total bases: 95362468
reads failed due to too short: 20601

jnmaloof commented 2 months ago

Examining thread dependency a bit more:

for t in {1..8}; do
    for i in {1..2}; do
        echo "threads $t rep $i"
        fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --thread "$t" --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)"
    done
done
threads 1 rep 1
total bases: 95469777
reads failed due to too short: 20233
threads 1 rep 2
total bases: 95242300
reads failed due to too short: 21336
threads 2 rep 1
total bases: 97828605
reads failed due to too short: 9742
threads 2 rep 2
total bases: 97857064
reads failed due to too short: 9565
threads 3 rep 1
total bases: 99872436
reads failed due to too short: 581
threads 3 rep 2
total bases: 99899052
reads failed due to too short: 451
threads 4 rep 1
total bases: 99939418
reads failed due to too short: 284
threads 4 rep 2
total bases: 99930726
reads failed due to too short: 318
threads 5 rep 1
total bases: 99926647
reads failed due to too short: 338
threads 5 rep 2
total bases: 99926602
reads failed due to too short: 345
threads 6 rep 1
total bases: 99923806
reads failed due to too short: 346
threads 6 rep 2
total bases: 99917299
reads failed due to too short: 384
threads 7 rep 1
total bases: 99894695
reads failed due to too short: 490
threads 7 rep 2
total bases: 99904110
reads failed due to too short: 441
threads 8 rep 1
total bases: 99888522
reads failed due to too short: 510
threads 8 rep 2
total bases: 99892127
reads failed due to too short: 514

jnmaloof commented 2 months ago

If I revert to version 0.20.1 then things are reproducible (and give me a different result than any of those above: 42341 reads are removed for being too short). That is on par with what Trimmomatic returns and is probably the correct result.
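
A stricter check than comparing the summary counters is to compare the trimmed output itself across runs. The sketch below is one way to do that, under stated assumptions: `content_checksum` and `all_identical` are hypothetical helper names, and each run is assumed to have been written to its own output file. It checksums the decompressed data, so differences in gzip metadata do not count as differences in content.

```shell
# Hypothetical helpers for comparing fastp outputs from repeated runs.

# Print a checksum (and byte count) of the decompressed contents of a gzip file.
content_checksum() {
    gzip -dc "$1" | cksum
}

# Return success (0) if every given gzip file has the same decompressed
# contents as the first one, and failure (1) otherwise.
all_identical() {
    first=$(content_checksum "$1")
    shift
    for f in "$@"; do
        [ "$(content_checksum "$f")" = "$first" ] || return 1
    done
    return 0
}
```

For example, after two runs written to the placeholder files `run1.fastq.gz` and `run2.fastq.gz`, `all_identical run1.fastq.gz run2.fastq.gz && echo identical` reports whether the outputs match. The comparison is order-sensitive: files with the same records in a different order are reported as different.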

peter-kanvas commented 1 month ago

I stumbled upon the same problem with version 0.23.4, but found that the results are reproducible when I run using 1 thread. It's interesting that version 0.23.0 claims to have fixed the reproducibility problem...

sfchen commented 2 weeks ago

Could you please give me a piece of sample data, along with the command?

peter-kanvas commented 2 weeks ago

My test dataset is too large to share, but here is the exact fastp call I used:

fastp -i SRR13921546_sub_1.fastq.gz -I SRR13921546_sub_2.fastq.gz -o SRR13921546_filter_1.fastq.gz -O SRR13921546_filter_2.fastq.gz -j SRR13921546_filter.json -w 1 --dedup

I'm running in a modified version of this Docker container, which is based on Ubuntu: "mambaorg/micromamba:1.5.8-jammy"

fastp was installed with micromamba:

RUN micromamba create -q -y -c conda-forge -c bioconda -n fastp fastp=0.23.4 && micromamba clean --all -y

I used diff to compare the .json files from multiple runs. Most of the content is identical, but not all of it. I think I tried removing --dedup and that did not solve it. Only setting the thread count to 1 fixed it.

Honestly, fastp is so fast that 1 thread is still usable. Love the program and thanks for following up!

EDIT: If you really want to recreate my test set, you can download the sequencing data for SRR13921546 and then take the first million reads.
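
Taking the first million reads can be sketched as below. This is an illustration rather than the exact procedure used for the test set: `first_reads` is a hypothetical helper name, and it assumes the usual FASTQ layout of four lines per record (no wrapped sequence lines).

```shell
# Hypothetical helper: write the first N records of a gzipped FASTQ file
# to stdout, uncompressed. Assumes exactly 4 lines per FASTQ record.
first_reads() {
    n_reads=$1
    fastq_gz=$2
    gzip -dc "$fastq_gz" | head -n "$((n_reads * 4))"
}
```

A call such as `first_reads 1000000 SRR13921546_1.fastq.gz | gzip > SRR13921546_sub_1.fastq.gz` would then produce the subset, where the input file name is a placeholder for whatever the download is called. For paired-end data, the same N should be used on both files so that the mates stay in step, assuming both files list the reads in the same order.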

jnmaloof commented 4 days ago

You can download my sample dataset here:

https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/GH.lane67.fastq.gz

My commands are in my earlier posts in this thread.