OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License
1.87k stars 332 forks source link

fastp is slow #460

Open fo40225 opened 1 year ago

fo40225 commented 1 year ago

I'm evaluating replacing our pipeline from trimmomatic to fastp, but fastp is slower than trimmomatic.

fastp: 2:18.08 trimmomatic: 1:09.35

fastp command:

/usr/bin/time -v /path/to/fastp-v0.23.2/fastp --compression 2 --thread $(nproc) \
    --in1 WES-NA12878_S125_L002_R1_001.fastq.gz \
    --in2 WES-NA12878_S125_L002_R2_001.fastq.gz \
    --out1 WES-NA12878_S125_L002_R1_001.clean.fastq.gz \
    --unpaired1 WES-NA12878_S125_L002_R1_001.unpaired.fastq.gz \
    --out2 WES-NA12878_S125_L002_R2_001.clean.fastq.gz \
    --unpaired2 WES-NA12878_S125_L002_R2_001.unpaired.fastq.gz \
    --detect_adapter_for_pe \
    --dont_eval_duplication \
    --disable_trim_poly_g \
    --disable_quality_filtering \
    --disable_length_filtering \
    &>fastp.log

fastp.log

Detecting adapter sequence for read1...
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Detecting adapter sequence for read2...
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

WARNING: fastp uses up to 16 threads although you specified 96
Read1 before filtering:
total reads: 65293316
total bases: 9859290716
Q20 bases: 9578075227(97.1477%)
Q30 bases: 9139319600(92.6975%)

Read2 before filtering:
total reads: 65293316
total bases: 9859290716
Q20 bases: 9590024704(97.2689%)
Q30 bases: 9131923230(92.6225%)

Read1 after filtering:
total reads: 65293298
total bases: 9021103280
Q20 bases: 8806066506(97.6163%)
Q30 bases: 8429071860(93.4373%)

Read2 after filtering:
total reads: 65293298
total bases: 9023606648
Q20 bases: 8796134635(97.4791%)
Q30 bases: 8395517424(93.0395%)

Filtering result:
reads passed filter: 130586596
reads failed due to low quality: 0
reads failed due to too many N: 0
reads with adapter trimmed: 42098347
bases trimmed due to adapters: 1673866640

Insert size peak (evaluated by paired-end reads): 131

JSON report: fastp.json
HTML report: fastp.html

/path/to/fastp-v0.23.2/fastp --compression 2 --thread 96 --in1 WES-NA12878_S125_L002_R1.fastq.gz --in2 WES-NA12878_S125_L002_R2.fastq.gz --out1 WES-NA12878_S125_L002_R1.clean.fastq.gz --unpaired1 WES-NA12878_S125_L002_R1.unpaired.fastq.gz --out2 WES-NA12878_S125_L002_R2.clean.fastq.gz --unpaired2 WES-NA12878_S125_L002_R2.unpaired.fastq.gz --detect_adapter_for_pe --dont_eval_duplication --disable_trim_poly_g --disable_quality_filtering --disable_length_filtering
fastp v0.23.2, time used: 138 seconds
        Command being timed: "/path/to/fastp-v0.23.2/fastp --compression 2 --thread 96 --in1 WES-NA12878_S125_L002_R1.fastq.gz --in2 WES-NA12878_S125_L002_R2.fastq.gz --out1 WES-NA12878_S125_L002_R1.clean.fastq.gz --unpaired1 WES-NA12878_S125_L002_R1.unpaired.fastq.gz --out2 WES-NA12878_S125_L002_R2.clean.fastq.gz --unpaired2 WES-NA12878_S125_L002_R2.unpaired.fastq.gz --detect_adapter_for_pe --dont_eval_duplication --disable_trim_poly_g --disable_quality_filtering --disable_length_filtering"
        User time (seconds): 1640.37
        System time (seconds): 11.35
        Percent of CPU this job got: 1196%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:18.08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1820840
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 4
        Minor (reclaiming a frame) page faults: 1929753
        Voluntary context switches: 7396816
        Involuntary context switches: 2598
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

trimmomatic command:

export PATH=/path/to/htslib-1.16:$PATH
export PATH=/usr/lib/jvm/zulu17/bin:$PATH
export LD_LIBRARY_PATH=/path/to/zlib-cloudflare/build-avx512
export JAVA_TOOL_OPTIONS='-XX:+UnlockExperimentalVMOptions -XX:+UseParallelGC -XX:UseAVX=3'
/usr/bin/time -v java -XX:+UseNUMA -jar /path/to/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads $(nproc) -phred33 \
    <(bgzip -c -d -@ $(nproc) WES-NA12878_S125_L002_R1_001.fastq.gz) \
    <(bgzip -c -d -@ $(nproc) WES-NA12878_S125_L002_R2_001.fastq.gz) \
    >(bgzip -c -l 2 -@ $(nproc) >WES-NA12878_S125_L002_R1_001.clean.fastq.gz) \
    >(bgzip -c -l 2 -@ $(nproc) >WES-NA12878_S125_L002_R1_001.unpaired.fastq.gz) \
    >(bgzip -c -l 2 -@ $(nproc) >WES-NA12878_S125_L002_R2_001.clean.fastq.gz) \
    >(bgzip -c -l 2 -@ $(nproc) >WES-NA12878_S125_L002_R2_001.unpaired.fastq.gz) \
    ILLUMINACLIP:/path/to/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10:1:true \
    &>trimmomatic.log

trimmomatic.log

Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -XX:+UseParallelGC -XX:UseAVX=3
TrimmomaticPE: Started with arguments:
 -threads 96 -phred33 /dev/fd/63 /dev/fd/62 /dev/fd/61 /dev/fd/60 /dev/fd/59 /dev/fd/58 ILLUMINACLIP:/path/to/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10:1:true
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 65293316 Both Surviving: 65293293 (100.00%) Forward Only Surviving: 2 (0.00%) Reverse Only Surviving: 21 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
        Command being timed: "java -XX:+UseNUMA -jar /path/to/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 96 -phred33 /dev/fd/63 /dev/fd/62 /dev/fd/61 /dev/fd/60 /dev/fd/59 /dev/fd/58 ILLUMINACLIP:/path/to/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10:1:true"
        User time (seconds): 2421.43
        System time (seconds): 104.66
        Percent of CPU this job got: 3642%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09.35
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10842944
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 166
        Minor (reclaiming a frame) page faults: 11997356
        Voluntary context switches: 298764
        Involuntary context switches: 23814
        Swaps: 0
        File system inputs: 8200
        File system outputs: 480
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

OS: Ubuntu 22.04.1 LTS 5.15.0-58-generic

Hardware: 2x Intel Xeon Gold 6342 2TB RAM (16x 128GB DDR4-3200) 4x 4TB PCI-E 4.0 NVME SSD RAID 0

fastp is download from http://opengene.org/fastp/fastp.0.23.2 trimmomatic is download from http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

Input and output files are placed on ramdisk (/dev/shm). Execute commands while the system is idle.

I see WARNING: fastp uses up to 16 threads although you specified 96 in log, what part of the algorithm limits the upper limit of 16 threads?

sfchen commented 1 year ago

So far, I can't repeat this problem.

Furthermore, there were 6 bzip2 processed invoked but not counted.

Fastp sets 16-thread limit since 16-thread is always enough.

fo40225 commented 1 year ago

bgzip is running concurrently, it didn't increase wall clock time.

According to Figure 1a in the paper, the algorithm should be able to scale out.

If I have time, I will help design some experiments and make improvements.

Is this repo still accepting pull requests?