OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

Output read lengths are affected by duplicate --adapter_sequence arguments #575

Open mwhamgenomics opened 2 months ago

mwhamgenomics commented 2 months ago

I've been running fastp as part of a larger third-party pipeline (i.e. not written or maintained by me), and noticed that it was specifying adapter sequences multiple times on the command line:

    ...
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence CTGTCTCTTATACACATCT \
    ...

I tried running fastp without the duplicate arguments, expecting to get the same results:

    ...
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    ...

But I found that in some cases the output read lengths differed between the two runs: sometimes only R1 was affected, sometimes only R2, sometimes both. The adapter sequences being specified don't even appear in the FASTQs in this case, so I expected them to have no effect.

Steps to reproduce:

# GiaB test data
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R{1,2}_001.fastq.gz

# fastp 0.23.4
wget http://opengene.org/fastp/fastp.0.23.4
chmod u+x fastp.0.23.4
ln -s fastp.0.23.4 fastp

# proof that the adapter sequences are absent in the fastqs - so surely should have no effect?
for f in U0a_CGATGT_L001_R*; do echo "$f"; for a in CTGTCTCTTATACACATCT AGATGTGTATAAGAGACAG; do zcat "$f" | grep -c "$a"; done; done

# subset to a minimal example of 3 reads known to be affected
zcat U0a_CGATGT_L001_R1_001.fastq.gz | grep -E '^@HWI-D00360:5:H814YADXX:1:1101:(3756:2236|7206:2194|5147:4880)' -A 3 --no-group-separator | head -n 12 | gzip -c > minimal_r1.fastq.gz
zcat U0a_CGATGT_L001_R2_001.fastq.gz | grep -E '^@HWI-D00360:5:H814YADXX:1:1101:(3756:2236|7206:2194|5147:4880)' -A 3 --no-group-separator | head -n 12 | gzip -c > minimal_r2.fastq.gz

# run fastp with/without duplicated --adapter_sequence args
fastp -i minimal_r1.fastq.gz -I minimal_r2.fastq.gz -o r1_trimmed.fastq.gz -O r2_trimmed.fastq.gz \
    --thread 8 \
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence CTGTCTCTTATACACATCT

fastp -i minimal_r1.fastq.gz -I minimal_r2.fastq.gz -o r1_trimmed_nodup.fastq.gz -O r2_trimmed_nodup.fastq.gz \
    --thread 8 \
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG

The above example consists of three reads, which were each affected in the same way in both the minimal fastqs above and the full size ones:
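The per-read length differences between the two runs can be listed with something like the following. This is a minimal sketch, not part of fastp: the helper names `read_lengths` and `compare_lengths` are hypothetical, and it assumes the output filenames from the two commands above, standard four-line FASTQ records, and that both runs emit the same reads in the same order.

```shell
# Print "read_id<TAB>sequence_length" for every record in a gzipped FASTQ.
# Assumes standard 4-line records (header, sequence, '+', quality).
read_lengths() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { id = substr($1, 2) }
                         NR % 4 == 2 { print id "\t" length($0) }'
}

# Print "read_id<TAB>length_in_first<TAB>length_in_second" for each read
# whose length differs between the two files.
# Assumes both files contain the same reads in the same order.
compare_lengths() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    read_lengths "$1" > "$tmp_a"
    read_lengths "$2" > "$tmp_b"
    paste "$tmp_a" "$tmp_b" | awk -F '\t' '$2 != $4 { print $1 "\t" $2 "\t" $4 }'
    rm -f "$tmp_a" "$tmp_b"
}

# Compare the duplicated-argument run against the de-duplicated run,
# skipping any pair of files that is not present.
for r in r1 r2; do
    if [ -f "${r}_trimmed.fastq.gz" ] && [ -f "${r}_trimmed_nodup.fastq.gz" ]; then
        echo "== ${r} =="
        compare_lengths "${r}_trimmed.fastq.gz" "${r}_trimmed_nodup.fastq.gz"
    fi
done
```

If one run filters out a read that the other keeps, the side-by-side pairing will be misaligned, so the read counts of the two files should be checked first.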

Do you know what could be causing this? Is it an expected use-case to specify the same adapter sequence multiple times?