I've been running fastp as part of a larger third-party pipeline (i.e. not written or maintained by me), and noticed that it specifies the same adapter sequences multiple times on the command line.
I tried running fastp without the duplicate arguments, expecting to get the same results. But I found that in some cases my read lengths were now different: sometimes only r1 was affected, sometimes only r2, sometimes both. The adapter sequences being specified don't even appear in the fastqs in this case, so I expected them to have no effect.
Steps to reproduce:
# GiaB test data
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R{1,2}_001.fastq.gz
# fastp 0.23.4
wget http://opengene.org/fastp/fastp.0.23.4
chmod u+x fastp.0.23.4
ln -s fastp.0.23.4 fastp
# proof that the adapter sequences are absent in the fastqs - so surely should have no effect?
for f in U0a_CGATGT_L001_R*; do echo $f; for a in CTGTCTCTTATACACATCT AGATGTGTATAAGAGACAG; do zcat $f | grep -c $a; done; done
# subset to a minimal example of 3 reads known to be affected
zcat U0a_CGATGT_L001_R1_001.fastq.gz | grep -E '^@HWI-D00360:5:H814YADXX:1:1101:(3756:2236|7206:2194|5147:4880)' -A 3 --no-group-separator | head -n 12 | gzip -c > minimal_r1.fastq.gz
zcat U0a_CGATGT_L001_R2_001.fastq.gz | grep -E '^@HWI-D00360:5:H814YADXX:1:1101:(3756:2236|7206:2194|5147:4880)' -A 3 --no-group-separator | head -n 12 | gzip -c > minimal_r2.fastq.gz
# run fastp with/without duplicated --adapter_sequence args
fastp -i minimal_r1.fastq.gz -I minimal_r2.fastq.gz -o r1_trimmed.fastq.gz -O r2_trimmed.fastq.gz \
--thread 8 \
--adapter_sequence CTGTCTCTTATACACATCT \
--adapter_sequence AGATGTGTATAAGAGACAG \
--adapter_sequence AGATGTGTATAAGAGACAG \
--adapter_sequence CTGTCTCTTATACACATCT
fastp -i minimal_r1.fastq.gz -I minimal_r2.fastq.gz -o r1_trimmed_nodup.fastq.gz -O r2_trimmed_nodup.fastq.gz \
--thread 8 \
--adapter_sequence CTGTCTCTTATACACATCT \
--adapter_sequence AGATGTGTATAAGAGACAG
The above example consists of three reads, each of which was affected in the same way in both the minimal fastqs above and the full-size ones.
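To see exactly which reads differ between the two runs, something like the following could be used. It is only a sketch: `read_lengths` and `compare_lengths` are helper names made up for this illustration (they are not part of fastp), and it assumes plain four-line FASTQ records and the output file names used in the reproduction steps.

```shell
# Hypothetical helper: print "read_id<TAB>sequence_length" for each record
# of a gzipped FASTQ, sorted by read id.
# Assumes 4-line records (sequence not wrapped across lines).
read_lengths() {
    gzip -dc "$1" |
        awk 'NR % 4 == 1 { id = substr($1, 2) }
             NR % 4 == 2 { print id "\t" length($0) }' |
        LC_ALL=C sort -k1,1
}

# Hypothetical helper: print "read_id<TAB>len_in_first<TAB>len_in_second"
# for reads present in both files whose lengths differ.
compare_lengths() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    read_lengths "$1" > "$tmp_a"
    read_lengths "$2" > "$tmp_b"
    # Join on read id so the comparison does not depend on record order.
    LC_ALL=C join -t "$(printf '\t')" "$tmp_a" "$tmp_b" |
        awk -F '\t' '$2 != $3'
    rm -f "$tmp_a" "$tmp_b"
}

# Compare the outputs of the two fastp runs, if they exist.
for r in r1 r2; do
    if [ -f "${r}_trimmed.fastq.gz" ] && [ -f "${r}_trimmed_nodup.fastq.gz" ]; then
        echo "== ${r}: id, length with duplicates, length without duplicates =="
        compare_lengths "${r}_trimmed.fastq.gz" "${r}_trimmed_nodup.fastq.gz"
    fi
done
```

Reads that were dropped entirely in one of the two runs will not show up here, since the join only keeps ids present in both files.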
Do you know what could be causing this? Is it an expected use-case to specify the same adapter sequence multiple times?