OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License
1.91k stars 333 forks source link

Missing most reads after given r2 adapter #547

Open Flying-Doggy opened 8 months ago

Flying-Doggy commented 8 months ago

The r2 reads is specified with GTGAGTGATGGTTGAGGTAGTGTGGAG at 5'. So I use --adapter_sequence_r2 GTGAGTGATGGTTGAGGTAGTGTGGAG to identify the valid reads, and the command is shown below:

fastp \
--in1 ${sampleID}_R1.fq.gz \
--in2 ${sampleID}_R2.fq.gz \
--out1 ${sampleID}_clean_R1.fq.gz \
--out2 ${sampleID}_clean_R2.fq.gz \
--json ${sampleID}.json \
--html ${sampleID}.html \
--trim_poly_g --poly_g_min_len 10 \
--trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 \
--qualified_quality_phred 30 \
--low_complexity_filter --complexity_threshold 30 \
--thread 10 \
--adapter_sequence_r2  GTGAGTGATGGTTGAGGTAGTGTGGAG

The result suggests only 1/6 reads(~6000000, R1+R2) remains, 5/6 reads were filtered because of short length.

Filtering result:
reads passed filter: 6881134
reads failed due to low quality: 60130
reads failed due to too many N: 0
reads failed due to too short: 30160650
reads failed due to low complexity: 1218
reads with adapter trimmed: 21527581
bases trimmed due to adapters: 2374478381
reads with polyX in 3' end: 472347
bases trimmed in polyX tail: 5270559

Then, I use grep to sum the number of reads with GTGAGTGATGGTTGAGGTAGTGTGGAG at R2 head and find 17606348 R2 reads are start with valid adapter. So I think there might be other reason caused missing reads.

zcat 0116A_R2.fq.gz | grep ^GTGAGTGATGGTTGAGGTAGTGTGGAG | wc -l
17606348

To discovery which factor caused low filtered reads, I removed --adapter_sequence_r2 GTGAGTGATGGTTGAGGTAGTGTGGAG and ran fastp again. However, much more reads were acquired.

fastp \
--in1 ${sampleID}_R1.fq.gz \
--in2 ${sampleID}_R2.fq.gz \
--out1 ${sampleID}_clean_R1.fq.gz \
--out2 ${sampleID}_clean_R2.fq.gz \
--json ${sampleID}.json \
--html ${sampleID}.html \
--trim_poly_g --poly_g_min_len 10 \
--trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 \
--qualified_quality_phred 30 \
--low_complexity_filter --complexity_threshold 30 \
--thread 10 \
Filtering result:
reads passed filter: 30428586
reads failed due to low quality: 221476
reads failed due to too many N: 0
reads failed due to too short: 608
reads failed due to low complexity: 1332
reads with adapter trimmed: 5489442
bases trimmed due to adapters: 109359880
reads with polyX in 3' end: 819659
bases trimmed in polyX tail: 9294422

I want to know why many valid reads miss after adding --adapter_sequence_r2 GTGAGTGATGGTTGAGGTAGTGTGGAG . And in the output clean data, many reads still have adapter sequence.

zcat 0116A_clean_R2.fq.gz | grep ^GTGAGTGATGGTTGAGGTAGTGTGGAG | wc -l
3047290
Flying-Doggy commented 8 months ago

for example, target adapter 'GTGAGTGATGGTTGAGGTAGTGTGGAG' is located at the 5' of the E200012434L1C001R00100009337 read2. And it's expected to get trimmed reads2 like 'CGGGGTTATAGTGTGAGATTTTGTTTTAAGAATAAAAAAATTTTAAAATAAGATAATTTTATTTTTATATAAATTATTTTAGAGTATAATAAAAGGAAAATTTTTAAATTTATTATATAAAGT', but the fastp trimmed the whole E200012434L1C001R00100009337 read2. ---input---

@E200012434L1C001R00100009337/1
ACAACAAACCGAAATCGCGCCACTACACTCCAACCTAAAAAACAACGAAACTCCGTCTCAAAAAAAAAACAAAAAACAAAAAATTAAAATAAAACCAACTTTATATAATAAATTTAAAAATTTTCCTTTTATTATACTCTAAAATAATTT
+
GFGGFGGGFFGFGGGFGFGFGGFGGFGFGFGGGFGGGGGGGGFGGGGGGGGGGFGFGGGFGGGGGGGGGGGGGGGGFGGGGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGBFGGGGFGGGGGFFFGGGGGGFGCGG
**@E200012434L1C001R00100009337/2**
GTGAGTGATGGTTGAGGTAGTGTGGAGCGGGGTTATAGTGTGAGATTTTGTTTTAAGAATAAAAAAATTTTAAAATAAGATAATTTTATTTTTATATAAATTATTTTAGAGTATAATAAAAGGAAAATTTTTAAATTTATTATATAAAGT
+
FEFFFFFF?GFFFFFFFFFFFFCGFFFEGFFFFFGEGFFFFFFFF>FFFF=FGGGFGFFFFFFFFGFFCFFGFGFFFFGF>GFEFDFGFEFFGFEFFGAFFFG?FFFFFFF>G>GFFGGFGFDFFF.#E@CCFF?<FFGFFGGGFGFGFF

----fastp_output----

@E200012434L1C001R00100009337/1 paired_read_is_failing
ACAACAAACCGAAATCGCGCCACTACACTCCAACCTAAAAAACAACGAAACTCCGTCTCAAAAAAAAAACAAAAAACAAAAAATTAAAATAAAACCAACTTTATATAATAAATTTAAAAATTTTCCTTTTATTATACTCTAAAATAATTT
+
GFGGFGGGFFGFGGGFGFGFGGFGGFGFGFGGGFGGGGGGGGFGGGGGGGGGGFGFGGGFGGGGGGGGGGGGGGGGFGGGGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGBFGGGGFGGGGGFFFGGGGGGFGCGG
**@E200012434L1C001R00100009337/2 failed_too_short**

+