OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

Trimming adapter sequence with Ns #331

Open omarwagih opened 3 years ago

omarwagih commented 3 years ago

One of the FASTQ files I'm processing was generated using the NEXTflex™ Small RNA-Seq Kit for library prep, which uses an adapter sequence with Ns in it:

NNNNTGGAATTCTCGGGTGCCAAGG

I tried passing this to fastp via --adapter_sequence, but I get the error:

ERROR: the adapter <adapter_sequence> can only have bases in {A, T, C, G}

I also tried trimming the non-N version of the adapter and then trimming 4 bases off the tail, but it seems fastp trims the tail first and the adapter afterwards, so this doesn't work:

--adapter_sequence=TGGAATTCTCGGGTGCCAAGG --trim_tail1 4
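To make the ordering problem concrete, here is a minimal Python sketch. It is not fastp's code: the read, the 4 random bases, and the exact-match adapter search are all made up for illustration. It only shows why trimming the tail before the adapter leaves the 4 N bases in place.

```python
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"  # non-N part of the NEXTflex 3' adapter


def trim_tail(read: str, n: int) -> str:
    """Drop n bases from the 3' end of the read."""
    return read[:-n] if n > 0 else read


def trim_adapter(read: str, adapter: str = ADAPTER) -> str:
    """Cut the read at the first exact occurrence of the adapter (toy matcher)."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


# Hypothetical read: small RNA insert + 4 random bases + adapter + trailing bases.
insert = "ACGTACGTACGTACGTACGTAC"
random4 = "GATC"
read = insert + random4 + ADAPTER + "AAAA"

# Tail first, then adapter (the order observed): the 4 random bases survive.
tail_first = trim_adapter(trim_tail(read, 4))

# Adapter first, then tail (the order wanted): only the insert remains.
adapter_first = trim_tail(trim_adapter(read), 4)

print(tail_first)
print(adapter_first)
```

With tail trimming applied first, the 4 bases removed come from whatever follows the adapter, not from the random bases in front of it.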

Is there any way of processing this fastq file using fastp?

Thanks!

riederd commented 3 years ago

I'd also be very interested in having that feature, e.g. --adapter_sequence_r2 NNNNNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGA should remove an extra 10 bases. Any plans to implement this? BTW, Trim Galore seems to support Ns in the adapters.
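As a rough illustration of the requested behaviour, the Python sketch below treats N in the adapter as "any base" and cuts the read where the adapter starts, so the bases under the Ns are removed too. This is not how fastp or Trim Galore implement adapter matching: the function name, the `min_specific` threshold, and the example read are assumptions, and matching is exact (no mismatches tolerated).

```python
def trim_wildcard_adapter(read: str, adapter: str, min_specific: int = 8) -> str:
    """Cut `read` at the leftmost position where `adapter` matches.

    'N' in the adapter matches any base. The adapter may run off the 3' end
    of the read, but at least `min_specific` non-N adapter bases must have
    been compared and matched, otherwise a run of Ns would match anywhere.
    """
    for pos in range(len(read)):
        specific = 0
        matched = True
        for i, base in enumerate(adapter):
            j = pos + i
            if j >= len(read):
                break  # adapter overhangs the end of the read
            if base == "N":
                continue  # wildcard position
            if base != read[j]:
                matched = False
                break
            specific += 1
        if matched and specific >= min_specific:
            return read[:pos]
    return read


adapter = "NNNNNNNNNN" + "AGATCGGAAGAGCGTCGTGTAGGGAAAGA"

# Hypothetical read: insert + 10 arbitrary bridge bases + adapter.
insert = "TTCCTTCCTTCCTTCCTTCC"
bridge = "ACGTTGCAAC"
full_read = insert + bridge + "AGATCGGAAGAGCGTCGTGTAGGGAAAGA"
partial_read = insert + bridge + "AGATCGGAAG"  # adapter cut off by read length

print(trim_wildcard_adapter(full_read, adapter))
print(trim_wildcard_adapter(partial_read, adapter))
```

In both example reads the 10 bridge bases are removed together with the adapter, which is the effect the N-prefixed adapter is meant to have.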

Thanks!

sfchen commented 3 years ago

I will consider implementing this feature.

Could you let me know what the NNNNNNNNNN is designed for? UMIs, or barcodes in single-cell sequencing?

riederd commented 3 years ago

Here is my use case: Notes from the Zymo-Seq RiboFree® Total RNA Library Kit:

The Zymo-Seq RiboFree® Total RNA Library Kit employs a low-complexity bridge to ligate the Illumina® P7 adapter sequence to the library inserts. This sequence can extend up to 10 nucleotides. QC analysis software (e.g., FastQC) may raise flags such as “Per base sequence content” at the beginning of Read 2 due to this low-complexity bridge sequence.

I hope this answers your question.

Thanks

omarwagih commented 3 years ago

In my case it’s just a 3’ sequencing adapter

kubu4 commented 2 years ago

To add to what @riederd posted, here's additional info from the Zymo-Seq RiboFree suggestion:

If desired, these 10 nucleotides can be removed in addition to adapter trimming. An example using Trim Galore! for such trimming is as below:

trim_galore --paired --clip_R2 10 \
-a NNNNNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGA \
sample.R1.fastq.gz \
sample.R2.fastq.gz

Related to this, it would be great if fastp allowed for trimming an additional n bp after adapter trimming, similar to the functionality that the trim_galore command provides above.

Currently, it looks like fastp --trim_front and --trim_tail are steps 2 and 3 (respectively) in the order of operations, which come well before the adapter trimming occurs.

In the current fastp configuration, this means if one wanted to trim the adapters and then trim an additional n bp from the trimmed reads, the user would have to initiate a second round of fastp.
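The two-round workaround could look roughly like the Python sketch below, which only builds the two command lines. The file names, the helper name, and the choice to disable adapter trimming in the second round are assumptions; the flags used (`-i`, `-o`, `--adapter_sequence`, `--trim_tail1`, `--disable_adapter_trimming`) are existing fastp options, but check them against your installed version.

```python
import shlex


def two_pass_fastp_commands(in_fq, mid_fq, out_fq, adapter, extra_tail):
    """Build two fastp command lines (hypothetical helper, single-end reads).

    Pass 1 trims the adapter; pass 2 trims `extra_tail` more bases from the
    3' end of the already adapter-trimmed reads.
    """
    first = [
        "fastp", "-i", in_fq, "-o", mid_fq,
        "--adapter_sequence", adapter,
    ]
    second = [
        "fastp", "-i", mid_fq, "-o", out_fq,
        "--disable_adapter_trimming",        # adapter was removed in pass 1
        "--trim_tail1", str(extra_tail),
    ]
    return first, second


# Placeholder file names; nothing is executed here.
first, second = two_pass_fastp_commands(
    "sample.fastq.gz", "sample.pass1.fastq.gz", "sample.pass2.fastq.gz",
    adapter="TGGAATTCTCGGGTGCCAAGG", extra_tail=4,
)
for cmd in (first, second):
    print(shlex.join(cmd))
# To actually run them: subprocess.run(cmd, check=True) for each command.
```

Note that fastp's default quality and length filters would apply in both rounds unless they are switched off in one of them (for example with `--disable_quality_filtering` and `--disable_length_filtering`).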

blostein commented 10 months ago

This use case comes up for other kits as well, such as trimming an adaptase tail resulting from xGen library prep kits for which the manufacturer notes: Illumina adapter trimming must be performed before ... Adaptase tail trimming

a1ultima commented 9 months ago

This use case came up for us too.

In a case similar to the one mentioned by @blostein, our sequencing partner says:

Indexed adapter sequences

The full-length adapter sequences are below. The underlined text indicates the location of the index sequences, which are 8 bp for CDI and 10 bp for UDI. These sequences represent the adapter sequences following completion of the indexing PCR step.

Index 1 (i7) adapters: 5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCACXXXXXXXX(XX)ATCTCGTATGCCGTCTTCTGCTTG-3'

Index 2 (i5) adapters: 5'-AATGATACGGCGACCACCGAGATCTACACYYYYYYYY(YY)ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

Which leaves us puzzled as to what the ambiguous X and Y characters were.

Being able to declare wildcards or known length NNNNN+ subsequences would immediately solve our problem since:

We cannot:

lpantano commented 2 months ago

This will come in handy for some small RNA-seq protocols that have 4 Ns before the adapter. Currently, two rounds of fastp are needed to prepare the data. It would be great to have this. Just adding my +1 from the Nextflow/nf-core community.