bonsai-team / Porechop_ABI

Adapter trimmer for Oxford Nanopore reads using ab initio method
GNU General Public License v3.0

Consensus sequences #22

Open sihanbu opened 1 day ago

sihanbu commented 1 day ago

Hello,

I'm new to whole genome assembly using ONT reads. I have two questions.

Suppose the consensus sequences identified during the inference are not adapters but are important to the project (e.g., specific sequences I'm studying). In this case, how would I ensure they are not mistakenly inferred and trimmed?

I’m curious about why unknown adapters (consensus sequences) appear during ONT sequencing. Since the adapter sequences used are already known, shouldn't we be able to trim them off directly? Where do these unknown adapters originate from?

Thank you for your assistance!

Best, Sihan

qbonenfant commented 23 hours ago

Hi. About your first question: since it can be quite hard to tell common patterns apart from adapters, we added an option to exclude a list of k-mers from the counting phase. This is not perfect, but it will work fine if you need to prevent trimming of a specific sequence. Look for the "forbid_kmer" option of the configuration file.
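To illustrate the idea of excluding k-mers from the counting phase, here is a minimal Python sketch. It is not Porechop_ABI's actual implementation, and it does not show the syntax of the configuration file. The names `count_kmers`, `kmers_of`, `forbidden` and `end_size` are hypothetical, and the sequences are toy data. The sketch assumes that protecting a longer sequence means listing every k-mer it contains.

```python
from collections import Counter


def kmers_of(sequence, k):
    """Return the set of all k-mers contained in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def count_kmers(reads, k, forbidden=frozenset(), end_size=None):
    """Count k-mers in the start of each read, skipping forbidden ones.

    `end_size` limits counting to the first `end_size` bases of each read;
    None means the whole read is scanned.
    """
    counts = Counter()
    for read in reads:
        region = read if end_size is None else read[:end_size]
        for i in range(len(region) - k + 1):
            kmer = region[i:i + k]
            if kmer in forbidden:
                # Forbidden k-mers never enter the counts, so they cannot
                # contribute to an inferred consensus sequence.
                continue
            counts[kmer] += 1
    return counts


# Toy example: protect the (made-up) sequence "ACGTA" with k = 4.
reads = ["ACGTACGT", "ACGTTTTT"]
protected = kmers_of("ACGTA", 4)  # {"ACGT", "CGTA"}
counts = count_kmers(reads, 4, forbidden=protected)
```

In this sketch the protected k-mers are simply dropped before counting, so a frequent sequence of interest is never ranked as an adapter candidate. The trade-off is that a real adapter sharing those k-mers would also be partly hidden, which is one reason the option "is not perfect".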

Now, why would "unknown" adapters appear during ONT sequencing? The answer is quite simple: Oxford Nanopore Technologies does not publicly disclose the adapter sequences, or at least not outside of the ONT community, from what I have seen.

The only known database for ONT adapters when we published our paper was the original Porechop database (adapters.py) curated by Ryan Wick and other members. This database has not been maintained since 2018, so any newer adapter is basically unknown.

It seems ONT is doing this on purpose, since recent ONT basecallers (guppy, dorado, and others) are supposed to trim the reads during basecalling. Because these tools are based on neural networks, their trimming step is basically a black box for us. This makes them pretty difficult to trust, and their effectiveness is hard to evaluate without the adapter sequences to compare against.

Our study revealed (at least for guppy) that residual (known) adapter sequences can be found in public datasets processed by ONT basecallers. This is why tools such as Porechop_ABI are needed to clean datasets, or at least for quality control.
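As a rough illustration of that kind of quality control, the sketch below estimates the fraction of reads that still carry a known adapter near their start. It is not the method used in the study or in Porechop_ABI. The function names, the `search_len` and `max_mismatches` parameters, and the adapter string are all hypothetical placeholders. It also uses a plain mismatch count, which ignores the insertions and deletions common in ONT reads, so a real check would need a proper alignment.

```python
def hamming_hit(window, pattern, max_mismatches):
    """True if `window` matches `pattern` with at most `max_mismatches`."""
    mismatches = 0
    for a, b in zip(window, pattern):
        if a != b:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def has_residual_adapter(read, adapter, search_len=100, max_mismatches=2):
    """Scan the first `search_len` bases of `read` for `adapter`."""
    region = read[:search_len]
    n = len(adapter)
    return any(
        hamming_hit(region[i:i + n], adapter, max_mismatches)
        for i in range(len(region) - n + 1)
    )


def residual_fraction(reads, adapter, **kwargs):
    """Fraction of reads with a residual adapter near their start."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(has_residual_adapter(r, adapter, **kwargs) for r in reads)
    return hits / len(reads)


# Placeholder adapter and toy reads, not real ONT sequences.
adapter = "AAGGTTCC"
reads = ["TTAAGGTTCCGGGG", "GGGGGGGGGGGGGG", "AAGGTACCTTTTTT"]
fraction = residual_fraction(reads, adapter, max_mismatches=2)
```

A non-zero fraction on an already basecalled dataset would suggest that trimming left some adapter behind, which only works for adapters whose sequence is known.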

Disclaimer: I have been out of bioinformatics research for 2 years now, and even though I keep reading papers from time to time, you should take my statements with a grain of salt. ONT may have changed its policy recently (and I may be unaware of this), or maybe their basecallers are perfect now? Who knows? Not me, for sure.