ImagoXV / NanoASV

NanoASV official repo
GNU General Public License v3.0

Chimera detection #2 #27

Open ImagoXV opened 8 months ago

ImagoXV commented 8 months ago

vsearch never seems to detect chimeras with default parameters.

I think this is because the sequences are not dereplicated and therefore carry no "count" (abundance) annotation in their fasta headers. However, dereplication is unlikely to work, because vsearch expects 100% similarity, which is rarely, if ever, achieved with nanopore amplicon sequencing. Efficient dereplication would require accepting a certain variability threshold, which amounts to clustering. Such clustering with vsearch performs well with --id 0.7, which is significantly lower than what we would want to accept for dereplication. If clustering, then it's not ASV treatment anymore.
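
To illustrate the point above, here is a minimal Python sketch of exact dereplication. It is not vsearch's implementation: the function names, read names, and toy sequences are all made up for illustration. The only convention borrowed from vsearch is the `;size=N` abundance annotation in the header. With error-free reads, identical sequences collapse and get a meaningful abundance; with one substitution per read, every read stays a singleton.

```python
from collections import Counter


def dereplicate(reads):
    """Exact (100% identity) dereplication: group identical sequences and count them."""
    counts = Counter(seq.upper() for seq in reads)
    # Most abundant first; ties are broken by sequence so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_fasta(uniques):
    """Write unique sequences with a ';size=N' abundance annotation in the header."""
    lines = []
    for i, (seq, size) in enumerate(uniques, start=1):
        lines.append(f">uniq{i};size={size}")  # 'uniq' is a hypothetical label
        lines.append(seq)
    return "\n".join(lines)


# Toy data (hypothetical): three error-free copies of one amplicon plus one other.
clean_reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTGCAAGC"]
# Same amplicons, but two of the copies carry a single substitution each.
noisy_reads = ["ACGTACGT", "ACGAACGT", "ACGTACGA", "TTGCAAGC"]

clean_uniques = dereplicate(clean_reads)
noisy_uniques = dereplicate(noisy_reads)

print(to_fasta(clean_uniques))
print(to_fasta(noisy_uniques))
```

In the noisy case every header ends in `;size=1`, so an abundance-based method has nothing to rank parents and chimeras by.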

I need to discuss it with you @frederic-mahe

ImagoXV commented 8 months ago

vsearch log

Reading file /tmp/.tmp_NanoASV/FILTERED_barcode20.fastq.gz 100%
181569135 nt in 115928 seqs, min 1300, max 1700, avg 1566
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Detecting chimeras 100%
Found 0 (0.0%) chimeras, 115928 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 115928 unique sequences.
Taking abundance information into account, this corresponds to
0 (0.0%) chimeras, 115928 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 115928 total sequences.
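
The log above reports the same number of unique and total sequences, which is consistent with every read being treated as abundance 1. A small sketch to check this from a saved log, assuming the summary lines keep the wording shown above (the function name and regular expressions are illustrative, not part of vsearch):

```python
import re

# Summary lines copied from the log above.
LOG = """\
Found 0 (0.0%) chimeras, 115928 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 115928 unique sequences.
Taking abundance information into account, this corresponds to
0 (0.0%) chimeras, 115928 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 115928 total sequences.
"""


def abundance_summary(log_text):
    """Return (unique, total, mean abundance per unique sequence) from the summary."""
    unique = int(re.search(r"in (\d+) unique sequences", log_text).group(1))
    total = int(re.search(r"in (\d+) total sequences", log_text).group(1))
    return unique, total, total / unique


unique, total, mean_abundance = abundance_summary(LOG)
print(f"{unique} unique, {total} total, mean abundance {mean_abundance:.2f}")
```

A mean abundance of exactly 1 means no abundance information reached the chimera detection step.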

frederic-mahe commented 7 months ago

If clustering, then it's not ASV treatment anymore.

You are right. Dereplication of identical sequences is easy, whereas approximate dereplication is hard. That's an issue with Nanopore sequencing: reads need to be dereplicated for de novo chimera detection, but dereplication requires a certain level of grouping (ASV, OTU, or taxonomic grouping).

ImagoXV commented 7 months ago

Yep, so I guess the chimera detection step is totally useless in that context.

Maybe we could detect chimeras after alignment? That would give us a common basis for "dereplication"-like information.

I don't know.

frederic-mahe commented 7 months ago

We need to look at the new chimera detection algorithm in vsearch. It should now work for sequences with errors. However, I think it still requires sequences to have abundance values, so some form of clustering would be needed before chimera detection.
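
As a rough illustration of "a form of clustering before chimera detection", here is a minimal greedy centroid clustering sketch that produces abundance values. It is only a toy under stated assumptions: `difflib.SequenceMatcher` stands in for a real pairwise identity and is not how vsearch computes `--id`; the function names, sequences, and the 0.9 threshold are hypothetical and not a recommendation.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT vsearch's identity definition."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(reads, threshold):
    """Assign each read to the first centroid it matches at >= threshold,
    otherwise let it seed a new cluster. Returns (centroid, size) pairs."""
    clusters = []  # each entry is [centroid_sequence, size]
    for read in reads:
        for cluster in clusters:
            if similarity(read, cluster[0]) >= threshold:
                cluster[1] += 1
                break
        else:
            clusters.append([read, 1])
    # Most abundant cluster first, as abundance-aware chimera detection expects.
    return sorted(((c, n) for c, n in clusters), key=lambda kv: -kv[1])


# Toy reads (hypothetical): one amplicon, two single-substitution variants of it,
# and one unrelated amplicon.
reads = [
    "ATGCCGTAAGCTTGACCATG",
    "ATGCCGTAAGCTTGACCATC",
    "ATGCCGTAAGGTTGACCATG",
    "GGATCCTTAGCAACGTTGCA",
]

exact = greedy_cluster(reads, threshold=1.0)    # behaves like exact dereplication
relaxed = greedy_cluster(reads, threshold=0.9)  # tolerates a few errors

for name, result in (("exact", exact), ("relaxed", relaxed)):
    print(name, [(f"size={n}", c) for c, n in result])
```

With the relaxed threshold the three related reads collapse into one centroid that carries an abundance, which is the kind of information de novo chimera detection needs. The cost is the one raised earlier in this thread: the result is a cluster, not an ASV.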

ImagoXV commented 6 months ago

https://github.com/ImagoXV/NanoASV/tree/clustering_chimera_detection