ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Minority mutations used for final assembly (metaspades, hybrid ONT-SR) #1350

Open lborcard opened 2 months ago

lborcard commented 2 months ago

Description of bug

We performed a hybrid assembly using metaSPAdes (ONT + SR) with very high coverage. Nonetheless, in one of the final assemblies we discovered 5 minority SNPs that were present in the final assembly instead of the majority ones.
SR_Assembly_igv_5332-5371

spades.log

SPAdesHybrid-TBEV-Neud.log

params.txt

params.txt

SPAdes version

SPAdes version: 3.15.3

Operating System

Linux-4.18.0-513.11.1.el8_9.x86_64-x86_64-with-glibc2.28

Python Version

Python version: 3.9.6

Method of SPAdes installation

container

No errors reported in spades.log

asl commented 2 months ago

Can you run with read error correction disabled (--only-assembler)?

Overall, since you are running in metagenomic mode, this is expected, as the assembler assumes that there are multiple strains and the result is a consensus assembly.

lborcard commented 2 months ago

What is surprising is that it chose the lower-frequency variant, which is not the consensus at all. Running medaka on the nanopore reads picked the correct variant. Edit: to add to the first point, this only happened for this sample and was caught later in the process. At which stage should one verify that this is not happening? Does SPAdes create a VCF of some sort?

lborcard commented 2 months ago

Sorry to insist, but we would really like to understand this unexpected behaviour.

asl commented 1 month ago

It is very important to understand that assembly does not take a "consensus" from the reads. This is neither feasible nor intended.

This is especially true in metagenomic mode, as the assembler also tries to collapse strain differences to produce a backbone assembly for the metagenome. Note that the output of a metagenome assembly is not a combined assembly of individual species: for example, some between-species variation could be collapsed, and these repetitive sequences then resolved further.

Moreover, the assembler does not operate at the level of individual nucleotides, so it does not know about "rare variants" and the like. This is not a variant-calling problem, as no reference is available at all.
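Since the assembler itself gives no per-base guarantee, one way to catch such positions is to map the reads back to the finished contigs and compare each contig base with the most frequent read base. The snippet below is only a rough sketch of that comparison, not something SPAdes provides: the function name, the `min_depth` threshold, and the toy pileup counts are all made up for illustration, and in practice the per-position counts would come from whatever mapping/pileup tool you use.

```python
from collections import Counter


def find_minority_positions(contig, pileup_counts, min_depth=10):
    """Return positions where the contig base is not the most frequent read base.

    contig: assembled sequence as a string.
    pileup_counts: dict mapping 0-based position -> Counter of read bases
        observed there after mapping reads back to the contig.
    min_depth: hypothetical threshold; shallower positions are skipped.
    """
    flagged = []
    for pos, counts in sorted(pileup_counts.items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        majority_base, majority_count = counts.most_common(1)[0]
        contig_base = contig[pos].upper()
        if contig_base != majority_base:
            flagged.append({
                "pos": pos,
                "contig_base": contig_base,
                "majority_base": majority_base,
                "contig_fraction": counts.get(contig_base, 0) / depth,
                "majority_fraction": majority_count / depth,
            })
    return flagged


# Toy data, invented for illustration only.
contig = "ACGTAC"
pileup = {
    1: Counter({"C": 95, "T": 5}),   # contig base C is the majority
    3: Counter({"A": 80, "T": 20}),  # contig base T is a minority allele
    5: Counter({"C": 3}),            # too shallow, skipped
}
for hit in find_minority_positions(contig, pileup):
    print(hit)
```

Any flagged position is a candidate for polishing or manual inspection (e.g. in IGV) before the assembly is used downstream.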

In your case you can try disabling read error correction (--only-assembler), which might help, but in general do not expect nucleotide-level resolution from metagenomic assemblies.
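For reference, a re-run without read error correction could be set up roughly as below. This is just a sketch: the helper function and the file and directory names are placeholders, not taken from the attached params.txt; only the flags (`--meta`, `-1`, `-2`, `--nanopore`, `-o`, `--only-assembler`) are actual `spades.py` options.

```python
import shlex


def build_metaspades_command(r1, r2, nanopore, outdir, skip_error_correction=True):
    """Build a spades.py argument list for a hybrid metagenomic run.

    All paths are supplied by the caller; nothing is executed here.
    """
    cmd = [
        "spades.py", "--meta",
        "-1", r1, "-2", r2,        # paired-end short reads
        "--nanopore", nanopore,    # ONT long reads
        "-o", outdir,
    ]
    if skip_error_correction:
        # Skip the read error correction stage and run assembly only.
        cmd.append("--only-assembler")
    return cmd


# Placeholder file names, for illustration only.
cmd = build_metaspades_command(
    "sr_R1.fastq.gz", "sr_R2.fastq.gz", "ont.fastq.gz", "spades_no_ec"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` inside the container, and the new contigs compared against the original run at the affected positions.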

See https://pubmed.ncbi.nlm.nih.gov/28298430/ for more information on metagenomic assembly methods and output.