benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
474 stars 143 forks source link

Reads alternately splitted between closely related ASV sequences - technical replicates / time series. #1669

Open fconstancias opened 1 year ago

fconstancias commented 1 year ago

Hi @benjjneb,

Thanks for your efforts developping dada2 which contributed to revolutionise the analysis of 16S data.

We are working on the dynamic of fecal microbiota using in vitro systems and noticed that for some of the samples - technical replicates or consecutive samples in time-series - reads are alternately entirely assigned to one of two very closely related ASVs.

We noticed this issue due to very low prevalent ASV (or even only 1 in the whole dataset of 575 samples). We found examples like the one below from technical triplicates part of the same library.

E5E2B89C-4A03-4471-A711-3ABF0C5E595D

The two ASV sequences are very similar:

013B5C33-CB5F-438F-BBFF-33485B3F0434

A similar behaviour has been documented in this recent preprint - see supplementary information.

Technical details:

Library preparation: Latest EMP single-indexed 16S V4 protocol

Sequencing: Miseq 2 x 250 bp V2

Dada2 pipeline:

We combined several run - sequenced on the same machine and the same parameters were applied on the sequences generated on those Miseq runs before merging the ASV tables as described here. Additional clustering using dada2::collapseNoMis() was aplied to identify potential artefacts.

Overall quality profile of the run:

Ferm3_forward.pdf

Ferm3_reverse.pdf

Error profiles:

errors_Ferm3_fwd.pdf

errors_Ferm3_rev.pdf

Sequence summary of the three samples in highlighted above:

summary_example.xls

Any idea what could be happening here?

Thanks for the support.

fconstancias commented 1 year ago

I have checked/removed the presence of Illumina adaptors which did not change the pattern.

Changing the pooling strategy from "pseudo" to TRUE generated more consistent ASV profiles between the replicates.

See below (left :pool = "pseudo", right : pool = TRUE). I am only running the pipeline on those 3 replicates.

Screenshot 2023-01-18 at 10 57 07
benjjneb commented 1 year ago

If I am reading your images right, the problem ASVs/taxa is Tanneralacea/Parabacteroides, which looks like it is getting split into one of two ASVs depending on the sample?

I doubt this has anything to do with sequence quality (assuming all of these were sequenced on the same run). More likely there is something in the technical operation of the method, e.g. in alignment of heuristics, that is being inconsistent. Since no other taxa/ASVs seem affected, it doesn't seem likely to be an unresolved issue with sample-specific sequence laying around.

When you compare these two ASVs, what kind of differnces do they have between them? I wonder if something touchy like near-equal alignments might be introducing a gapping difference.

If you run collapseNoMismatch on the table, do these two ASVs get collapsed together?

fconstancias commented 1 year ago

If I am reading your images right, the problem ASVs/taxa is Tanneralacea/Parabacteroides, which looks like it is getting split into one of two ASVs depending on the sample?

Yes that's correct. We also observed less shared ASVs between the samples (DNA extraction replicates) on the left table (pool = "pseudo").

Assuming all of these were sequenced on the same run.

They are indeed coming from the same library/ run and were processed similarly (dada2::filterAndTrim() ... )

When you compare these two ASVs, what kind of differnces do they have between them?

The hamming distance between ASV sequences is 2, the ASV have the same length of 253nt. We can see the following differences comparing ASV04040 and ASV0467:

211290668-5007b8bd-4ffe-474d-9d56-ecfa0ff87357

| If you run collapseNoMismatch on the table, do these two ASVs get collapsed together?

We run dada2::collapseNoMismatch()as part of our dada2 pipeline, so this should not be. Additionally, the 2 ASV sequences cluster together when using threshold of 99% DECIPHER::Clusterize(cutoff= 0.01) not at 100%.

This pattern is reproducible when running the pipeline different times and also when processing only those three samples as well as with other samples (from the same run or including other - keeping independant ASV inference per run).

I am surprised to get rid of the pattern using pooling pool = TRUE (right sheet on the printscreen) instead of pseudo-pooling (left-sheet on the printscreen). Any intermediate files would be worth checking to understand what is going on here?

benjjneb commented 1 year ago

My guess is still that it is something to due with (near) identical alignments, but perhaps not to each other but to a more abundant ASV in each sample from which these ASVs are being split. There are diagnostics that can be used to look into that, but this starts to get into the undocumented guts of DADA2.

Would you be willing to try to put together a "minimal example", maybe just 2-3 samples that reproduce this behavior, and the basic accompanying code? If I can reproduce on my end, there are more threads to pull.

fconstancias commented 1 year ago

sure, here the link to get files/comands.

Let me know if you need any additional information.

fconstancias commented 1 year ago

Hi @benjjneb, were you able to reproduce the behaviour?

AMCMC commented 1 year ago

Hi,

I think this is a similar issue I had with ITS data. https://github.com/benjjneb/dada2/issues/553

If you would go trace the raw sequences that map to these ASVs you will likely see they will be very similar in distribution but the most abundant sequence (i.e the representative ASV) will differ.

This issue is not observed for pool=True because all sequences get assigned to the total most abundant sequence from the entire dataset, while with pseudo pooling the most abundant sequence of the sample is selected.