Reproducing DADA2 output of the Extreme data set

Hi, I am trying to reproduce the DADA2 output for the Extreme data set described in "DADA2: High-resolution sample inference from Illumina amplicon data" (2016). I obtained the data set using the accession number SRX1478507 and downloaded the reference sequences from the following repository: https://purl.stanford.edu/mh194vj6733. The following questions arose along the way:

The reference file (ExtremeRefSeqs.fasta) provided in that repository contains several different reference sequences for the same species, and some of them do not differ in their V4 region. Were all species listed in the reference sequences included in the protocol, even if they share the same V4 region? (Or did I make a mistake when extracting the V4 region?)
Additionally, when producing the ASV table with DADA2 and trying to align/match the ASVs to the reference sequences (ExtremeRefSeqs.fasta), I was not able to get a perfect match for Howardella ureilytica. Did you observe something similar in your analysis? In the supplementary material of your paper you mention some limitations of DADA2 regarding sequences that have very few error-free reads, especially if they are close to other sample sequences. Could this species be affected by a similar problem?
Finally, I also observed two ASVs that do not match any of the provided reference sequences. I used BLAST to check them: one mapped to an uncultured clone, while the other mapped to Anaerostipes caccae. Is this something you observed as well?

Thank you, Nicolas

> The reference sequences (ExtremeRefSeqs.fasta) provided in the mentioned repository contain different reference sequences for the same species. Some of them do not differ in their V4 region. Were all species listed in the reference sequences included in the protocol, even if they have the same V4 region? (Or did I mess up extracting the V4 region?)

All unique sequences at the full-length level (or near full-length) that we had for each of these strains were included in ExtremeRefSeqs.fasta. We did not condition on them being different in the V4 region to be included in that file.
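Because the file is deduplicated at the full-length level rather than on V4, it can help to group the references by their amplified region before comparing them to ASVs. The following is only a minimal sketch of that idea, not the procedure used for the paper: the helper names (`extract_region`, `group_by_region`) and the toy primers and sequences are hypothetical, and the real primer sequences for this data set would need to be taken from the study's methods.

```python
import re
from collections import defaultdict

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(seq, fwd_primer, rev_primer):
    """Return the sequence between the two primers (primers excluded).

    Returns None if either primer site is not found.
    """
    pattern = (primer_regex(fwd_primer) + "(.+?)"
               + primer_regex(revcomp(rev_primer)))
    match = re.search(pattern, seq.upper())
    return match.group(1) if match else None


def group_by_region(refs, fwd_primer, rev_primer):
    """Map each distinct amplified region to the reference names sharing it."""
    groups = defaultdict(list)
    for name, seq in refs.items():
        groups[extract_region(seq, fwd_primer, rev_primer)].append(name)
    return dict(groups)


# Toy data (hypothetical): two copies of one strain differ only outside
# the amplified region, so they collapse into a single group.
toy_refs = {
    "strainA_copy1": "GGGACGATTGCCGTACTAAGGG",
    "strainA_copy2": "TTTACGCTTGCCGTACAAACC",
    "strainB": "GGGACGATTGACGTACTAAGGG",
}
groups = group_by_region(toy_refs, fwd_primer="ACGM", rev_primer="TTWG")
for region, names in groups.items():
    print(region, names)
```

References that end up in the same group cannot be told apart by a V4 ASV, so a single ASV matching several entries of ExtremeRefSeqs.fasta is expected rather than a sign of a mistake in the extraction.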

> Additionally, when producing the ASV table with DADA2 and trying to align/match the ASVs to the reference sequences (ExtremeRefSeqs.fasta), I was not able to get a perfect match for Howardella ureilytica. Did you observe something similar in your analysis? In the supplementary material of your paper you mentioned some limitations of DADA2 regarding sequences that have very few error-free reads, especially if they are close to other sample sequences. Could it be a similar problem with this species?

I don't remember about H. ureilytica specifically, but yes, there were several strains in the Extreme mock community that were not detected. The manuscript text states there were 27 strains in Extreme, and Table 1 says DADA2-merged detected only 21 of them. That is partly due to a couple of strains that didn't produce any reads at all in the final sequence data, and partly due to DADA2 failing to detect some very low-abundance strains present in just O(1) reads.

> Finally, I also observed two ASVs which do not match any of the sequences provided in the reference file. I used BLAST to check them, and one of them mapped to an uncultured clone, while the other mapped to Anaerostipes caccae. Is this something you observed as well?

Yes. We described these as "Exact" ASVs in the manuscript; see, for example, Table 1. Our interpretation is that these are contaminants, which are a fundamentally different type of error from the ones that DADA2 and other ASV/OTU methods attempt to fix. You can read more about our thoughts, and about a method for dealing with contaminants, here: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0605-2
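The two bookkeeping checks discussed in this thread (which reference strains have no exactly matching ASV, and which ASVs match no reference) can be sketched as below. This is an illustrative outline under stated assumptions, not the evaluation code from the paper: `match_asvs` and the toy sequences are hypothetical, and exact substring matching assumes each ASV lies entirely within a full-length reference with primers already removed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def match_asvs(asvs, refs):
    """Exact-match ASVs against full-length references in either orientation.

    asvs: list of ASV sequences.
    refs: dict mapping reference name -> full-length sequence.
    Returns (hits, unmatched_asvs, undetected_refs).
    """
    hits = {}
    for asv in asvs:
        fwd = asv.upper()
        rev = revcomp(fwd)
        hits[asv] = [
            name for name, ref in refs.items()
            if fwd in ref.upper() or rev in ref.upper()
        ]
    # ASVs with no exact reference match: candidates for a BLAST lookup.
    unmatched_asvs = [asv for asv, names in hits.items() if not names]
    # References that no ASV matched exactly: candidate undetected strains.
    detected = {name for names in hits.values() for name in names}
    undetected_refs = [name for name in refs if name not in detected]
    return hits, unmatched_asvs, undetected_refs


# Toy data (hypothetical): one ASV matches strain1, one matches strain2 in
# reverse orientation, and one matches nothing.
toy_refs = {
    "strain1": "AAACCCGGGTTTACGT",
    "strain2": "TTTGGGCCCAAATGCA",
    "strain3": "ACACACGTGTGTCACA",
}
toy_asvs = ["CCCGGGTTT", "ATTTGGGCCC", "GATTACAGATTACA"]

hits, unmatched_asvs, undetected_refs = match_asvs(toy_asvs, toy_refs)
print("unmatched ASVs:", unmatched_asvs)
print("undetected references:", undetected_refs)
```

In this framing, an undetected reference may simply have contributed too few reads, while an unmatched ASV with an exact hit in a public database is a candidate contaminant rather than a denoising error.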

benjjneb / dada2

Reproducing DADA2 output of the Extreme data set #1361