Unable to replicate community composition of Mock-3

adityabandla commented 7 years ago

Hi,

I have been trying to benchmark my 16S pipeline (a combination of QIIME and UPARSE) using the mock-3 dataset. However, after several iterations (with different parameter values or databases), the relative abundances seem to be highly skewed towards Staphylococcus (approx. 50% in the "even" and 70% in the "staggered").

The number of OTUs detected is always close to the original number of strains. Even with the parameter values used in your articles, I do not get the expected composition.

nbokulich commented 7 years ago

That is normal. I have yet to see a mock community that replicates perfectly. Due to amplification bias, sequencing bias, and other technical errors (including possible contamination or human error during its construction), it is highly improbable that the observed composition of a mock community will match the expected one exactly. For practical purposes, when using these to benchmark methods, your "best" methods are those that give the result closest to the expected composition.
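As a rough illustration of "closest result", here is a minimal Python sketch that ranks pipelines by Bray-Curtis dissimilarity between expected and observed relative abundances. This is only one possible distance measure and is not necessarily what any published benchmark used. The genus names other than Staphylococcus and all counts are made-up placeholders, not the real mock-3 composition, and the function names are hypothetical.

```python
def relative_abundance(counts):
    """Convert raw counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity on relative abundances (0 = identical, 1 = disjoint)."""
    p = relative_abundance(expected)
    q = relative_abundance(observed)
    taxa = set(p) | set(q)  # taxa missing from one table count as zero
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)


# Placeholder data: NOT the real mock-3 composition.
expected = {"Staphylococcus": 25, "Genus_B": 25, "Genus_C": 25, "Genus_D": 25}
pipeline_a = {"Staphylococcus": 50, "Genus_B": 20, "Genus_C": 20, "Genus_D": 10}
pipeline_b = {"Staphylococcus": 30, "Genus_B": 25, "Genus_C": 25, "Genus_D": 20}

scores = {
    "pipeline_a": bray_curtis(expected, pipeline_a),
    "pipeline_b": bray_curtis(expected, pipeline_b),
}
best = min(scores, key=scores.get)  # lowest dissimilarity is "closest"
print(scores, best)
```

With real data you would collapse your OTU table to the same taxonomic level as the expected composition before comparing, so that both dictionaries use the same labels.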

Does that make sense? Please let me know if you have any more questions.

adityabandla commented 7 years ago

Thank you for the super quick reply! Is there a reference OTU table for this dataset, possibly computed using the values your group considers best, or with a pipeline that you used? This would be of great help for anyone who is trying to use these datasets to calibrate their pipelines.

Essentially, I am looking for an OTU count table against which I can compare the OTU table that I generated.

nbokulich commented 7 years ago

No, sorry — we recognize that this could be useful but intentionally do not post processed data here for a few reasons. First, this is to give users the most control and flexibility. Second, we could not post a "best method" example without thoroughly benchmarking and publishing those methods.

That said, you may be interested in this preprint and this GitHub project, in which we describe a standardized framework for comparing different taxonomy methods. The project contains OTU tables with and without taxonomy assignments, but those are NOT necessarily processed using the best possible pre-processing methods (e.g., quality filtering and OTU picking), so I do not recommend using them as a "gold standard" when benchmarking other pre-processing methods.

If you are testing different methods that take an OTU table as input, you may find these materials useful. Otherwise, it would be very easy to find a method that performs better than these for, e.g., OTU picking, as these tables were generated with fairly permissive methods.

For mock-3, you can find the taxonomy-free BIOM table and reference sequences here. Most other mock communities in mockrobiota are also used there, and you can find this information in the dataset-metadata.tsv files.

caporaso-lab / mockrobiota

Unable to replicate community composition of Mock-3 #50