benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

sparsity of dada2 generated tables #98

Closed wdwvt1 closed 8 years ago

wdwvt1 commented 8 years ago

I have been working with dada2 recently and it really seems like a great improvement over the open reference OTU picking approach. I have run into the same question on a couple of datasets and would like to solicit feedback from the community. Apologies for using the issue tracker for a technical/biological question, but I am unsure where else to get high-quality feedback (or spark a helpful discussion like issue #62).

For a given dataset, the percentage of total features that are found in only one sample (unique to a sample) is much higher for dada2 than for open reference OTU picking (when open reference singletons have been removed). In my tests (outlined below) it has been 2-8x higher. For other data I have looked at, it has been similarly high.

Case 1 - 1166 samples across several sequencing runs, dsv's found with dada2 according to the pipeline outlined in the tutorial, including PhiX removal and paired-end joining, with no pooling of reads with different error histories.

| method | samples | features | unique_feature_percent | uf_seq_mass_percent |
|---|---|---|---|---|
| dada2 | 1166 | 22850 | 77.31 | 6.90 |
| 97%_open_ref | 1164 | 3249664 | 93.95 | 10.78 |
| 97%_open_ref_mc2_pynast | 1164 | 301359 | 12.57 | 0.917 |

Case 2 - 604 samples across a single sequencing run, same workflow as above.

| method | samples | features | unique_feature_percent | uf_seq_mass_percent |
|---|---|---|---|---|
| dada2 | 604 | 8464 | 79.91 | 3.00 |
| dada2_pynast | 604 | 8117 | 79.86 | 2.94 |
| 97%_open_ref | 603 | 76359 | 82.45 | 1.21 |
| 97%_open_ref_mc2_pynast | 603 | 21999 | 40.97 | 0.746 |

mc2 refers to the removal of singleton OTUs (OTUs whose total count is 1). pynast refers to the removal of features whose representative sequence does not align to the GreenGenes 85% alignment template. unique_feature_percent is the number of features unique to a single sample divided by the total number of features. uf_seq_mass_percent is the total count of the features that are unique to a sample divided by the total count of all features.
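For readers who want to reproduce the two sparsity columns on their own table, here is a minimal sketch of the definitions above. It assumes the table is held as a plain Python dict mapping a feature id to its list of per-sample counts; the function name `sparsity_metrics` and the toy data are hypothetical and are not part of dada2 or QIIME.

```python
def sparsity_metrics(table):
    """Compute the two sparsity columns from a feature table.

    table: dict mapping feature id -> list of per-sample counts
           (hypothetical layout, chosen only for this sketch).
    """
    n_features = 0      # features with at least one read
    n_unique = 0        # features present in exactly one sample
    total_counts = 0    # reads summed over all features
    unique_counts = 0   # reads belonging to single-sample features

    for counts in table.values():
        feature_total = sum(counts)
        if feature_total == 0:
            continue  # ignore empty rows
        n_features += 1
        total_counts += feature_total
        n_present = sum(1 for c in counts if c > 0)
        if n_present == 1:
            n_unique += 1
            unique_counts += feature_total

    if n_features == 0:
        raise ValueError("table contains no observed features")

    return {
        "unique_feature_percent": 100.0 * n_unique / n_features,
        "uf_seq_mass_percent": 100.0 * unique_counts / total_counts,
    }


# Toy table: 4 features across 3 samples (made-up numbers).
toy = {
    "seq_a": [10, 5, 0],
    "seq_b": [0, 0, 3],
    "seq_c": [0, 2, 0],
    "seq_d": [4, 4, 4],
}
metrics = sparsity_metrics(toy)
print(metrics)
```

In the toy table, two of the four features occur in a single sample, yet they carry only 5 of the 32 reads. This is the same pattern as in the tables above: many unique features, little sequence mass.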

I have a lot more confidence that the unique features from the dada2 table are real, but they still make a lot of analyses hard. In particular, the reduction in shared features reduces the quality of the machine learning approaches we've applied to traditional OTU tables. We have thought about a couple strategies:

  1. Just remove features unique to a sample.
  2. Cluster the resultant dada2 sequences at some percentage identity (say 99%). In a basic test this reduced the number of features by about 50%.
  3. Collapse features on taxonomy (this seems pretty suboptimal given the massive loss in resolution, but commonly works well for a subset of things).
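To make strategy 2 concrete, the following is a rough sketch of greedy, abundance-ordered clustering. It is a toy, not what any particular clustering tool does: it assumes all sequences have equal length and are already aligned, so identity is just the fraction of matching positions, whereas real tools align sequences of differing length. The names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions.

    Assumes equal-length, pre-aligned sequences; anything else scores 0.
    """
    if not a or len(a) != len(b):
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_cluster(abundances, threshold=0.99):
    """Merge sequences into the first centroid they match at >= threshold.

    abundances: dict mapping sequence -> total read count.
    Returns a dict mapping centroid sequence -> summed read count.
    """
    clusters = {}
    # Visit the most abundant sequences first so they become centroids;
    # ties are broken by sequence to keep the result deterministic.
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += count
                break
        else:
            clusters[seq] = count  # no centroid close enough: new cluster
    return clusters


# Toy example with 100-nt sequences (made-up data).
base = "ACGT" * 25
one_mismatch = "T" + base[1:]       # 99% identical to base
three_mismatch = "TTT" + base[3:]   # 97% identical to base

clustered = greedy_cluster(
    {base: 100, one_mismatch: 10, three_mismatch: 5}, threshold=0.99
)
print(len(clustered), "clusters")
```

Here the one-mismatch variant is absorbed into the abundant centroid while the three-mismatch variant stays separate. That shows the trade-off directly: fewer features, at the cost of single-nucleotide resolution.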

What do you suggest @benjjneb? I would love any insight or feedback you have, and am happy to discuss this with everyone in general.

Best, Will

spholmes commented 8 years ago

Will, just my grain of salt: the output from dada2 cannot be treated in the same way as old-fashioned OTUs, as there are fewer of them and they are much higher resolution. Our workflow always involves:

  1. Generating the dada2 table and identifying the taxa as far as possible from the references (and, when you really care, BLAST).
  2. Saving this as a phyloseq object.
  3. Filtering with the phyloseq filters, which allow you, for instance, to retain only RSVs that occur (have at least 2 reads) in at least 3 samples.
  4. Carrying forward with the machine learning methods.

Then we often redo 3) and 4), changing the filters to see how robust the results are and how tuning the filters (kOverA functions) changes them.

We learn a lot by iterating 3) and 4), both about what constitutes the core and about interpretability.
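The filter-and-iterate loop in steps 3) and 4) can be sketched outside R as follows. This is only an illustration in plain Python, not the phyloseq or genefilter API: `k_over_a` is a hypothetical stand-in that keeps a feature when at least `k` samples have a count greater than `a`, so "at least 2 reads in at least 3 samples" corresponds to `k=3, a=1`. The table layout and toy counts are assumptions too.

```python
def k_over_a(k, a):
    """Build a predicate: True if at least k counts are strictly greater than a."""
    def keep(counts):
        return sum(1 for c in counts if c > a) >= k
    return keep


def filter_table(table, keep):
    """Return only the features whose per-sample counts satisfy `keep`."""
    return {feature: counts for feature, counts in table.items() if keep(counts)}


# Toy table: feature id -> counts in 4 samples (made-up numbers).
toy = {
    "seq_a": [5, 3, 2, 0],
    "seq_b": [9, 0, 0, 0],   # abundant, but in a single sample
    "seq_c": [1, 1, 1, 1],   # widespread, but only 1 read per sample
    "seq_d": [2, 2, 0, 7],
}

# Sweep a few filter settings to see how sensitive the kept set is.
results = {}
for k, a in [(2, 0), (3, 1), (3, 4)]:
    kept = filter_table(toy, k_over_a(k, a))
    results[(k, a)] = sorted(kept)
    print(f"k={k}, A={a}: kept {sorted(kept)}")
```

Re-running the downstream analysis on each filtered table is then a matter of looping over `results`. Features that survive across a range of settings are candidates for the "core" mentioned above.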

Hope this helps, Susan

On Wed, Jul 27, 2016 at 7:53 PM, Will Van Treuren notifications@github.com wrote:

Clearly, removing

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

wdwvt1 commented 8 years ago

Hi Susan, Thanks very much for the reply - that makes a lot of sense.

What kind of artifacts do you think it introduces to create 99% OTUs (or some similar clustering) on top of the RSV's? For our application, having the exquisite resolution that dada2 can provide might not be necessary.

benjjneb commented 8 years ago

Hi Will,

Depending on your intended analysis, removing the sequence variants that appear in only one sample is a sensible approach, and on large datasets such as the ones you are working with it is part of my standard workflow.

Single-sample features are unlikely to be informative for classification tasks, as by definition they can only differentiate between the single sample they are in and all others. Thus, if you are doing classification on the table, they serve largely as nuisance columns.

There is a second, dada2-specific reason why removing single-sample variants is reasonable. The default dada2 workflow identifies sequence variants from each sample independently. This is a valid procedure, even though dada2 is a de novo method, because exactly inferred sequences are consistent labels (unlike e.g. denovo_OTU_22). Because each sample is processed independently, false positives (FPs), although very rare on a per-sample basis, are largely uncorrelated across samples.

This property means that most of the dada2 FPs will be found in just one sample, which contributes to the pattern you are seeing. While this inflates the total-study number of OTUs, it also makes prevalence filtering (e.g. >1 sample) very effective at removing residual FPs.

Hope that helps. In short, given what you said you're trying to do, I would probably just remove the single-sample variants. Clustering at a coarser level is also a valid approach, as long as you are willing to sacrifice the resolution.

PS: dada2 can also be run in pool=TRUE mode, in which case all samples are pooled together for sample inference, as in standard de novo OTU approaches. This mode is more computationally taxing, however, and currently scales to MiSeq-scale data but not to HiSeq-scale data.