benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Logic of the prevalence based method #32

Open mniku opened 5 years ago

mniku commented 5 years ago

Thanks for the great tool!

Regarding the prevalence method, do I get it right that it only considers the presence/absence of an ASV (etc) in samples and controls (rather than prevalences within each sample, understood as relative proportion of the ASV among all the sequence reads from that sample)? I find this quite simplistic, as it means that an ASV detected at very low level in all negative controls would be classified as contaminant even if it makes up most of the data in ALMOST all samples.

I do understand that the tool is not meant to deal with cross-contamination, but especially when working with human/animal microbiota, some of the real ASVs could well be present also as reagent/sampling instrument contaminants. It is of course problematic to use the MiSeq-based absolute counts for sample-wise comparison, as the method is not quantitative. And on the other hand, using relative abundances is problematic when the original biomass in controls differs from the samples. But still it seems a bit brutal to disregard the quantification data completely.

I would highly appreciate your comments on this reasoning.

benjjneb commented 5 years ago

> Regarding the prevalence method, do I get it right that it only considers the presence/absence of an ASV (etc) in samples and controls (rather than prevalences within each sample, understood as relative proportion of the ASV among all the sequence reads from that sample)?

Correct, the prevalence method considers presence/absence patterns only, and does not consider the relative abundances (or proportion) of taxa within a sample.

> I find this quite simplistic, as it means that an ASV detected at very low level in all negative controls would be classified as contaminant even if it makes up most of the data in ALMOST all samples.

That scenario would not lead to a contaminant classification. If an ASV is found in all real samples and in all negative controls, it would have the same prevalence in true samples and controls (i.e. be found in the same fraction of samples in each category). It would receive a score of 0.5 in that case, and not be classified as a contaminant under any normal P* threshold.
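
A minimal sketch of the point above, using made-up read counts for a single ASV. The `prevalence` helper is hypothetical and is not part of decontam; it only shows that once counts are reduced to presence/absence, an ASV that dominates every true sample and appears at trace levels in every negative control has the same prevalence in both groups.

```python
def prevalence(counts):
    """Fraction of libraries in which the ASV has at least one read."""
    return sum(1 for c in counts if c > 0) / len(counts)

# Hypothetical read counts for one ASV (illustrative numbers only).
sample_counts = [5200, 4800, 6100, 3900, 5500]  # dominant in every true sample
control_counts = [3, 1, 2]                       # trace amounts in every negative control

prev_samples = prevalence(sample_counts)
prev_controls = prevalence(control_counts)

# Abundance is ignored: both groups have the ASV in 100% of libraries.
print(prev_samples, prev_controls)
```

With equal prevalence in both groups, there is no evidence that the ASV is more common in controls, which is why the score described above sits at 0.5.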

> I do understand that the tool is not meant to deal with cross-contamination, but especially when working with human/animal microbiota, some of the real ASVs could well be present also as reagent/sampling instrument contaminants.

It is true that decontam is not meant to remove cross-contaminants, but it is designed to be more robust against falsely removing true taxa that are present in negative controls due to cross-contamination than ad hoc approaches like removing taxa present in >X negative controls. This is achieved by requiring the fraction of negative controls in which a taxon is found to be higher than the fraction of true samples in which it is found. This works well to avoid false positives in the canonical index-switching form of cross-contamination, as index switches happen with equal probability from a true sample to another true sample or to a negative control.
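
To make the contrast concrete, here is a rough sketch with hypothetical data. Both functions are illustrative and are not decontam's API: `adhoc_flag` stands in for the "present in >X negative controls" rule, and `prevalence_flag` checks only the condition described above (higher prevalence in controls than in true samples), without the statistical test and threshold that the actual method applies on top of it.

```python
def prevalence(present):
    """Fraction of libraries in which the taxon is present."""
    return sum(present) / len(present)

def adhoc_flag(control_present, max_controls):
    """Ad hoc rule: flag if present in more than max_controls negative controls."""
    return sum(control_present) > max_controls

def prevalence_flag(sample_present, control_present):
    """Flag only if prevalence in controls exceeds prevalence in true samples."""
    return prevalence(control_present) > prevalence(sample_present)

# Hypothetical true taxon: present in 18 of 20 samples, and in 2 of 4
# negative controls (for example through index switching).
sample_present = [True] * 18 + [False] * 2
control_present = [True, True, False, False]

adhoc = adhoc_flag(control_present, max_controls=1)
prev = prevalence_flag(sample_present, control_present)

print("ad hoc rule flags it:", adhoc)
print("prevalence comparison flags it:", prev)
```

In this made-up case the ad hoc rule would discard the taxon, while the prevalence comparison keeps it because it is found in 50% of controls but 90% of true samples.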

mniku commented 5 years ago

Thanks for the comment. However I would still like to ask about your reasons NOT to take into account the actual abundances, instead of bare presence/absence. Also, I don't completely understand how an ASV with equal prevalence in controls and samples could be regarded as non-contaminant?

benjjneb commented 5 years ago

> Thanks for the comment. However I would still like to ask about your reasons NOT to take into account the actual abundances, instead of bare presence/absence.

The frequency method uses the abundance information (transformed to proportions), and will happily make use of negative control data as well. The prevalence method is a presence/absence method, and is based on many literature reports, as well as testing, showing that contaminants are present in a higher fraction of negative controls than of true samples.

> Also, I don't completely understand how an ASV with equal prevalence in controls and samples could be regarded as non-contaminant?

In theory, or in the method? In theory, that could be a cross-contaminant, so the conservative classification would be as a non-contaminant. In the algorithm, that ASV would be assigned a score of 0.5, so as long as P* is <= 0.5, it would be classified as a non-contaminant.
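
One way to see where a score of 0.5 could come from is a one-sided two-proportion z-test, sketched below. This is an illustration under stated assumptions, not decontam's actual code: the function names are hypothetical, the choice of test is a stand-in, the handling of the degenerate case (taxon present in every library, or in none) is an assumption, and the strict `score < threshold` comparison is assumed so that it agrees with the statement above about P* <= 0.5.

```python
import math

def prevalence_score(ctrl_present, ctrl_total, samp_present, samp_total):
    """One-sided p-value for 'prevalence is higher in controls than in samples'.

    Small scores suggest a contaminant; 0.5 means no evidence either way.
    """
    p_ctrl = ctrl_present / ctrl_total
    p_samp = samp_present / samp_total
    pooled = (ctrl_present + samp_present) / (ctrl_total + samp_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ctrl_total + 1 / samp_total))
    if se == 0:
        # Assumption: present everywhere or nowhere gives no evidence either way.
        return 0.5
    z = (p_ctrl - p_samp) / se
    # Upper-tail probability of the standard normal, P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))

def is_contaminant(score, threshold):
    """Assumed rule: contaminant only if the score is strictly below the threshold."""
    return score < threshold

# Equal prevalence (3 of 6 controls, 10 of 20 samples) gives z = 0, score 0.5.
equal_score = prevalence_score(3, 6, 10, 20)
print(equal_score, is_contaminant(equal_score, threshold=0.5))

# Hypothetical contaminant-like pattern: 5 of 6 controls, 2 of 20 samples.
contam_score = prevalence_score(5, 6, 2, 20)
print(contam_score, is_contaminant(contam_score, threshold=0.1))
```

Under these assumptions, equal prevalence never falls below a threshold of 0.5 or lower, while a taxon concentrated in the controls receives a score close to 0.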

mniku commented 5 years ago

Thanks, I'm happy now!

st01565 commented 5 years ago

Hi Ben, I struggle with understanding the 2x2 table that calculates the chi2 values for the prevalence based method. Could you perhaps show how some fake data for an ASV would be put into a table?

Cheers,

Rune

benjjneb commented 5 years ago

@st01565 2x2 table of present/absent in controls/non-controls
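
As a possible illustration with made-up data (not taken from decontam's code or documentation), the sketch below fills such a 2x2 table for one ASV and computes a plain Pearson chi-square statistic from it. The helper names are hypothetical, the row/column orientation is a choice made here, no continuity correction is applied, and the statistic assumes every row and column total is nonzero.

```python
# Hypothetical presence/absence of one ASV (illustrative data only).
control_present = [True, True, True, False]   # found in 3 of 4 negative controls
sample_present = [True] * 2 + [False] * 8     # found in 2 of 10 true samples

def two_by_two(control_present, sample_present):
    """Rows: controls, non-controls. Columns: present, absent."""
    a = sum(control_present)
    b = len(control_present) - a
    c = sum(sample_present)
    d = len(sample_present) - c
    return [[a, b], [c, d]]

def pearson_chi2(table):
    """Pearson chi-square statistic, without continuity correction."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

table = two_by_two(control_present, sample_present)
chi2 = pearson_chi2(table)

print("              present  absent")
print(f"controls      {table[0][0]:7d} {table[0][1]:7d}")
print(f"non-controls  {table[1][0]:7d} {table[1][1]:7d}")
print("chi-square statistic:", chi2)
```

Here the ASV is present in 75% of controls but only 20% of true samples, which is the direction of imbalance the prevalence method looks for.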