benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Input of isContaminant(): relative or absolute count table of ASVs/OTUs? #45

Closed antonioggsousa closed 5 years ago

antonioggsousa commented 5 years ago

Hello,

This is my first time using the decontam package. I'm trying to reproduce the decontam tutorial (link: https://benjjneb.github.io/decontam/vignettes/decontam_intro.html ) with my own data. My question concerns the input table of the isContaminant() function. How should the data be transformed: should it be a relative or an absolute count table of ASVs/OTUs?

The tutorial and the official repository state that "The first decontam ingredient is a feature table derived from your raw data, i.e. a table of the relative abundances of sequence features (columns) in each sample (rows)." However, when I tried to reproduce the tutorial, the phyloseq object ps contained absolute counts of ASVs, and that is the table given as input to isContaminant(). Did I miss a step?

Thanks in advance! Cheers, António

benjjneb commented 5 years ago

Either format. By default the provided table is normalized to proportions within the function itself (see the normalize parameter in isContaminant).

> The tutorial and the official repository state that "The first decontam ingredient is a feature table derived from your raw data, i.e. a table of the relative abundances of sequence features (columns) in each sample (rows)." However, when I tried to reproduce the tutorial, the phyloseq object ps contained absolute counts of ASVs, and that is the table given as input to isContaminant(). Did I miss a step?

A bit of a technical detail: While read counts are absolute count values, they reflect the relative abundances (not the absolute abundances) of the taxa because the total read count in each sample is arbitrary. Hence "relative abundance" is being used here to refer to a table of read counts or of proportions.
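To illustrate the point above, here is a minimal Python sketch of row-wise normalization to proportions. It is not decontam's code (decontam is an R package); the function name `normalize_rows` and the example counts are made up for illustration. It only shows that a table of raw counts and a table of proportions carry the same information once each sample is scaled to sum to 1, and that sequencing a sample more deeply does not change the result.

```python
def normalize_rows(table):
    """Divide each row (sample) by its total so that every row sums to 1."""
    normalized = []
    for row in table:
        total = sum(row)
        if total == 0:
            raise ValueError("cannot normalize a sample with zero total reads")
        normalized.append([value / total for value in row])
    return normalized


# Hypothetical read counts: 2 samples (rows) x 3 sequence features (columns).
counts = [
    [50, 30, 20],
    [500, 1500, 3000],
]

# Counts converted to proportions.
proportions = normalize_rows(counts)

# Normalizing a table that already holds proportions leaves it unchanged
# (up to floating-point rounding).
renormalized = normalize_rows(proportions)

# The first sample sequenced ten times deeper: larger counts, same proportions.
deeper = [[10 * value for value in counts[0]], counts[1]]
deeper_proportions = normalize_rows(deeper)

for row in proportions:
    print([round(value, 3) for value in row])
```

In this sketch, `proportions`, `renormalized` and `deeper_proportions` agree up to rounding, which is why either input format works when the table is normalized inside the function.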

antonioggsousa commented 5 years ago

Thanks @benjjneb!

bextra commented 3 years ago

This thread is helpful for understanding the desired input with respect to normalization. At first, I ran into issues when providing a feature table of reads per million (RPM), i.e. floating-point relative abundances, instead of an integer matrix. Is it better to work with raw read counts, as opposed to normalizing beforehand or relying on decontam's own normalization? From your documentation for the normalize parameter:

normalize (Optional). Default TRUE. If TRUE, the input seqtab is normalized so that each row sums to 1 (converted to frequency). If FALSE, no normalization is performed (the data should already be frequencies or counts from equal-depth samples).

However, doesn't normalizing to 1 go against the principles of CoDA (compositional data analysis), which is more appropriate for metagenomic data?

bextra commented 3 years ago

tagging @benjjneb just in case since this is marked as closed :-) thank you in advance for your reply

benjjneb commented 3 years ago

> However, doesn't normalizing to 1 go against the principles of CoDA which is more appropriate for metagenomic data?

Normalizing to 1 is no better or worse for CoDA analyses than the arbitrary sequence depth "normalizations" that count data would have. The point of CoDA is that results should be independent of the overall sample sum.

For decontam itself, proportions are fine, in part because of the assumption that the contaminant fraction of the sample mixture is small relative to the non-contaminant fraction. See the paper for more details on how that works: Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data
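As a small illustration of the statement that CoDA results should not depend on the overall sample sum, the Python sketch below applies the centered log-ratio (CLR) transform, a standard CoDA transform, to the same hypothetical sample expressed as raw counts, as proportions, and as reads per million. CLR is used here only as an example of a scale-invariant quantity; it is not what decontam computes, and the helper names and numbers are assumptions made for this sketch.

```python
import math


def clr(sample):
    """Centered log-ratio: log of each value minus the mean of the logs.

    Assumes all values are strictly positive; zeros would need separate
    handling before taking logarithms.
    """
    logs = [math.log(value) for value in sample]
    mean_log = sum(logs) / len(logs)
    return [log_value - mean_log for log_value in logs]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# One hypothetical sample expressed on three different scales.
counts = [50, 30, 20]
total = sum(counts)
proportions = [value / total for value in counts]      # sums to 1
rpm = [value / total * 1_000_000 for value in counts]  # sums to 1e6

clr_counts = clr(counts)
clr_proportions = clr(proportions)
clr_rpm = clr(rpm)

# Both differences are zero up to floating-point rounding.
print(max_abs_diff(clr_counts, clr_proportions))
print(max_abs_diff(clr_counts, clr_rpm))
```

Multiplying a sample by a constant adds the same amount to every log and to their mean, so the constant cancels. That is the sense in which scaling to 1, to one million, or to an arbitrary sequencing depth are equivalent for this kind of analysis.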