benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Can I use decontam with these data? #101

Open lyonskvn opened 3 years ago

lyonskvn commented 3 years ago

Hi there,

I'm very excited to try using decontam, but I'm not sure it'll work with our data. I'd be grateful for any advice.

I've got the following phyloseq object (16 true samples and 5 control samples):

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18039 taxa and 21 samples ]
## sample_data() Sample Data:       [ 21 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 18039 taxa by 7 taxonomic ranks ]

True samples: For each sample, we subjected a large volume of groundwater (~200 L) to ultrafiltration. Material stuck to the filter was backflushed with a 'backflush' solution. DNA and RNA were extracted from this 'backflush' solution. DNA and cDNA libraries were prepared according to the Illumina MiSeq protocol, and 2x300 bp paired-end 16S rRNA amplicon sequencing of the V3–V4 region was carried out on the Illumina MiSeq platform.

Controls: Most are 'backflush' (BF) controls (i.e. same processing as samples but with unused 'backflush' solution). I think the other control (extctrl_DNA) was just to check for contamination coming from the extraction kit itself. I'm not 100% sure of the details, because the extractions were performed by a collaborator. I should probably check this.

Long story short, however, our data seem quite different to the data used in the decontam vignette:

Issue 1: Some of our control samples have rather large library sizes.

[Figure: library sizes of the true samples and control samples]

Issue 2: I don't have specific quant_reading values for 4 samples and 2 controls, because the values were below detection limits.

Should I just give up now? Or is there still a way to use decontam on these data?

Could I try using the 'prevalence' method? Or do you see some red flags that I don't? I'm a novice.

Really appreciate your time!

Kevin

benjjneb commented 3 years ago

I applaud the negative controls you collected. It sounds like they are (or are very close to) full sampling-instrument controls, which are the best kind.

Considering your results, I would also have some serious concerns about the contaminant fraction of your data. The expectation for control samples is that their starting DNA concentration will be low enough that, even after the library normalization step, they produce fewer reads. That did not happen here at all; in fact, two control samples had the highest read counts.
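For anyone reading along, this kind of read-count comparison can be made directly from the phyloseq object, much as in the decontam introductory vignette. A minimal sketch, assuming `ps` is the phyloseq object and `Sample_or_Control` is a (hypothetically named) sample_data column distinguishing controls from true samples:

```r
library(phyloseq)
library(ggplot2)
# Assumes `ps` is the phyloseq object; `Sample_or_Control` is a
# hypothetical sample_data column marking controls vs. true samples.
df <- as.data.frame(sample_data(ps))
df$LibrarySize <- sample_sums(ps)
df <- df[order(df$LibrarySize), ]
df$Index <- seq(nrow(df))
ggplot(df, aes(x = Index, y = LibrarySize, color = Sample_or_Control)) +
  geom_point()
```

If the controls interleave with (or exceed) the true samples in this plot, that is the warning sign being discussed here.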

lyonskvn commented 3 years ago

Thanks for your response, @benjjneb. Do you have any advice on how to proceed? Can I use decontam on these data?

benjjneb commented 3 years ago

I would have some serious reservations about using decontam on these data. decontam's isContaminant implicitly assumes that most of the reads in real samples are coming from sample DNA. Given the lack of separation in the read counts between control and real samples here, that assumption is questionable. That said, I don't see data on the measured DNA concentrations of each sample; are they higher in the samples than in the controls?
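One quick way to check that is to compare the measured concentrations between the two groups. A minimal sketch, assuming the concentrations live in a sample_data column named `quant_reading` and a logical column `is.neg` marks the controls (both hypothetical names; samples with below-detection-limit values stored as NA are simply dropped by the plot):

```r
library(phyloseq)
# Assumes `quant_reading` holds DNA concentrations and `is.neg` is TRUE
# for control samples (both hypothetical sample_data column names).
df <- as.data.frame(sample_data(ps))
boxplot(quant_reading ~ is.neg, data = df,
        names = c("true samples", "controls"),
        ylab = "DNA concentration")
```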

I'm not sure the prevalence method is the answer here either: the number of control samples isn't that large, and these read counts suggest that even the real samples may be mostly contamination.

I do think decontam can usefully rank your taxa as more or less likely to be contaminants, and further manual inspection of the highest- and lowest-ranked taxa could be helpful (see, for example, our analysis of placental microbiome data in the decontam paper).
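A minimal sketch of that ranking, using the prevalence method as an example and assuming `ps` is the phyloseq object with a logical sample_data column `is.neg` marking the controls (hypothetical name). isContaminant returns a data.frame whose `p` column is the score; lower means more contaminant-like:

```r
library(decontam)
library(phyloseq)
# Score every taxon; lower `p` means more contaminant-like.
contamdf <- isContaminant(ps, method = "prevalence", neg = "is.neg")
# Most contaminant-like taxa, for manual inspection:
head(contamdf[order(contamdf$p), ])
# Least contaminant-like taxa:
head(contamdf[order(-contamdf$p), ])
```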

Rob-murphys commented 1 year ago

Sorry to necro this thread, but I am in a similar situation. My controls have very large library sizes, despite low read counts in the demultiplexed data and low initial DNA concentrations when pooling for sequencing.

I am curious about your statement:

isContaminant implicitly assumes that most of the reads in real samples are coming from sample DNA. Given the lack of separation in the read counts between control and real samples here, that assumption is questionable.

Why does the figure above call that assumption into question? The assumption would only come into question if the features in the controls and samples overlapped a lot, and that may not be the case here. From looking at my feature table, I can see that my controls have a LOT of reads in very few features.

Given the low read counts of my controls vs. samples, and the low DNA concentrations of my controls vs. samples, do you think I am okay to use decontam?

benjjneb commented 1 year ago

Why does the figure above call that assumption into question? The assumption would only come into question if the features in the controls and samples overlapped a lot, and that may not be the case here. From looking at my feature table, I can see that my controls have a LOT of reads in very few features.

Given the low read counts of my controls vs. samples, and the low DNA concentrations of my controls vs. samples, do you think I am okay to use decontam?

This additional information suggests that your negative controls are informative, and are representing a different part (presumably the contaminant part) of the mixture in your real samples. So given that additional information, I would probably go ahead and apply the decontam prevalence method. That said, I would at least poke at the top few identified contaminants to see if they "make sense", that is, that they aren't taxa expected in the environment.
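A sketch of that workflow, again assuming a phyloseq object `ps` with a logical sample_data column `is.neg` marking the negative controls (hypothetical name) and decontam's default score threshold:

```r
library(decontam)
library(phyloseq)
# Flag contaminants by their prevalence in controls vs. true samples.
contamdf <- isContaminant(ps, method = "prevalence", neg = "is.neg")
table(contamdf$contaminant)
# Sanity-check the taxonomy of the strongest contaminant calls:
top <- rownames(contamdf)[order(contamdf$p)][1:10]
tax_table(ps)[top, ]
# If the calls make sense, remove the flagged taxa:
ps.noncontam <- prune_taxa(!contamdf$contaminant, ps)
```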

Rob-murphys commented 1 year ago

That makes sense. Thank you :)

What if one is not too sure what should be considered expected in that environment, given that it has been sparsely sequenced in the past?

benjjneb commented 1 year ago

One does what one can, I suppose.