benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

very cross-contaminated samples #131

Open lauraDRH opened 1 year ago

lauraDRH commented 1 year ago

Hi!

I am trying to use this package to remove as many contaminants as I can from samples that are already cross-contaminated. For context, this data comes from NovaSeq sequencing, so I am working with ESVs. I have PCR and extraction negatives.

The number of reads is highly variable, and as you can see, the negatives sometimes have as many reads as the samples, or more:

[image]

The histogram and the prevalence plot (0.5) look like this:

[image] [image]

At this point I am a bit lost on what to do... I can see it will be very difficult for me to remove many contaminants, but I would like to remove as many as I can. What threshold would you recommend?

Any help would be much appreciated!

benjjneb commented 1 year ago

What sample type is this? Is it a sample type in which many samples may be effectively "sterile" (i.e. have no true resident microbiome)?

The alternative approach to removing contaminants is to instead try to "rule things in" as non-contaminants, and only work with the fraction of ASVs for which there is solid evidence that they aren't just contamination.

The intermingling of your extraction negatives (which are vastly preferable to PCR negatives) and real samples in terms of read depth, and the long flat distribution of decontam scores after a small but noticeable mode at score ~ 1 (strong evidence not a contaminant) would be consistent with the scenario I described.
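As a rough illustration of the "rule things in" idea (a sketch only, not decontam's implementation), the logic amounts to keeping just the ASVs whose score sits near 1. The function name `rule_in`, the 0.9 cutoff, and the example scores below are all hypothetical, and the scores are assumed to follow the isContaminant convention used in this thread, where values near 1 mean strong evidence of not being a contaminant.

```python
def rule_in(scores, cutoff=0.9):
    """Return the sorted ids of ASVs whose score is at or above the cutoff.

    scores: dict mapping ASV id -> score in [0, 1], where values near 1
            mean strong evidence that the ASV is NOT a contaminant.
    ASVs with a missing score (None) are never ruled in.
    """
    return sorted(
        asv for asv, score in scores.items()
        if score is not None and score >= cutoff
    )


# Hypothetical scores, for illustration only.
example_scores = {
    "ASV1": 0.98,
    "ASV2": 0.45,
    "ASV3": 0.03,
    "ASV4": None,  # too few observations to score
    "ASV5": 0.91,
}

print(rule_in(example_scores, cutoff=0.9))
```

Everything below the cutoff is dropped, including ASVs that are merely ambiguous, which is the price of working only with well-supported taxa. The package also ships an isNotContaminant function aimed at this kind of low-biomass setting; check its documentation for how its scores and threshold are defined before reusing a cutoff like the one above.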

lauraDRH commented 1 year ago

Hi! Thanks so much for your answer. The samples are extracted DNA from very small unicellular organisms, so they should not be sterile at all, although they would have a very small amount of total DNA because of their size. So, in the end, it could make sense that a little contamination makes everything look the same.

That alternative approach seems interesting, as the isContaminant function is classifying as non-contaminants some taxa that are clearly contaminants (associated with human skin, for example). The only problem I have is that the sequencing depth is so high that I have hundreds of ESVs.

I have now used the combined method (threshold 0.1) and the number of contaminants has decreased significantly: [image]

Do you recommend using the batch approach? I would have 1 PCR negative for 95 samples and 6 extraction negatives for 300 samples.

benjjneb commented 1 year ago

The samples are extracted DNA from very small unicellular organisms, so they should not be sterile at all, although they would have a very small amount of total DNA because of their size.

The DNA is from isolated unicellular organisms? And you are amplifying some gene from there? Or is this some other sample type, and you are amplifying a gene that is a marker for some class of small unicellular organisms in which you are interested?

the isContaminant function is classifying as non-contaminants some taxa that are clearly contaminants (associated with human skin, for example)

That would perhaps be a red flag for this sequencing data. When you look at the ASVs with the highest scores (i.e. the best evidence that they are not contaminants), what do you see? What you expect? Or skin microbes?

Do you recommend using the batch approach? I would have 1 PCR negative for 95 samples and 6 extraction negatives for 300 samples.

I don't know what you mean here. "batches" are not related to different types of negative controls. The extraction controls are better for decontam.

lauraDRH commented 1 year ago

Thank you so much for your reply again!

The DNA is from isolated unicellular organisms? And you are amplifying some gene from there? Or is this some other sample type, and you are amplifying a gene that is a marker for some class of small unicellular organisms in which you are interested?

It is prokaryotic DNA extracted from marine eukaryotic organisms. So basically I am studying the bacteria that live inside those eukaryotes. The sequences are one variable region of the 16S rRNA gene, which is typically used to sequence prokaryotes.

That would perhaps be a red flag for this sequencing data. When you look at the ASVs with the highest scores (i.e. the best evidence that they are not contaminants), what do you see? What you expect? Or skin microbes?

The vast majority of them are expected marine bacteria, which is good, but then, for example, I have an ASV with a pval of 0.96 that is a skin microbe (with 100% taxonomic assignment). In the same way, marine cyanobacteria should not be considered contaminants, but I have one, for example, that has a pval of 0.03.

I don't know what you mean here. "batches" are not related to different types of negative controls. The extraction controls are better for decontam.

Sorry, I expressed myself poorly. So you recommend running the decontamination using only the extraction negatives?

benjjneb commented 1 year ago

The vast majority of them are expected marine bacteria, which is good, but then, for example, I have an ASV with a pval of 0.96 that is a skin microbe (with 100% taxonomic assignment). In the same way, marine cyanobacteria should not be considered contaminants, but I have one, for example, that has a pval of 0.03.

If the vast majority look correct, then you are pretty safe to go forward. One can do additional filtering based on taxonomy, or at this point I would mostly do post-hoc investigation on any interesting ASVs that pop up in your analyses. It is worth double-checking taxonomic assignments with wide spectrum BLAST as well, in case it is a taxonomic misassignment.
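The taxonomy-based follow-up mentioned above could look something like the sketch below: list the ASVs that passed the contaminant threshold but belong to genera you would not expect in the sample, so they can be checked by hand (for example with BLAST). The function name `flag_for_review`, the record layout, the list of skin-associated genera, and the example values are all illustrative assumptions, not part of decontam.

```python
# Assumed, illustrative list of skin-associated genera; adapt to your study.
SUSPECT_GENERA = {"Staphylococcus", "Cutibacterium", "Corynebacterium"}


def flag_for_review(asvs, suspect_genera=SUSPECT_GENERA, threshold=0.1):
    """Return ids of ASVs that were NOT called contaminants but look suspicious.

    asvs: list of dicts with keys "id", "genus" and "score", where an ASV is
          called a contaminant when its score is below the threshold.
    An ASV is flagged when its score is at or above the threshold (it passed)
    and its genus is in suspect_genera. Missing scores (None) are skipped.
    """
    flagged = []
    for asv in asvs:
        score = asv["score"]
        passed = score is not None and score >= threshold
        if passed and asv["genus"] in suspect_genera:
            flagged.append(asv["id"])
    return flagged


# Hypothetical table, for illustration only.
example_asvs = [
    {"id": "ASV_a", "genus": "Staphylococcus", "score": 0.96},
    {"id": "ASV_b", "genus": "Synechococcus", "score": 0.03},
    {"id": "ASV_c", "genus": "Alteromonas", "score": 0.88},
    {"id": "ASV_d", "genus": "Cutibacterium", "score": 0.02},
]

print(flag_for_review(example_asvs))
```

The sketch only produces a review list and removes nothing. The opposite case, an expected marine taxon with a low score such as ASV_b here, would need a similar check against a list of expected genera.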

Sorry, I expressed myself poorly. So you recommend running the decontamination using only the extraction negatives?

Either just the extraction negatives, or pooling the PCR negatives and extraction negatives together. I don't have a strong recommendation of one approach over the other given the numbers/design you described. In general though, more extraction negatives and fewer PCR negatives (which only capture part of the contamination process) is better.
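The two options differ only in which samples get marked as negative controls. As a minimal sketch, assuming each sample carries a control-type label (the labels `sample`, `extraction_negative` and `pcr_negative`, and the function name `negative_mask`, are hypothetical), the per-sample TRUE/FALSE vector that gets passed as the `neg` argument could be built like this:

```python
def negative_mask(control_types, pool_pcr=True):
    """Return one boolean per sample: True if it counts as a negative control.

    control_types: list of labels, one per sample, in sample order.
    pool_pcr: if True, PCR negatives are pooled with extraction negatives;
              if False, only extraction negatives are treated as negatives.
    """
    negative_labels = {"extraction_negative"}
    if pool_pcr:
        negative_labels.add("pcr_negative")
    return [label in negative_labels for label in control_types]


# Hypothetical sample sheet, for illustration only.
control_types = ["sample", "extraction_negative", "sample", "pcr_negative"]

print(negative_mask(control_types, pool_pcr=True))
print(negative_mask(control_types, pool_pcr=False))
```

With `pool_pcr=False`, the PCR negative is simply labeled as a non-negative here. In practice you would probably drop it from the table before scoring rather than let it count as a real sample.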

lauraDRH commented 1 year ago

Thank you so much for your advice, I will follow it :)