benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Working with low bio mass placental samples #57

Open toluayo opened 4 years ago

toluayo commented 4 years ago

Thanks a lot for the decontam package. I am extremely new to programming, and I tried out decontam.

I started with the isContaminant function.

1) The first graph is a histogram, which influenced my choice of a threshold of 0.5.

image

2) The second graph shows the prevalence of the decontam-identified taxa in the true samples and in the negative controls, with the threshold set at 0.5. There is no clear-cut separation, which is my challenge. Do you think this threshold is appropriate, or is it too aggressive? I have tried searching for a reproducible example for isNotContaminant, but I haven't come across any that works with my dataset.

image

3) The third graph shows the library size. The outliers (top right corner) are prominent because they are part of a different sequencing run. In addition, the controls do not have a lower library size, as would be expected, but this is probably because my samples are definitely low biomass and shouldn't have a lot of microbes, or any at all.

image

4) Even after removing the taxa identified as contaminants, a quick look at what is left still shows some phyla that are known to be contaminants. That is why I think it may be better for me to use isNotContaminant. What do you think?

benjjneb commented 4 years ago

I do think in this case (placental microbiome, perhaps there are no true bacterial sequences) you probably want to use isNotContaminant. In essence, that function puts the burden of proof on each sequence/OTU to show that it is not a contaminant -- the default assumption is that most are contaminants.

The challenge here is that your number of controls is low for doing rigorous non-contaminant identification using the prevalence method, so you are somewhat underpowered for detecting non-contaminants. It's probably the best you can do though.

ScaonE commented 4 years ago

Dear all,

My topic is very close to the one presented above by @toluayo (16S sequencing on placental samples), so I am dropping a quick question in here, @benjjneb.

I will try to apply decontam to my dataset very soon. I have 3 negative controls and 36 samples. Is the prevalence method going to be underpowered (not enough negative controls)?

Edit: Some additional details: the dataset consists of 4 placentas × 3 sampling areas per placenta × 3 extraction methods => 36 samples. We have 1 negative control per extraction method => 3 negative controls. At the moment, the number of classified reads (without using decontam) at the end of the workflow is roughly the same in samples and in negative controls.

I plan to account for the extraction batch effect using a blocking factor. (Or should I first run an ordination method to see whether samples cluster by extraction method? If they don't, maybe I don't have to use a blocking factor.) I will try to detect candidate non-contaminant OTUs using isNotContaminant.

Edit 2: It seems that the isNotContaminant function doesn't have a batch argument. Can we take batch effects into account using isNotContaminant?

benjjneb commented 4 years ago

@ScaonE Unfortunately I think that isNotContaminant is going to be too underpowered to achieve what you want in this dataset given there are just 3 negative controls. When looking at really low biomass samples, to powerfully detect non-contaminants, it is necessary to have a significant number of controls to be able to identify the full spectrum of contaminants, as there tends to be a fair amount of variation in the specific contaminants found in each negative control sample in practice. You may want to try an alternative approach, which could be more ad hoc but is likely to be more appropriate given this study design.

Also, you are correct that batch is not supported in isNotContaminant. There are some conceptual reasons why it's a bit more complicated to include batch information in determining non-contaminants than vice versa, hence the lack of support for that option.
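To get a rough feel for the power concern raised above, here is a minimal Python sketch. It is not decontam's code, and decontam's own prevalence scores may be computed differently; it only evaluates a generic one-sided Fisher's exact test on a presence/absence table. The function name and the counts (a taxon seen in 18 of 36 samples and in none of the controls, plus a 12-control comparison) are hypothetical and chosen purely for illustration.

```python
from math import comb


def fisher_one_sided(pos_samples, n_samples, pos_controls, n_controls):
    """One-sided Fisher's exact p-value (hypergeometric lower tail).

    Probability of seeing the taxon in at most `pos_controls` of the
    `n_controls` controls if presence were unrelated to whether a
    library is a true sample or a negative control.
    """
    total = n_samples + n_controls
    present = pos_samples + pos_controls  # libraries containing the taxon
    denom = comb(total, n_controls)
    p = 0.0
    for k in range(pos_controls + 1):
        # k controls contain the taxon, n_controls - k do not
        p += comb(present, k) * comb(total - present, n_controls - k) / denom
    return p


# Hypothetical taxon: present in 18 of 36 samples, absent from every control.
p_three = fisher_one_sided(18, 36, 0, 3)    # 3 negative controls
p_twelve = fisher_one_sided(18, 36, 0, 12)  # 12 negative controls (assumed)

print(f"3 controls:  p = {p_three:.4f}")
print(f"12 controls: p = {p_twelve:.4f}")
```

With only three controls, a taxon that is never seen in any control but occurs in half of the samples still gets a p-value of about 0.15 in this toy setting, so the controls alone cannot provide strong evidence that it is a non-contaminant. The same pattern with more controls gives a much smaller p-value.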

toluayo commented 4 years ago

Will direct subtraction of all OTUs found in the controls be a sufficient alternative, considering the study design, or can isContaminant still be used with both the prevalence and frequency methods?

benjjneb commented 4 years ago

> Will direct subtraction of all OTUs found in the controls be a sufficient alternative, considering the study design, or can isContaminant still be used with both the prevalence and frequency methods?

You can try using isContaminant with the frequency method, and it should help to remove some of the contaminants. Removing all OTUs/ASVs found in the negative controls is another option.

Nothing is going to be perfect though. Placenta samples are very low biomass, maybe often zero bacterial biomass. This is a case where you really want extensive negative control data.
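The "remove everything seen in the negative controls" option mentioned above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not part of decontam: the count table, the sample names, the OTU names, and both helper functions are hypothetical.

```python
# Hypothetical toy count table: one dict of OTU -> read count per library.
counts = {
    "placenta_1": {"otu_a": 120, "otu_b": 0, "otu_c": 15},
    "placenta_2": {"otu_a": 95, "otu_b": 4, "otu_c": 0},
    "neg_ctrl_1": {"otu_a": 80, "otu_b": 0, "otu_c": 0},
}
# Which libraries are negative controls (assumed metadata).
is_control = {"placenta_1": False, "placenta_2": False, "neg_ctrl_1": True}


def otus_in_controls(counts, is_control):
    """Return every OTU with at least one read in any negative control."""
    return {
        otu
        for sample, row in counts.items()
        if is_control[sample]
        for otu, n in row.items()
        if n > 0
    }


def subtract_control_otus(counts, is_control):
    """Drop control-associated OTUs and keep only the true samples."""
    flagged = otus_in_controls(counts, is_control)
    return {
        sample: {otu: n for otu, n in row.items() if otu not in flagged}
        for sample, row in counts.items()
        if not is_control[sample]
    }


flagged = otus_in_controls(counts, is_control)
filtered = subtract_control_otus(counts, is_control)
print("Removed OTUs:", sorted(flagged))
print("Filtered table:", filtered)
```

Note that this rule removes an OTU even if it appears as a single read in a single control, so it is blunt in both directions: it can discard real taxa that cross-contaminated a control, and it keeps any contaminant that happened not to show up in the few controls available.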

ScaonE commented 4 years ago

Thanks for your input @benjjneb

tabaresr commented 4 years ago

Dear all, I did amplicon sequencing (16S) on woodchip bioreactors that were used to clean water contaminated with pesticides. I am very new to the field and I am still learning how to analyze the data.

I used the decontam package, but I am struggling a bit to understand the results. It seems I don't have a clear separation when I plot the prevalence of the decontam taxa in the true samples and in the negative controls.

Could you guide me a little more on what could be happening with the data? Why is it that I don't have a clear separation between the true samples and the control samples? Why is the number of reads in some of the control samples higher than in some true samples, when it should be the opposite?

Thank you very much.

image (Sample_or_Control)

image (f2)

benjjneb commented 4 years ago

@tabaresr Could you open up a new issue? This looks like a pretty different question, so a new issue would be appropriate.

tabaresr commented 4 years ago

I am sorry, @benjjneb, I am new to this platform. I opened a new issue: Amplicon seqeucning:decontam. Thank you.

On Mon, 11 May 2020 at 20:23, Benjamin Callahan (<notifications@github.com>) wrote:

> @tabaresr Could you open up a new issue? This looks like a pretty different question, so a new issue would be appropriate.
