benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Optimal data for the frequency method? #33

Open mniku opened 5 years ago

mniku commented 5 years ago

I’m slightly uncertain which DNA measures are optimal for the frequency-based method (especially for samples containing animal tissue, where the proportion of microbial to animal DNA is often small and/or highly variable):

Obviously, only the qPCR data reflects the actual amount of starting material. On the other hand, there are many steps between that measurement and the final sequence data, so the DNA amounts that finally go into sequencing can be completely different.

How should we evaluate the applicability of the frequency-based method in specific cases? For example, how high can read counts in negative controls be, relative to actual samples, and still be acceptable for the statistics?

benjjneb commented 5 years ago

We believe that both types of DNA quantitation data will work. We have done more testing with the DNA concentration measured post-PCR and prior to sequencing, simply because that data is more often available: it is generated "for free" as part of the usual sequencing workflow. In more limited testing on qPCR data the method still seems to work, and other publications report that contaminant frequency is strongly inversely related to qPCR-measured DNA concentration, which is the pattern the frequency method relies on.
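To make the "inverse frequency" pattern concrete, here is a minimal Python sketch. It is not decontam's actual statistic, only an illustration of the idea: on a log-log scale, a contaminant's frequency should fall with slope close to -1 as DNA concentration rises, while a genuine sample sequence should stay roughly flat (slope 0). The function names and the numbers are made up for illustration.

```python
import math

def log_residual_ss(freqs, concs, slope):
    """Residual sum of squares for log(freq) = intercept + slope * log(conc),
    with the slope held fixed and the intercept fitted by least squares."""
    offsets = [math.log(f) - slope * math.log(c) for f, c in zip(freqs, concs)]
    intercept = sum(offsets) / len(offsets)  # best intercept for a fixed slope
    return sum((o - intercept) ** 2 for o in offsets)

def looks_like_contaminant(freqs, concs):
    """True if the slope -1 (contaminant-like) model fits better than slope 0."""
    ss_contaminant = log_residual_ss(freqs, concs, slope=-1.0)
    ss_sample = log_residual_ss(freqs, concs, slope=0.0)
    return ss_contaminant < ss_sample

# Made-up data: DNA concentrations (arbitrary units) for five samples.
concs = [1.0, 2.0, 5.0, 10.0, 50.0]
contaminant_freqs = [0.2 / c for c in concs]   # frequency falls as 1/concentration
sample_freqs = [0.10, 0.12, 0.09, 0.11, 0.10]  # roughly constant frequency

print(looks_like_contaminant(contaminant_freqs, concs))
print(looks_like_contaminant(sample_freqs, concs))
```

Either kind of concentration measurement can be passed as `concs`; what matters for this sketch is only that it tracks how much real sample DNA there was relative to a roughly constant contaminant background.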

> How should we evaluate the applicability of the frequency-based method in specific cases? For example, how high can read counts in negative controls be, relative to actual samples, and still be acceptable for the statistics?

The simplest and most useful evaluation is to inspect the distribution of scores assigned by the method. The expectation is that there will be a strong mode at low scores. In the cleanest cases the distribution will be clearly bimodal, while in other datasets the high-score mode is wider and more diffuse. Either way, the low-score mode should be there, and it should be used to set the P* score threshold for identifying contaminants.
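As a rough illustration of that inspection, the sketch below bins a list of scores and prints a text histogram. It assumes the scores have already been exported from R as plain numbers between 0 and 1; the score values and the function names here are invented for the example.

```python
def score_histogram(scores, n_bins=10):
    """Count how many scores fall into each of n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return counts

def print_histogram(counts):
    """Print one line per bin, with one '#' per score in that bin."""
    n_bins = len(counts)
    for i, count in enumerate(counts):
        lo, hi = i / n_bins, (i + 1) / n_bins
        print(f"{lo:.1f}-{hi:.1f} | {'#' * count}")

# Invented scores: a tight low-score mode plus a wider, diffuse high-score mode.
scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.35, 0.52, 0.58,
          0.63, 0.71, 0.74, 0.82, 0.91, 0.96]

counts = score_histogram(scores)
print_histogram(counts)
```

In this made-up case the low-score mode sits entirely in the first bin, with a gap before the diffuse high-score mode, so a threshold placed in that gap would separate the two groups.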

Another method is to simply inspect a few of the identified contaminants using the plot_frequency function.

mniku commented 5 years ago

Thanks, this is now clear!

rturba commented 4 years ago

How would the pre-pooling DNA concentration be valuable if samples are pooled to be equimolar? Would using that data still make sense? I ran my samples together with other people's samples, which have different equimolar concentrations. In that case, would that be more valuable data to use? Thank you!

benjjneb commented 4 years ago

@rturba

> How would the pre-pooling DNA concentration be valuable if samples are pooled to be equimolar? Would using that data still make sense?

Yes. Pre-pooling DNA concentrations still track the fraction of the sample reads that derive from the sample vs. from contaminants.

> I ran my samples together with other people's samples, which have different equimolar concentrations. In that case, would that be more valuable data to use?

Probably not. For identifying contaminants, you want to use samples that share the same sample preparation history. Samples that were prepared differently will typically just add noise.
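One way to act on this is to split the sample metadata by preparation history before running any contaminant identification, so that each analysis only sees samples that were processed together. The sketch below assumes a hypothetical `prep_batch` field in the metadata; the field name, the sample IDs and the function name are all illustrative.

```python
from collections import defaultdict

def group_by_prep_batch(samples):
    """Map each preparation batch to the IDs of the samples prepared in it."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["prep_batch"]].append(sample["sample_id"])
    return dict(groups)

# Illustrative metadata for one sequencing run shared between two labs.
samples = [
    {"sample_id": "S1", "prep_batch": "my_lab"},
    {"sample_id": "S2", "prep_batch": "my_lab"},
    {"sample_id": "X1", "prep_batch": "other_lab"},
]

groups = group_by_prep_batch(samples)
own_samples = groups["my_lab"]  # analyse only these together
print(own_samples)
```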