benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Including controls in frequency method, principle behind the combination method & different types of controls #88

Closed (LoreVE closed this issue 3 years ago)

LoreVE commented 3 years ago

Hi

Thanks a lot for developing this interesting tool! I'd like to start with apologising for the elaborate issue, but I have a few thoughts/questions (after going through the paper and some other issues on github), and I would be interested to hear your opinion about this.

  1. Should negative controls be included for the frequency method?

I would assume yes, because the abundance of contaminants should be highest in the low concentration controls. But on the other hand the assumption S >> C doesn't hold for the controls. So maybe not include them? (Assumption only holds for the controls if there is much cross-contamination in the controls - which I hope is not the case).

But the contaminants identified by the frequency method do not have a high abundance in my controls; on the contrary, they are often not even detected in the negative controls... Does this mean that they are not true contaminants? And that the frequency approach isn't a good method for my data? Or that there are not many contaminants (too sensitive)? NB: The number of sequences gradually increases from score 0 to 0.5 - there is no clear low mode. I do have a large (> 5-fold) range of concentrations. I assumed that if I have both concentration data & sufficient controls the combination method is the best method, but maybe not if there is no clear low mode in my data?

  2. Is the goal of the combination method (Fisher comparison) to correct both individual methods for "false positives"? What I mean is that the other combination methods (both (&) or either (|) option) look at the methods independently (?), but the combination method somehow removes sequences identified by the prevalence or frequency methods?

    • Are frequency-identified contaminants removed in case the frequency score is close to the threshold and the prevalence score is high (not/low prevalent in controls, but highly prevalent in samples)? Might this be batch contamination instead (e.g. contamination found in all samples and the control of one extraction batch)? In that case it might still be useful to remove them (i.e. use the "either" method).
    • Are prevalence-identified contaminants removed in case the frequency in the samples does not follow the contaminant model? What could be the explanation for this? Some type of cross-contamination (e.g. from the sample next to the control to the control)? (Related question: what is the difference between the minimum and the either option?)
  3. I have two types of controls: extraction controls (n = 13) and amplification controls (n = 8, the other 5 were "empty" after a threshold for coverage of the sequence was applied), which are expected to have quite different profiles (the latter having far fewer contaminants than the former), and 303 samples.

    • For the prevalence method, I assume it is best to exclude the amplification controls not to reduce the power to detect contaminants from the extraction (not much is present in the amplification controls)? If I'd run the prevalence method with the extraction controls excluded (which might not make much practical sense, since what's present in the amplification controls should also be present in the extraction controls?), I identify 6 additional contaminants (8 in total). Is this likely cross-contamination instead of external contamination?
    • The frequency method (assuming controls should be included, see 1.) run without the extraction controls also gives a few additional contaminants and misses about 3% of the contaminants identified by the combination and prevalence methods (which identified the same contaminants). So for the frequency method, it might be better to run it separately on both types of controls and combine afterwards (like for the prevalence method), or does the fact that including all controls does not identify some contaminants mean that they are not true (or less likely)...

I obviously understand you don't have time to advise everybody on the best approach for their specific dataset, but I would really appreciate your opinion on these thoughts, so I can make a more informed decision on how to proceed.

Thanks in advance!

benjjneb commented 3 years ago

> I would assume yes, because the abundance of contaminants should be highest in the low concentration controls. But on the other hand the assumption S >> C doesn't hold for the controls. So maybe not include them? (Assumption only holds for the controls if there is much cross-contamination in the controls - which I hope is not the case).

In my experience, no, negative controls shouldn't be included. And in fact the negative controls are ignored during the frequency part of the calculation when using the "combined" method in decontam (although this is also in part to make the two modes of contaminant inference independent of one another).

> But the contaminants identified by the frequency method do not have a high abundance in my controls; on the contrary, they are often not even detected in the negative controls... Does this mean that they are not true contaminants? And that the frequency approach isn't a good method for my data? Or that there are not many contaminants (too sensitive)? NB: The number of sequences gradually increases from score 0 to 0.5 - there is no clear low mode. I do have a large (> 5-fold) range of concentrations. I assumed that if I have both concentration data & sufficient controls the combination method is the best method, but maybe not if there is no clear low mode in my data?

One reason for this could be (I think) that contamination is heterogeneous, and even our best controls don't capture all of it. The counter-argument is that there is some real biological factor that could correspond with the DNA concentrations in your real samples, and its correlation with certain taxa is then just a reflection of that real biological signal. That can't be ruled out and is why decontam should not be used on different types of samples at the same time (e.g. don't mix soil and skin samples together in a single isContaminant call). I will say, the strict definition of the "contaminant model" as having a -1 slope in log space does help to avoid spurious contaminant calls from these sorts of effects.
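To make the "-1 slope in log space" idea concrete, here is a minimal sketch in plain Python. It is not decontam's code: the function names and the toy numbers are made up for illustration. It compares how well two fixed-slope lines fit log(frequency) against log(DNA concentration): slope -1 (contaminant-like) versus slope 0 (frequency independent of concentration).

```python
import math

def sum_sq_residuals(log_freq, log_conc, slope):
    """Residual sum of squares for a line with a FIXED slope.

    Only the intercept is fitted; with the slope fixed, the least-squares
    intercept is the mean of (y - slope * x).
    """
    offsets = [y - slope * x for x, y in zip(log_conc, log_freq)]
    intercept = sum(offsets) / len(offsets)
    return sum((o - intercept) ** 2 for o in offsets)

def compare_models(freq, conc):
    """Return (ss_contaminant, ss_non_contaminant) for one taxon."""
    log_freq = [math.log(f) for f in freq]
    log_conc = [math.log(c) for c in conc]
    ss_contaminant = sum_sq_residuals(log_freq, log_conc, slope=-1.0)
    ss_non_contaminant = sum_sq_residuals(log_freq, log_conc, slope=0.0)
    return ss_contaminant, ss_non_contaminant

# Hypothetical data: DNA concentrations of five true samples (no controls).
conc = [1.0, 2.0, 5.0, 10.0, 50.0]

# A taxon whose frequency is inversely proportional to concentration.
contaminant_like = [0.2 / c for c in conc]
# A taxon whose frequency does not track concentration.
sample_like = [0.05, 0.06, 0.04, 0.05, 0.055]

ss_c_contam, ss_n_contam = compare_models(contaminant_like, conc)
ss_c_sample, ss_n_sample = compare_models(sample_like, conc)
```

For the contaminant-like taxon the slope -1 line fits almost perfectly, while for the sample-like taxon the flat line fits far better. A taxon that merely correlates with concentration, but with a slope well away from -1, fits the contaminant line poorly, which is the protection against spurious calls described above.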

ps: You have a lot of questions! I will come back to this issue again.

benjjneb commented 3 years ago

> Is the goal of the combination method (Fisher comparison) to correct both individual methods for "false positives"? What I mean is that the other combination methods (both (&) or either (|) option) look at the methods independently (?), but the combination method somehow removes sequences identified by the prevalence or frequency methods?

Yes, the goal of the combined method is to "combine" the evidence from the frequency and prevalence modes in order to create a final score on which a decision should be made. Thus, it is possible for the "combined" score to be above the classification threshold (not a contaminant) while one of the individual scores (either prevalence or frequency) is below the threshold and, on its own, would have led to a contaminant classification. A more aggressive approach is to remove sequences identified by either of the prevalence or frequency methods.

In response to your follow-up question after this... yes, the combined method can disagree with the individual prevalence or frequency methods, in either direction. It is making a decision based on combined evidence from both methods (using Fisher's method). As to the specific mode of contamination that could produce a mixed classification between frequency and prevalence modes (or combined), I can speculate endlessly, but without any confidence. Contamination is multi-faceted. Some types of contamination have very strong statistical signals, and decontam will do a good job at removing them. Other types are more challenging.
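A small sketch of how a Fisher-combined score can disagree with the individual scores in either direction. This is plain Python, not decontam's implementation: it assumes the two scores behave like independent p-values, that a score below the threshold means "contaminant", and it uses 0.1 purely as an illustrative threshold. For two p-values, Fisher's statistic -2·ln(p1·p2) follows a chi-square distribution with 4 degrees of freedom, whose upper tail has the closed form used below.

```python
import math

THRESHOLD = 0.1  # illustrative classification threshold (assumption)

def fisher_combine(p_freq, p_prev):
    """Fisher's method for two p-values.

    The chi-square (4 df) upper tail at x is exp(-x/2) * (1 + x/2).
    With x = -2 * ln(p1 * p2) this simplifies to p1*p2 * (1 - ln(p1*p2)).
    """
    product = p_freq * p_prev
    if product == 0.0:
        return 0.0
    return product * (1.0 - math.log(product))

def calls(p_freq, p_prev, threshold=THRESHOLD):
    """Contaminant calls (True = contaminant) under several decision rules."""
    freq_call = p_freq < threshold
    prev_call = p_prev < threshold
    return {
        "frequency": freq_call,
        "prevalence": prev_call,
        "either": freq_call or prev_call,
        "both": freq_call and prev_call,
        "combined": fisher_combine(p_freq, p_prev) < threshold,
    }

# Hypothetical scores: frequency just under the threshold, prevalence high.
borderline_freq = calls(0.08, 0.9)
# Hypothetical scores: two weak signals, each just over the threshold.
two_weak_signals = calls(0.12, 0.12)
```

In the first case the frequency rule alone flags the sequence, but the combined score (about 0.26) does not, while "either" still removes it. In the second case neither individual rule flags the sequence, yet the combined score (about 0.075) falls below the threshold.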

benjjneb commented 3 years ago

> I have two types of controls: extraction controls (n = 13) and amplification controls (n = 8, the other 5 were "empty" after a threshold for coverage of the sequence was applied), which are expected to have quite different profiles (the latter having far fewer contaminants than the former), and 303 samples.

Given your good number of extraction controls, I would use the prevalence method with those controls alone, as that should have the best power to detect contaminants in a lot of normal scenarios. Extraction is an important step at which contaminants are introduced into marker-gene sequencing experiments in most cases. The non-extraction controls will mostly dilute the signal of that part of the contamination.
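The dilution effect can be illustrated with a toy presence/absence test. The sketch below is plain Python and uses a one-sided Fisher's exact test as a stand-in for a prevalence comparison; it is not decontam's code. The group sizes (13 extraction controls, 8 amplification controls, 303 samples) come from the question, but the presence counts are hypothetical.

```python
from math import comb

def one_sided_fisher(controls_present, n_controls, samples_present, n_samples):
    """P-value that a taxon is at least this prevalent in controls by chance.

    Hypergeometric upper tail: the probability of seeing `controls_present`
    or more presences among the controls, given the total number of
    presences across controls and samples.
    """
    total = n_controls + n_samples
    total_present = controls_present + samples_present
    upper = min(total_present, n_controls)
    tail = sum(
        comb(total_present, x) * comb(total - total_present, n_controls - x)
        for x in range(controls_present, upper + 1)
    )
    return tail / comb(total, n_controls)

# Hypothetical taxon: present in 8 of 13 extraction controls,
# in 0 of 8 amplification controls, and in 20 of 303 samples.
p_extraction_only = one_sided_fisher(8, 13, 20, 303)
p_all_controls = one_sided_fisher(8, 13 + 8, 20, 303)
```

Pooling in the amplification controls, which do not contain this extraction-derived taxon, lowers its prevalence among controls and so weakens the evidence (a larger p-value) compared with using the extraction controls alone.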

benjjneb commented 3 years ago

> The frequency method (assuming controls should be included, see 1.) run without the extraction controls also gives a few additional contaminants and misses about 3% of the contaminants identified by the combination and prevalence methods (which identified the same contaminants). So for the frequency method, it might be better to run it separately on both types of controls and combine afterwards (like for the prevalence method), or does the fact that including all controls does not identify some contaminants mean that they are not true (or less likely)...

Honestly, if this is only changing the total contaminants identified by about 3%, I would just take that as success. decontam is about doing as good a job as we can at identifying the most egregious contaminants in typical datasets. But on the margins, we claim no perfection. Contamination is a multi-source, multifaceted problem. decontam can really help, but it won't make things perfect (±3%).

LoreVE commented 3 years ago

Thank you for taking the time to share your thoughts on my questions!

I thought I replied already, but I must have forgotten to hit the comment button back then... I came back to reread the post and noticed the lack of response. I really appreciate your insights! It helped a lot in making the decision on how to proceed!