Open fanli-gcb opened 7 years ago
> What to do if DNA quant is undetectable? In our experience this happens quite often on instruments like Qubit or TapeStation. Since concentrations are log-transformed, would it make sense to set undetectable values to something like 0.001?
This is something we should probably think about more, and we are very interested in any wisdom on what works particularly well. What I have done to this point is set "undetectable" to the minimum detection value of the instrument. Another option would be to exclude "undetectable" samples from the frequency-method model fitting.
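A minimal sketch of the detection-limit substitution described above (the function name and the example limit of 0.01 are illustrative assumptions, not part of decontam; check the actual lower limit of your quantification kit):

```python
import numpy as np

def impute_undetectable(conc, detection_limit):
    """Replace undetectable (NaN or zero) DNA concentrations with the
    instrument's minimum detection value, so the log-transform is defined.

    conc: measured concentrations; NaN or 0 marks 'undetectable'.
    detection_limit: smallest concentration the instrument can report
                     (0.01 here is a made-up example value).
    """
    out = np.asarray(conc, dtype=float).copy()
    undetected = np.isnan(out) | (out <= 0)
    out[undetected] = detection_limit
    return out

concs = impute_undetectable([5.2, 0.0, float("nan"), 1.3], detection_limit=0.01)
log_concs = np.log(concs)  # now safe: no log(0) or log(NaN)
```

The alternative mentioned above, excluding undetectable samples from the model fit, would instead drop the `undetected` rows rather than imputing them.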
> How should different negatives be handled, e.g. extraction blanks versus PCR blanks? Typically we treat them identically for the purposes of the similar ad-hoc filtering we've used in our processing.
Right now we simply strongly recommend that your negative controls go through the extraction step, as extraction is a clear source of reagent contamination. We have no guidance at this point on what to do with multiple types of controls, as we haven't done the necessary testing there.
> For low-biomass samples, is there any alternative to using only prevalence? For example, we've looked at an SV's frequency in blanks divided by its total frequency: say an SV has 10,000 reads total, of which 9,000 are derived from blanks; that would be a contaminant. The distribution of this fraction is also bimodal, so it's relatively easy to draw a line somewhere in the middle (see the attached plot):
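The blank-fraction heuristic described in the quote can be sketched as follows (function name and the 0.5 cutoff are illustrative assumptions; this is the asker's ad-hoc statistic, not something decontam computes):

```python
import numpy as np

def blank_fraction(counts, is_blank):
    """Per-SV fraction of reads derived from blank/negative-control samples.

    counts: (n_samples, n_svs) read-count matrix.
    is_blank: boolean array of length n_samples marking blank samples.
    """
    counts = np.asarray(counts, dtype=float)
    is_blank = np.asarray(is_blank, dtype=bool)
    blank_reads = counts[is_blank].sum(axis=0)
    total_reads = counts.sum(axis=0)
    # Avoid 0/0 for SVs with no reads at all.
    return np.divide(blank_reads, total_reads,
                     out=np.zeros_like(total_reads), where=total_reads > 0)

# Toy example: 3 samples (the last is a blank), 2 SVs.
counts = [[500, 100],
          [500,   0],
          [  0, 900]]
frac = blank_fraction(counts, is_blank=[False, False, True])
# Threshold drawn in the gap of the bimodal distribution (0.5 is arbitrary here).
contaminant = frac > 0.5
```

In the toy data, SV 2 gets 900 of its 1,000 reads from the blank and is flagged, matching the 9,000-of-10,000 example above.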
The frequency approach kind of works, as suggested by your plot, but it falls short in one area: the most extreme scores are not the most likely non-contaminants. For prevalence, the more often you see a sequence in real samples and the less often in negative controls, the more likely it is a non-contaminant. But for frequency that breaks down: an extremely flat relationship between frequency and DNA concentration tends to reflect sequences that happen to fall exactly on a flat line by chance, rather than the "strongest" non-contaminants.
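To make that caveat concrete, here is a much-simplified sketch of the idea behind a frequency-based test: in log-log space, compare a contaminant model (frequency inversely proportional to DNA concentration, slope fixed at -1) against a flat non-contaminant model (slope 0). This is not decontam's actual statistic; it only illustrates why a near-perfectly flat fit can arise by chance:

```python
import numpy as np

def frequency_score(freqs, concs):
    """Toy frequency-based contaminant score (illustrative only).

    Fits two one-parameter models to an SV's relative frequency vs. DNA
    concentration in log-log space:
      contaminant:     log(freq) = b - log(conc)   (slope fixed at -1)
      non-contaminant: log(freq) = b               (flat line, slope 0)
    Returns SS_contam / (SS_contam + SS_noncontam): values near 0 favor
    the contaminant model, values near 1 the non-contaminant model.
    """
    lf, lc = np.log(freqs), np.log(concs)
    resid_contam = (lf + lc) - np.mean(lf + lc)  # best intercept for slope -1
    resid_flat = lf - np.mean(lf)                # best intercept for slope 0
    ss_c, ss_n = np.sum(resid_contam**2), np.sum(resid_flat**2)
    return ss_c / (ss_c + ss_n)

score_contam = frequency_score([0.4, 0.2, 0.1], [1.0, 2.0, 4.0])  # freq ~ 1/conc
score_flat = frequency_score([0.2, 0.2, 0.2], [1.0, 2.0, 4.0])    # flat line
```

A perfectly flat sequence scores 1.0 here, but with few samples a sequence can land on a flat line by luck, so an extreme score does not guarantee the "strongest" non-contaminant.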
> @benjjneb great method, really appreciate that we're not alone in caring about this!
Thanks! Hope it is helpful, and we are very interested in hearing where things work and don't, as we will continue to iterate with user feedback like we did with the dada2 package.
> The frequency approach kind of works, as suggested by your plot, but it falls short in one area: the most extreme scores are not the most likely non-contaminants. For prevalence, the more often you see a sequence in real samples and the less often in negative controls, the more likely it is a non-contaminant. But for frequency that breaks down: an extremely flat relationship between frequency and DNA concentration tends to reflect sequences that happen to fall exactly on a flat line by chance, rather than the "strongest" non-contaminants.
Sorry, I don't think I understand. Are you saying that the most extreme frequencies (i.e., the bar on the far left) don't necessarily correspond to the strongest non-contaminants? Why would that be?
Thanks for the insightful comments on the other points; we'll play around some more with a few different datasets and share results here.