benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Clarification on Batch.combine and when to use certain parameters #79

Closed hbussan closed 4 years ago

hbussan commented 4 years ago

Hello,

I am working through a dataset from two separate MiSeq runs and three PCR plates. The PCR protocol is the same across plates, and we sequenced them on an in-house MiSeq. Should I define batches by PCR plate or by MiSeq run? When I ran them without batch analysis, with batch analysis by MiSeq, and then by plate, the number of contaminants by prevalence and combined methods didn't vary by much (or at all). The number of contaminants identified by the frequency method at threshold 0.5 changed quite a bit (5-40 sequences), but based on my p.freq values, a frequency threshold of 0.5 is too high for my sequences. Based on what I have read, that can be a sign that I have a lot of low-abundance sequences or that I have cross-contamination, correct?

I have been using the default batch.combine=minimum. When should one use batch.combine fisher or product?

Thank you for your time and for decontam!

benjjneb commented 4 years ago

When I ran them without batch analysis, with batch analysis by MiSeq, and then by plate, the number of contaminants by prevalence and combined methods didn't vary by much (or at all).

That's good: it means you can probably choose either option and get much the same result.

Based on what I have read, that can be a sign that I have a lot of low-abundance sequences or that I have cross-contamination, correct?

Yes, both of those mechanisms can produce middling scores (~0.5). Typically it's recommended to use a lower threshold that separates out the clearly low-score mode in the distribution of your scores.
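One way to look for that low-score mode is to bin the scores and eyeball where the cluster near zero ends. The following is a minimal sketch in Python, not part of decontam: the function name `score_histogram` and the score values are hypothetical and stand in for a real score column such as p.freq.

```python
def score_histogram(scores, n_bins=10):
    """Count scores in equal-width bins over [0, 1] and return the counts."""
    counts = [0] * n_bins
    for s in scores:
        # Clamp the index so a score of exactly 1.0 falls in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical scores: a small cluster near 0 and a broad bulk from ~0.45 up.
scores = [0.01, 0.02, 0.03, 0.04, 0.45, 0.52, 0.58, 0.65, 0.75, 0.85, 0.92, 0.97]

counts = score_histogram(scores)
n_bins = len(counts)
for i, c in enumerate(counts):
    # Text histogram: one '#' per score in the bin.
    print(f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f} | {'#' * c}")
```

In made-up data like this, the gap between the cluster near zero and the rest suggests a threshold somewhere inside the gap, well below 0.5. Real score distributions are rarely this clean, so this is a visual aid and not a decision rule.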

I have been using the default batch.combine=minimum. When should one use batch.combine fisher or product?

I don't have good guidance on this. I think minimum is usually a good choice. I would consider the others in very large studies with large numbers of batches (e.g. an 80-run study), to avoid one spuriously small score driving removal of a legitimate sequence.
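To make the concern about a single spuriously small score concrete, here is a rough sketch in Python of the three combination rules as their names suggest them. It assumes that "fisher" refers to Fisher's method for combining p-values; this thread does not describe how decontam implements any of these options, so the function names and the example scores below are illustrative only.

```python
import math


def combine_minimum(scores):
    """Take the smallest per-batch score."""
    return min(scores)


def combine_product(scores):
    """Multiply the per-batch scores together."""
    return math.prod(scores)


def combine_fisher(scores):
    """Fisher's method: X = -2 * sum(ln p), compared to chi-square with 2k df."""
    k = len(scores)
    half_x = -sum(math.log(p) for p in scores)  # this is X / 2
    # Closed-form chi-square upper tail for even degrees of freedom 2k:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical 80-batch study: 79 middling scores and one spuriously small one.
scores = [0.5] * 79 + [0.001]

print("minimum:", combine_minimum(scores))
print("product:", combine_product(scores))
print("fisher: ", combine_fisher(scores))
```

With these made-up inputs the minimum is driven entirely by the one outlier, while Fisher's method weighs it against the 79 unremarkable batches and returns a large combined value. A plain product of values between 0 and 1 can never exceed the minimum, so in this sketch it is Fisher's method that tempers the outlier.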

hbussan commented 4 years ago

Thanks for your quick response!