Open EisenRa opened 5 years ago
Hi Raphael, sorry for the delayed reply. We got those numbers by plotting a histogram of scores for sequences from low-biomass samples that contained both authentic (high-complexity) and likely contaminant (low-complexity) DNA. The histogram was bimodal, with one mode at the low end and one at the high end. We chose 0.55 as the threshold because it was the midpoint between the two modes and separated them best.
I would suggest you do the same to validate that threshold or choose a new one. I suspect that 0.55 would still be a reasonable number since we normalized for sequence length, but it's always good to check. Let me know if that makes sense.
I also like that feature idea, though my bandwidth is quite constrained these days. If you make it a separate feature request I can keep it in my queue.
Best, Erik
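The procedure Erik describes (bin the scores, locate the two modes, take their midpoint) could be sketched roughly as follows. This is not the script used to derive 0.55; it is a minimal, standard-library-only illustration. The function names `histogram`, `midpoint_threshold` and `render`, the bin count, and the `min_sep` heuristic for finding the second mode are all assumptions, and the demo data are synthetic stand-ins rather than real scores.

```python
import random


def histogram(scores, n_bins=20, lo=0.0, hi=1.0):
    """Count scores into n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for s in scores:
        if lo <= s <= hi:
            # Clamp so that s == hi falls into the last bin.
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    return counts


def midpoint_threshold(counts, lo=0.0, hi=1.0, min_sep=5):
    """Midpoint between the centres of the two tallest, well-separated bins."""
    width = (hi - lo) / len(counts)
    first = max(range(len(counts)), key=counts.__getitem__)
    # Second mode: tallest bin at least min_sep bins away from the first.
    # (Assumed heuristic; needs more than min_sep bins to work.)
    far = [i for i in range(len(counts)) if abs(i - first) >= min_sep]
    second = max(far, key=counts.__getitem__)
    centres = [lo + (i + 0.5) * width for i in (first, second)]
    return sum(centres) / 2


def render(counts, lo=0.0, hi=1.0, bar_width=50):
    """Return a plain-text histogram, one line per bin."""
    width = (hi - lo) / len(counts)
    tallest = max(counts) or 1
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(bar_width * c / tallest)
        lines.append(f"{lo + i * width:4.2f} | {bar} {c}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Synthetic stand-in data, NOT real scores: two clusters near 0.3 and 0.8.
    rng = random.Random(0)
    scores = [rng.gauss(0.3, 0.05) for _ in range(500)]
    scores += [rng.gauss(0.8, 0.05) for _ in range(500)]
    counts = histogram(scores)
    print(render(counts))
    print("midpoint threshold:", round(midpoint_threshold(counts), 3))
```

To try this on real data, replace the synthetic `scores` list with one score per sequence. If the two modes overlap heavily, the midpoint is only a starting point and the histogram itself should be inspected by eye.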
Dear Erik,
No problem, thanks for responding! That makes sense. Do you have a script/markdown available to generate such a histogram, so that I don't have to reinvent the wheel?
I'll make a feature request.
Thanks, Raphael
Dear Erik,
Thanks for writing this tool!
I have a quick question regarding the use of this tool for ancient DNA data. Some of my work is on degraded DNA, whose fragment lengths are typically log-normally distributed with a mode of ~50 bp. You mentioned that a k of 4 and a threshold of 0.55 work well for 64-120 bp sequences, and I am wondering if you've tested shorter sequences (30-64 bp)?
If not, I'll have a play around with some of my data and get back to this thread.
Additionally, a feature that may be useful is the ability to provide an output file for the filtered sequences (I can make a feature request if you think it's worthwhile).
Cheers, Raphael
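To make the short-read question concrete, here is a small sketch of one way a length-normalized k-mer score can behave across read lengths. The thread does not define the tool's score, so the definition used here (number of distinct k-mers divided by sequence length) is an assumption, and `kmer_complexity` and `max_score` are hypothetical names, not the tool's API.

```python
def kmer_complexity(seq, k=4):
    """Distinct k-mers divided by sequence length (an ASSUMED definition)."""
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers) / len(seq)


def max_score(length, k=4):
    """Upper bound on the assumed score for a read of the given length."""
    windows = max(length - k + 1, 0)
    # A read cannot contain more distinct k-mers than windows, nor more
    # than the 4**k k-mers that exist over the alphabet A, C, G, T.
    return min(windows, 4 ** k) / length


if __name__ == "__main__":
    print(kmer_complexity("A" * 50))      # homopolymer: a single 4-mer
    print(kmer_complexity("ACGT" * 10))   # short repeat: four 4-mers
    for length in (30, 50, 64, 120):
        print(length, round(max_score(length), 3))
```

Under this assumed definition the highest attainable score drops slightly as reads get shorter (27/30 for a 30 bp read versus 117/120 for a 120 bp read), which is one reason to re-check the 0.55 threshold on 30-64 bp data rather than take it as given.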