Does kat cold exclude low frequency k-mers?

malonge commented 5 years ago

Hi there,

I want to make sure I understand how kat cold works. My understanding is that it first builds a database of k-mers from the provided reads. Then, for each k-mer in a given contig, it checks whether that k-mer is present in the database. The number of contig k-mers found in the database is reported as "non_zero_kmers". Is that correct?

If so, does kat remove low-frequency k-mers from the provided reads, since those are most likely error k-mers?

Thanks

bjclavijo commented 5 years ago

Hi Michael,

You are correct in your explanation, but remember that it is not only presence that is checked but also count. The non_zero_kmers are just that: k-mers that have a count > 0 in the hash.

There is no low-frequency cutoff. In theory it should not matter much for the median, which is used to produce the plot, unless you have a massive number of error k-mers being incorporated into your assembly. That should be checked first with KAT comp. In general, the philosophy of KAT is to not make judgments about thresholds and the like, since most of the information is there in the raw frequency counts.
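To make the logic above concrete, here is a minimal Python sketch of the idea, not KAT's actual code. The function names (`kmers`, `count_read_kmers`, `contig_stats`) are made up for illustration, a plain `Counter` stands in for the k-mer hash, and the sketch assumes that reverse complements are ignored and that the median is taken over every k-mer in the contig, zeros included.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_read_kmers(reads, k):
    """Count every k-mer in the reads. No low-frequency cutoff is applied."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def contig_stats(contig, read_counts, k):
    """Look up each contig k-mer in the read counts and summarise."""
    # A Counter returns 0 for k-mers that never appeared in the reads.
    coverages = [read_counts[km] for km in kmers(contig, k)]
    return {
        "kmers": len(coverages),
        # Assumption: "non-zero" means a read count > 0, as described above.
        "non_zero_kmers": sum(1 for c in coverages if c > 0),
        # Assumption: median over all contig k-mers, including zero counts.
        "median": median(coverages) if coverages else None,
    }


reads = ["ACGTAC", "CGTACG", "ACGTTT"]
read_counts = count_read_kmers(reads, k=3)
print(contig_stats("ACGTAGG", read_counts, k=3))
```

In this toy example a few singleton k-mers (such as `GTT` and `TTT`) stay in the counts, but the median for the contig only moves if a large share of the contig's own k-mers are that rare, which is why a cutoff is not needed for the summary.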

I hope that solves the question, if not please write to me via email: Bernardo.clavijo (at) Earlham.ac.uk . I'll close this issue as it is more of a question really. I'm kinda curious in which scenario you're using KAT cold as it is a relatively obscure tool that has been mostly used internally at EI, so if you can, drop me a line :-)

malonge commented 5 years ago

Thank you!

TGAC / KAT, issue #129