dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

K>12 results in too many distances being 0 #40

Closed: censix closed this issue 4 years ago

censix commented 4 years ago

Hi, I have been using 'dashing' since November and am quite impressed. I mainly calculate full symmetric distance matrices for large datasets downloaded from GenBank, i.e. the 'plant' or 'fungi' clades. I use a k-mer length of k=12 or k=13 to get reasonable distance matrices: whenever I increase k beyond 15, the number of distance values that are 0 becomes too large. Likewise, when k gets much smaller, the number of distance values that are 1 becomes too large. Running with k=31 is entirely out of the question. So I am asking myself: is this to be expected, or is it due to some error I made during setup? I am using the precompiled binaries. I'd be glad for a hint. Thanks and cheers

dnbaker commented 4 years ago

Hi censix,

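You've discovered what is, to a large extent, the key issue in k-mer based comparisons. If k is too large, the k-mers are too specific and even related sequences share very few of them; if k is too small, they are too sensitive and even unrelated sequences share most of them.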

It's not a problem with your usage; it's a fundamental issue with how these methods work. We do have some methods we're working on for generalizing these comparisons, but for now, you'll have to experiment with different values of k for each set of sequences.
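To make the trade-off concrete, here is a minimal Python sketch of a k sweep. It is not how dashing computes its values: it uses exact k-mer sets and the exact Jaccard index rather than HyperLogLog estimates, and it ignores reverse complements. The sequences are synthetic, and the function names (`kmers`, `jaccard`, `mutate`, `sweep`), the 5% substitution rate, and the sequence length are all assumptions chosen for illustration.

```python
import random

def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard index |A & B| / |A | B| of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def mutate(seq, rate, rng):
    """Replace each base by a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def sweep(seq_a, seq_b, ks):
    """Map each k to the Jaccard index of the two sequences' k-mer sets."""
    return {k: jaccard(kmers(seq_a, k), kmers(seq_b, k)) for k in ks}

# Synthetic data (hypothetical, for illustration only).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
related = mutate(genome, 0.05, rng)   # same sequence with ~5% substitutions
unrelated = "".join(rng.choice("ACGT") for _ in range(20000))

ks = [4, 8, 12, 16, 21, 31]
related_j = sweep(genome, related, ks)
unrelated_j = sweep(genome, unrelated, ks)

print(" k  related  unrelated")
for k in ks:
    print(f"{k:2d}  {related_j[k]:7.4f}  {unrelated_j[k]:9.4f}")
```

On data like this, very small k should give similarities near 1 for both pairs, because almost every possible k-mer occurs in both sequences, while large k should drive the unrelated pair to 0 and shrink the related pair's similarity as well. The useful range is where the two columns are well separated, and it shifts with sequence length and divergence, which is why k has to be tuned per dataset.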

Good luck, and let me know if you have any more questions.

censix commented 4 years ago

Thanks very much for the clarification. I somehow suspected that this was something inherent to the k-mer method. Good to know it's not due to my ignorance. Keep up the good work!