censix closed this issue 4 years ago
Hi censix,
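You've discovered what is, to a large extent, the key issue in k-mer based comparisons. If k is too large, the comparison is too specific: even closely related genomes share very few exact k-mers. If k is too small, it is too sensitive: unrelated genomes share many k-mers purely by chance.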
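To put rough numbers on the "too small" side, here is a back-of-the-envelope sketch. It assumes a uniformly random genome, which real genomes are not, and estimates the probability that one specific k-mer shows up somewhere in it by chance. The function name and the 100 Mbp genome size are made up for illustration and are not part of dashing.

```python
import math

def chance_kmer_probability(k, genome_length, alphabet_size=4):
    """Probability that one given k-mer occurs at least once in a
    uniformly random genome: 1 - (1 - alphabet_size**-k) ** genome_length.

    Assumption: the number of k-mer positions is approximated by
    genome_length, and positions are treated as independent.
    """
    p_single = alphabet_size ** -k  # chance of a match at one position
    # log1p/expm1 keep the result accurate when p_single is tiny (large k).
    return -math.expm1(genome_length * math.log1p(-p_single))

if __name__ == "__main__":
    genome_length = 100_000_000  # hypothetical 100 Mbp genome
    for k in (8, 12, 13, 15, 21, 31):
        p = chance_kmer_probability(k, genome_length)
        print(f"k={k:2d}  P(chance hit) = {p:.3g}")
```

Under this model, small k gives a probability close to 1, so almost any two large genomes look similar, while at k=31 chance matches are negligible and only truly conserved sequence contributes.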
It's not a problem with your usage; it's a fundamental issue with how these methods work. We do have some methods we're working on for generalizing these comparisons, but for now, you'll have to experiment with different values of k for each set of sequences.
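If you want to see the effect of k in isolation, here is a toy sketch of that kind of experiment. It is not how dashing computes its estimates: it uses exact k-mer sets rather than sketches, on short simulated sequences. The sequence length, the 10% substitution rate, and all helper names are assumptions chosen only for illustration.

```python
import random

def kmer_set(seq, k):
    """All distinct k-mers of seq (forward strand only, no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k):
    """Exact Jaccard similarity of the k-mer sets of sequences a and b."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def mutate(seq, rate, rng):
    """Substitute each base with probability `rate` by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def scan_k(a, b, ks):
    """Jaccard similarity of a and b for each k in ks."""
    return {k: jaccard(a, b, k) for k in ks}

# Simulated data: a random "ancestor" and a copy with 10% substitutions.
rng = random.Random(0)
ancestor = "".join(rng.choice("ACGT") for _ in range(20000))
relative = mutate(ancestor, 0.10, rng)

if __name__ == "__main__":
    for k, j in scan_k(ancestor, relative, [3, 6, 9, 12, 15, 21, 31]).items():
        print(f"k={k:2d}  Jaccard = {j:.3f}")
```

On data like this, the similarity saturates at 1 for very small k and falls toward 0 as k grows, even though the two sequences are equally related throughout. Running the same kind of scan on a few representative genomes from your own dataset is one way to pick a k that spreads the values out.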
Good luck, and let me know if you have any more questions.
Thanks very much for the clarification. I somehow suspected that this was something inherent to the k-mer method. Good to know it's not due to my ignorance. Keep up the good work!
Hi, I have been using 'dashing' since November and am quite impressed. I mainly calculate full symmetric distance matrices for large datasets downloaded from GenBank, e.g. the 'plant' or 'fungi' clades. I am using a k-mer length of k=12 or k=13 to get reasonable distance matrices: whenever I increase k beyond 15, the number of distance values that are 0 becomes too large. Likewise, when k gets much smaller, the number of distance values that are 1 becomes too large. Running with k=31 is entirely out of the question. So I am asking myself: is this to be expected, or is it maybe due to some error I made during setup? I am using precompiled binaries. I'd be glad for a hint. Thanks and cheers