dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License
61 stars 7 forks source link

dartminhash implenmentation #67

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Hello Daniel,

Just want to ask whether there are plans for also implement dartminhash (https://arxiv.org/abs/2005.11547), which is very fast for sparse dataset (genomes, metagenomes), and much faster than bagminhsh according to my reading for very sparse dataset and large k (minhashes).

Thanks,

Jianshu

dnbaker commented 1 year ago

Hi Jianshu -

Thanks for asking! Sorry it's been so long. That's something that's definitely possible. I'll see if I can work on it in the next week.

Take care,

Daniel

jianshu93 commented 1 year ago

Hello Daniel,

A follow up question, is the one permutation MinHash/SetSketch in Dashing2 the one used in BinDash (I noticed that you also contribute to it), which is the b-bit one permutation MinHash with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf). I feel like BinDash is faster than Dashing 1 (HLL) and also one permutation SetSketch.

Thanks,

Jianshu

dnbaker commented 1 year ago

It varies by the task at hand. The raw comparison speed of bindash's packed sketches is very fast because they can terminate their loop early, and bindash is an excellent software application. I'm not sure on dashing2 vs bindash runtime in all-pairs dense comparisons. But both Dashing1 and Dashing2 sketch substantially faster than bindash, and we've measured dashing2's comparisons to be 20x as fast as dashing1's, so I expect it to be sufficiently performant for pairwise comparisons.

More important for large problems, I think, Dashing2's K-nearest neighbor mode, distance thresholded sparse distance matrix, and nearly-linear clustering via --greedy allow it to scale in both space and time without being limited to performing all-pair comparisons.

Apart from that, our code is quite general, supporting weighted sketching with either prob or bagminhash, can perform complete k-mer set and weighted k-mer set comparisons, consumes a variety of alphabets, and can do minimizer set comparisons.

Happy to discuss further!

And I'm definitely behind on a faster weighted sketching algorithm, but I'm planning to get to it soon. I'm encouraged by the fact that I've been able to dedicate some attention.

Best,

Daniel