dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License
62 stars 7 forks source link

Dev17 #22

Closed dnbaker closed 3 years ago

dnbaker commented 3 years ago

Add greedy clustering using affinity thresholds.

The current parallelization with locks is ineffective and not much faster; the smart approach will be a divide-and-conquer algorithm, by which a tuple of (LSHTable, Representatives, Constituents), where the LSHTable is an LSH table for the items in Representatives. These can be easily merged, and this yields a parallelized, lock-free approach with the same asymptotic complexity.

Currently, it only works in single-threaded mode.

dnbaker commented 3 years ago

Parallel greedy clustering is now fixed and fully parallelized, and works with many threads.

This clusters Archaea in 5.8s if pre-sketched, with S = 1337. This clusters 33k bacterial genomes in 1m30, 56k in 2m20, and 100k in 3m42 using 20 threads. Utiliization seems to not be complete, but at least it's behaving sublinearly.