Closed dnbaker closed 3 years ago
Parallel greedy clustering is now fixed and fully parallelized, and works with many threads.
This clusters Archaea in 5.8s if pre-sketched, with S = 1337. This clusters 33k bacterial genomes in 1m30, 56k in 2m20, and 100k in 3m42 using 20 threads. Utiliization seems to not be complete, but at least it's behaving sublinearly.
Add greedy clustering using affinity thresholds.
The current parallelization with locks is ineffective and not much faster; the smart approach will be a divide-and-conquer algorithm, by which a tuple of (LSHTable, Representatives, Constituents), where the LSHTable is an LSH table for the items in Representatives. These can be easily merged, and this yields a parallelized, lock-free approach with the same asymptotic complexity.
Currently, it only works in single-threaded mode.