broadinstitute / adapt

A package for designing activity-informed nucleic acid diagnostics for viruses.
MIT License
28 stars 1 forks source link

Add trie sharding method to query the specificity of k-mers #19

Closed haydenm closed 4 years ago

haydenm commented 4 years ago

This PR adds a new approach to query for the specificity of k-mers. It works by sharding them across many small tries. See the new kmer_shard module for details, including the data structure and algorithms.

This adds a class, AlignmentQuerierWithKmerSharding, to implement specificity queries using the new data structures. It renames the old AlignmentQuerier to AlignmentQuerierWithLSHNearNeigbhor.

In contrast to the previous LSH near-neighbor approach, the sharding approach is meant to be faster, use less memory, and not be probabilistic (i.e., if there is non-specificity, it must report that).

This exposes an argument in design.py, --id-method, to select which approach to use.

In a benchmark on 17 flavivirus species, the sharding approach took ~7 h and used 26 GB of memory at its peak, whereas the LSH near-neighbor approach took ~25 h and used 37 GB of memory at its peak.