This PR adds a new approach to query for the specificity of k-mers. It works by sharding them across many small tries. See the new kmer_shard module for details, including the data structure and algorithms.
This adds a class, AlignmentQuerierWithKmerSharding, to implement specificity queries using the new data structures. It renames the old AlignmentQuerier to AlignmentQuerierWithLSHNearNeigbhor.
In contrast to the previous LSH near-neighbor approach, the sharding approach is meant to be faster, use less memory, and not be probabilistic (i.e., if there is non-specificity, it must report that).
This exposes an argument in design.py, --id-method, to select which approach to use.
In a benchmark on 17 flavivirus species, the sharding approach took ~7 h and used 26 GB of memory at its peak, whereas the LSH near-neighbor approach took ~25 h and used 37 GB of memory at its peak.
This PR adds a new approach to query for the specificity of k-mers. It works by sharding them across many small tries. See the new
kmer_shard
module for details, including the data structure and algorithms.This adds a class,
AlignmentQuerierWithKmerSharding
, to implement specificity queries using the new data structures. It renames the oldAlignmentQuerier
toAlignmentQuerierWithLSHNearNeigbhor
.In contrast to the previous LSH near-neighbor approach, the sharding approach is meant to be faster, use less memory, and not be probabilistic (i.e., if there is non-specificity, it must report that).
This exposes an argument in
design.py
,--id-method
, to select which approach to use.In a benchmark on 17 flavivirus species, the sharding approach took ~7 h and used 26 GB of memory at its peak, whereas the LSH near-neighbor approach took ~25 h and used 37 GB of memory at its peak.