Open johnlees opened 4 months ago
Rather than converting from current .skd, probably easier to have a dedicated reverse index constructor function, then store an enum with the sketch type in the metadata.
See also #11 for some earlier thoughts on this.
- `inverted`
- Split `Sketch::new()` into two functions, taking out the first part, which creates the signs vec, so that it can be called separately in the new function.
- Create a `Vec<Vec<u64>>`, which is a list of signs across the samples (each signs is a `Vec<u64>` for the sketch across the bins).
- Create a `Vec<HashMap<u64, BitVec>>` from these. The outer `Vec` is across the bins, in the same order as the `Vec<u64>`. For each bin, the `HashMap` has the bin value as the `u64` key, and the list of samples with that bin value as the value, stored as a `BitVec`. This `BitVec` will be a list of zeros with the same length as the number of samples, but with `1` bits set at the indexes of the samples with that bin value. See https://docs.rs/bitvec/latest/bitvec/vec/struct.BitVec.html.

First use case would be to add a distance function against a new query sample.
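A minimal sketch of the structure described above, and of one way a query could be matched against it. This is an illustration under assumptions, not the planned implementation: `InvertedIndex`, `build_inverted` and `matching_bins` are hypothetical names, a plain `Vec<bool>` stands in for `BitVec` so the example needs only the standard library, and the per-sample count of matching bins is just one possible input to a distance function.

```rust
use std::collections::HashMap;

/// Hypothetical container: one map per bin, keyed by bin value.
/// `Vec<bool>` stands in for `bitvec::vec::BitVec` to keep this std-only.
struct InvertedIndex {
    n_samples: usize,
    bins: Vec<HashMap<u64, Vec<bool>>>,
}

/// Build the index from per-sample signs, laid out as `signs[sample][bin]`.
/// Assumes every sample has the same number of bins as the first one.
fn build_inverted(signs: &[Vec<u64>]) -> InvertedIndex {
    let n_samples = signs.len();
    let n_bins = signs.first().map_or(0, |s| s.len());
    let mut bins: Vec<HashMap<u64, Vec<bool>>> = vec![HashMap::new(); n_bins];
    for (sample_idx, sample_signs) in signs.iter().enumerate() {
        for (bin, &value) in bins.iter_mut().zip(sample_signs.iter()) {
            // Start from all zeros, then set the bit for this sample.
            let mask = bin.entry(value).or_insert_with(|| vec![false; n_samples]);
            mask[sample_idx] = true;
        }
    }
    InvertedIndex { n_samples, bins }
}

/// For each indexed sample, count the bins whose value equals the query's.
fn matching_bins(index: &InvertedIndex, query: &[u64]) -> Vec<u32> {
    let mut counts = vec![0u32; index.n_samples];
    for (bin, value) in index.bins.iter().zip(query.iter()) {
        if let Some(mask) = bin.get(value) {
            for (count, &present) in counts.iter_mut().zip(mask.iter()) {
                if present {
                    *count += 1;
                }
            }
        }
    }
    counts
}
```

Each query costs one hash lookup per bin plus a pass over the matching mask, rather than a comparison against every stored sketch. Turning the counts into a distance (for example via the fraction of matching bins) is left open here.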
Then later, some optimisations:

- `u64` keys can become `u16`, just taking the LSBs (similar to bbits).
- `BitVec` can be replaced with https://docs.rs/roaring/latest/roaring/bitmap/struct.RoaringBitmap.html.

Also, ignore parallelisation and memory use for now – I will try and add these optimisations in future.
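For the `u16` key idea, a small sketch of what "just taking the LSBs" could mean in code. `truncate_key` is a hypothetical helper, not something in the codebase.

```rust
/// Keep only the 16 least-significant bits of a bin value.
/// Hypothetical helper for illustration.
fn truncate_key(bin_value: u64) -> u16 {
    (bin_value & 0xFFFF) as u16
}
```

The trade-off is that distinct `u64` bin values sharing their low 16 bits map to the same key, so some samples would be reported as matching in a bin where their full values differ.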
Notes:

- Find group of queries which share a k-mer in a bin
- Calculate dists of these to centre (longest)
- Cluster: 'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'
- Use a reverse index
- First step: sketch between those which share a bin
- Can give assembly quality as input and presort, top will always be best (to find representative to align against)
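The quoted clustering procedure can be sketched roughly as follows. This is a reading of the quote under assumptions, not an existing implementation: `greedy_cluster` is a hypothetical function, sequences are identified by their index, and `edges` holds the validated (centre, member) pairs. Passing an assembly-quality score in place of `lengths` would give the presorted variant mentioned in the notes.

```rust
use std::collections::HashSet;

/// Greedy clustering as described in the quoted passage.
/// `lengths[i]` is the length of sequence `i`; `edges` are (centre, member) pairs.
/// Returns one `(representative, members)` pair per cluster; members include
/// the representative itself.
fn greedy_cluster(lengths: &[usize], edges: &[(usize, usize)]) -> Vec<(usize, Vec<usize>)> {
    let n = lengths.len();
    // Add every edge in both directions ("all reverse edges are added").
    let mut neighbours: Vec<HashSet<usize>> = vec![HashSet::new(); n];
    for &(a, b) in edges {
        if a < n && b < n && a != b {
            neighbours[a].insert(b);
            neighbours[b].insert(a);
        }
    }
    // Sort by decreasing length; ties are broken by index for determinism.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| lengths[b].cmp(&lengths[a]).then(a.cmp(&b)));

    let mut assigned = vec![false; n];
    let mut clusters = Vec::new();
    for &rep in &order {
        if assigned[rep] {
            continue; // already removed from the list as a member
        }
        assigned[rep] = true;
        let mut members = vec![rep];
        // Take all sequences still in the list that share an edge with `rep`.
        let mut linked: Vec<usize> = neighbours[rep]
            .iter()
            .copied()
            .filter(|&m| !assigned[m])
            .collect();
        linked.sort_unstable();
        for m in linked {
            assigned[m] = true;
            members.push(m);
        }
        clusters.push((rep, members));
    }
    clusters
}
```

Because the list is processed longest-first, each representative is the longest sequence in its cluster, and a sequence with no remaining neighbours ends up as a singleton cluster.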