bacpop / sketchlib.rust

Rust reimplementation of pp-sketchlib
Apache License 2.0

Dereplication of genomes #21

Open johnlees opened 4 months ago

johnlees commented 4 months ago

Notes:

Find the group of queries which share a k-mer in a bin.

Calculate the distances from these to the centre (the longest sequence).

Cluster: 'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'
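The clustering step quoted above can be sketched roughly as follows. This is only an illustration of the described procedure, not code from sketchlib.rust: the function name `greedy_cluster`, the use of `usize` indices for sequences, and the `(centre, member)` edge pairs are all assumptions made for the example, and every index in `edges` is assumed to be smaller than `lengths.len()`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch of the quoted greedy clustering.
/// `lengths[i]` is the length of sequence `i`; `edges` holds validated
/// directed (centre, member) pairs. Returns (representative, members),
/// where `members` includes the representative itself.
fn greedy_cluster(lengths: &[usize], edges: &[(usize, usize)]) -> Vec<(usize, Vec<usize>)> {
    // Add every reverse edge by storing the graph as undirected adjacency sets.
    let mut adj: HashMap<usize, HashSet<usize>> = HashMap::new();
    for &(centre, member) in edges {
        adj.entry(centre).or_default().insert(member);
        adj.entry(member).or_default().insert(centre);
    }

    // Sort sequence indices by decreasing length (ties broken by index).
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    order.sort_by(|&a, &b| lengths[b].cmp(&lengths[a]).then(a.cmp(&b)));

    // Walk the sorted list; `assigned` marks sequences already removed from it.
    let mut assigned = vec![false; lengths.len()];
    let mut clusters = Vec::new();
    for &rep in &order {
        if assigned[rep] {
            continue;
        }
        assigned[rep] = true;
        let mut members = vec![rep];
        if let Some(neighbours) = adj.get(&rep) {
            // Take every neighbour that is still in the list.
            let mut remaining: Vec<usize> = neighbours
                .iter()
                .copied()
                .filter(|&n| !assigned[n])
                .collect();
            remaining.sort_unstable();
            for n in remaining {
                assigned[n] = true;
                members.push(n);
            }
        }
        clusters.push((rep, members));
    }
    clusters
}
```

Sorting on a different key (for example an assembly quality score instead of length) would only change the comparison in `sort_by`; the rest of the loop stays the same.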

Use a reverse index.

First step: sketch between those which share a bin.

Can give assembly quality as input and presort, so the top will always be the best (to find the representative to align against).
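One possible shape for such a reverse index is sketched below, purely as an illustration. The `ReverseIndex` type, its methods, and the representation of a sketch as a `Vec<u64>` with one value per bin are assumptions for this example and do not reflect the actual sketchlib.rust data structures. All sketches are assumed to have the same number of bins.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical reverse index: for each bin, a map from the value stored
/// in that bin to the indices of the samples holding that value.
struct ReverseIndex {
    bins: Vec<HashMap<u64, Vec<usize>>>,
}

impl ReverseIndex {
    /// Build the index from per-sample sketches (one value per bin).
    fn new(sketches: &[Vec<u64>]) -> Self {
        let n_bins = sketches.first().map_or(0, |s| s.len());
        let mut bins: Vec<HashMap<u64, Vec<usize>>> = vec![HashMap::new(); n_bins];
        for (sample, sketch) in sketches.iter().enumerate() {
            for (bin, &value) in sketch.iter().enumerate() {
                bins[bin].entry(value).or_insert_with(Vec::new).push(sample);
            }
        }
        Self { bins }
    }

    /// Samples that share a value with `query` in at least one bin.
    fn candidates(&self, query: &[u64]) -> BTreeSet<usize> {
        let mut hits = BTreeSet::new();
        for (bin, value) in query.iter().enumerate() {
            if let Some(samples) = self.bins.get(bin).and_then(|m| m.get(value)) {
                hits.extend(samples.iter().copied());
            }
        }
        hits
    }
}
```

Only the samples returned by `candidates` would then need a full sketch distance calculation, which is the point of the "share a bin" first step.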

johnlees commented 3 months ago

Rather than converting from the current .skd format, it is probably easier to have a dedicated reverse index constructor function, and then store an enum with the sketch type in the metadata.
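As a rough illustration of storing the sketch type in the metadata, something along these lines might work. The names `SketchType`, `SketchMetadata`, `require`, and all of the fields are hypothetical and chosen for the example; the real metadata struct in sketchlib.rust may look quite different.

```rust
/// Hypothetical tag recording how the sketch data on disk is laid out.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchType {
    /// Sample-major layout, as in the current .skd files.
    Forward,
    /// Bin-major layout built by a dedicated reverse index constructor.
    ReverseIndex,
}

/// Hypothetical metadata carrying the sketch type alongside other fields.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SketchMetadata {
    sketch_type: SketchType,
    kmer_sizes: Vec<usize>,
    sketch_bins: usize,
}

impl SketchMetadata {
    /// Return an error if the stored sketch type is not the expected one.
    fn require(&self, expected: SketchType) -> Result<(), String> {
        if self.sketch_type == expected {
            Ok(())
        } else {
            Err(format!(
                "expected {:?} sketches but found {:?}",
                expected, self.sketch_type
            ))
        }
    }
}
```

A check like `require` would let functions that only work on one layout fail early with a clear message, instead of misreading the data.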

johnlees commented 2 months ago

See also #11 for some earlier thoughts on this.

First use case would be to add a distance function against a new query sample:

Then later, some optimisations:

johnlees commented 2 months ago

Also, ignore parallelisation and memory use for now – I will try and add these optimisations in future.