dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

Edit Distance: Proper Support #15

Closed: dnbaker closed this 3 years ago

dnbaker commented 3 years ago

This commit improves support for edit distance in sketching and distance comparison. It applies both in sequence space, where edit distance is computed between sequences, and in minimizer space, where it is computed between sequences of 64-bit minimizers.

For the FULL_MMER_SEQUENCE format, this means that the edit distance is computed between the minimizer sequences.
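
To illustrate the difference between the two spaces, here is a minimal Python sketch of Levenshtein edit distance that works on any sequence of comparable items, so the same routine covers character strings (sequence space) and lists of integer minimizers (minimizer space). This is not Dashing 2's implementation; the function name `edit_distance` and the minimizer values used below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences of comparable items.

    Works on strings (sequence space) and on lists of integers
    (minimizer space) alike. Illustrative only, not Dashing 2's code.
    """
    # Keep the shorter sequence in the inner loop to bound the row length.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


# Sequence space: compare characters.
seq_dist = edit_distance("kitten", "sitting")

# Minimizer space: compare hypothetical 64-bit minimizer values.
mmer_dist = edit_distance([0x1, 0x2, 0x3], [0x1, 0x3])
```

In minimizer space each element is one minimizer, so a single edit corresponds to inserting, deleting, or substituting a whole minimizer rather than a single base.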

With --edit-distance, the default output is the fraction of shared registers. If --compute-edit-distance or --exact-kmer-dist is also passed, the edit distance between the sequences is emitted in the output matrix.
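
As a rough illustration of what a "fraction of shared registers" means, the Python sketch below compares two equal-length register arrays position by position. It assumes a sketch can be modeled as a plain list of integers; the function name `shared_register_fraction` and the register values are hypothetical and do not reflect Dashing 2's internal layout.

```python
from typing import Sequence


def shared_register_fraction(sk_a: Sequence[int], sk_b: Sequence[int]) -> float:
    """Fraction of positions at which two sketches hold the same register value.

    Assumes both sketches have the same, nonzero number of registers.
    """
    if len(sk_a) != len(sk_b):
        raise ValueError("sketches must have the same number of registers")
    if not sk_a:
        raise ValueError("sketches must not be empty")
    matches = sum(x == y for x, y in zip(sk_a, sk_b))
    return matches / len(sk_a)


# Two hypothetical 4-register sketches that agree in positions 0 and 2.
fraction = shared_register_fraction([1, 2, 3, 4], [1, 9, 3, 8])
```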

The key idea here is that the LSH table is used for candidate pruning, but final distances come from exact computation.
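
The prune-then-verify pattern can be sketched in Python as follows. This is a toy model under stated assumptions, not Dashing 2's code: sketches are plain lists of integers, the LSH table is a dictionary keyed by bands of consecutive registers, and the names `candidate_pairs`, `pruned_exact_distances`, and `band_size` are hypothetical. Only pairs that collide in at least one band are passed to the exact edit distance computation.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    """Exact Levenshtein distance between two sequences."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def candidate_pairs(sketches, band_size):
    """Return name pairs whose sketches share at least one full band of registers."""
    table = defaultdict(list)  # toy LSH table: (band offset, band values) -> names
    for name, regs in sketches.items():
        for start in range(0, len(regs) - band_size + 1, band_size):
            table[(start, tuple(regs[start:start + band_size]))].append(name)
    pairs = set()
    for names in table.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def pruned_exact_distances(seqs, sketches, band_size=2):
    """Compute exact edit distances, but only for LSH candidate pairs."""
    return {
        (p, q): edit_distance(seqs[p], seqs[q])
        for p, q in candidate_pairs(sketches, band_size)
    }


# Hypothetical inputs: register values are hand-picked, not real sketches.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
sketches = {"a": [1, 2, 3, 4], "b": [1, 2, 7, 8], "c": [9, 9, 9, 9]}
result = pruned_exact_distances(seqs, sketches, band_size=2)
```

Here "a" and "b" collide in their first band and are compared exactly, while "c" collides with nothing and is never passed to the quadratic-time edit distance routine.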