Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Edlib could be faster if starting with bigger k for unsimilar sequences in HW mode #45

Closed by Martinsos 8 years ago

Martinsos commented 8 years ago

Tests showed that in HW mode, when the score is big enough, it can be faster to start with a larger k. For example, when similarity (1 - score / read_length) is below 60%, we get better results by running edlib with k = read_length than by using k = -1.

So how can we use this to speed up edlib? If we had a way to estimate the similarity of two sequences quickly and roughly up front, we could make a decision: "they seem to be pretty dissimilar, so let's use k = read_length instead of k = -1".
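
A rough sketch of what such a decision could look like. It assumes a q-gram hit fraction as the quick estimate, which is not something edlib provides or that was tested here. The names `estimate_similarity` and `choose_k` are hypothetical, as are `q=4` and the reuse of the 60% figure as a threshold. The q-gram fraction is only a proxy for 1 - score / read_length, so the threshold would need calibrating.

```python
def estimate_similarity(query, target, q=4):
    """Hypothetical cheap estimate: fraction of the query's q-grams found anywhere in target.

    This is NOT the similarity 1 - score / read_length, only a rough proxy for it.
    """
    if len(query) < q:
        return 1.0  # too short to judge; assume similar and keep the default behaviour
    # Set of all q-grams of the target (empty if the target is shorter than q).
    target_qgrams = {target[i:i + q] for i in range(len(target) - q + 1)}
    total = len(query) - q + 1
    hits = sum(1 for i in range(total) if query[i:i + q] in target_qgrams)
    return hits / total


def choose_k(query, target, threshold=0.6):
    """Return k = len(query) for seemingly dissimilar pairs, else -1 (no limit)."""
    if estimate_similarity(query, target) < threshold:
        return len(query)  # in HW mode the score cannot exceed the query length
    return -1


# With the Python binding, the result could then be passed on, e.g.:
#   edlib.align(query, target, mode="HW", task="distance", k=choose_k(query, target))
```

Whether the estimate is cheap enough to pay for itself would have to be measured; building the q-gram set is linear in the target length.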

Martinsos commented 8 years ago

But actually, if we have a way to estimate similarity, then we can use that to start with a better k, so this is covered by that.
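
To make that concrete, here is a minimal sketch of turning an estimated similarity into a starting k. It inverts the formula similarity = 1 - score / read_length to get an expected score, then pads it. The function names, the `slack` factor, and the doubling retry schedule are all assumptions for illustration, not a description of what edlib does internally.

```python
import math


def starting_k(estimated_similarity, read_length, slack=1.25):
    """Hypothetical helper: derive a first k from an estimated similarity.

    expected score = (1 - similarity) * read_length, padded by `slack` (assumed value)
    and clamped to [1, read_length].
    """
    est = min(max(estimated_similarity, 0.0), 1.0)
    expected_score = (1.0 - est) * read_length
    k = math.ceil(expected_score * slack)
    return max(1, min(k, read_length))


def k_schedule(first_k, read_length):
    """Assumed retry plan: double k after each failed attempt, ending at read_length."""
    ks = []
    k = first_k
    while k < read_length:
        ks.append(k)
        k *= 2
    ks.append(read_length)  # in HW mode the score never exceeds the read length
    return ks


# Example: a read of length 100 estimated at 90% similarity.
first = starting_k(0.9, 100)
print(first, k_schedule(first, 100))
```

A poor estimate only costs extra retries here, since the schedule always ends at k = read_length.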