Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
506 stars 165 forks source link

Better define end and start locations and adjust implementation to that #106

Open Martinsos opened 6 years ago

Martinsos commented 6 years ago

I found that right now end and start locations are not well enough defined. While their current definition works fine for usual cases, it is too ambiguous and does not cover some special cases, mostly when query and target are overlaping.

For example, one of such cases is when one of optimal solutions is having query in front of target, like this:

HW mode:
   CTG  <- target
AAA  <- query

In this case, the end location is -1. What does that mean? What is the start location then? I would also like to notice that in this case there are always other optimal solutions with end locations: 0, 1, ... .

Right now definition of start and end locations makes sense only when query's location is completely inside the target. I should think about this and how to define it in more broader manner, so it also works for other situations.

It is also important to notice that in Edlib, step for finding alignment path uses information about start and end locations to limit it's search space. Therefore, they should also be defined in such way that finding of alignment path can work correctly using them. Right now it works correctly, probably because it is robust enough, but I want it to be clearer and more direct, less magic.

While at this, I should probably also do #93, since it is all really interconnected.