Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

request: scoring of indels #116

Closed: rderelle closed this issue 6 years ago

rderelle commented 6 years ago

This is not strictly speaking an issue, more a request for further improvement.

With biological data (DNA, RNA or protein sequences), one usually considers the deletion of n adjacent characters to be the result of a single evolutionary event (the same applies to insertions).

The problem is that edlib will score n edits for the deletion/insertion of n adjacent characters. For instance, in this case edlib will return an editDistance of 3:

TAGCGTAGCTAGCCTATTATCG
TAGCGTAGCTA---TATTATCG

while the most parsimonious answer is 1 (i.e. 1 change, consisting of the insertion/deletion of GCC). NB: I believe this reasoning is correct for the comparison of any kind of string.
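To make the count of 3 concrete, here is a minimal pure-Python sketch of unit-cost (Levenshtein) edit distance. It is not edlib's implementation (edlib uses Myers's bit-vector algorithm); the function name `levenshtein` is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via the classic row-by-row dynamic program."""
    # prev[j] = distance between the empty prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


# The two sequences from the example, without gap characters.
print(levenshtein("TAGCGTAGCTAGCCTATTATCG", "TAGCGTAGCTATATTATCG"))  # 3
```

Each of the three deleted characters (G, C, C) is charged separately, which is where the 3 comes from.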

So I was wondering whether it would be possible to add an option to edlib.align() to score 1 edit for any insertion/deletion of n adjacent characters.

thanks.

Martinsos commented 6 years ago

Hi @romain22, thank you for opening an issue. That is a completely reasonable feature request; however, edlib is an edit distance library and as such does not support gap penalties (Gotoh), which is what you described. Due to its nature it cannot support them, as the result would no longer be edit distance. There are other algorithms (and libraries) out there that offer support for that!
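For readers who want to see what the requested scoring looks like, below is a rough sketch of a Gotoh-style dynamic program in plain Python. It is not part of edlib; the function name `affine_gap_distance` and its parameters are made up for illustration. The assumed costs are 1 per substitution, `gap_open` for the first character of a gap and `gap_extend` for each further character, so `gap_open=1, gap_extend=0` charges one edit per run of adjacent indels.

```python
import math


def affine_gap_distance(a, b, mismatch=1, gap_open=1, gap_extend=0):
    """Minimum alignment cost where a gap of length k costs
    gap_open + (k - 1) * gap_extend (Gotoh-style three-state DP)."""
    n, m = len(a), len(b)
    inf = math.inf
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap (deletion);
    # Y: b[j-1] aligned to a gap (insertion).
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend  # one long leading deletion
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend  # one long leading insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = min(M[i - 1][j] + gap_open,    # start a new deletion run
                          X[i - 1][j] + gap_extend,  # extend the current one
                          Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open,    # start a new insertion run
                          Y[i][j - 1] + gap_extend,  # extend the current one
                          X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])


s1 = "TAGCGTAGCTAGCCTATTATCG"
s2 = "TAGCGTAGCTATATTATCG"
print(affine_gap_distance(s1, s2))                # 1: the GCC gap counts once
print(affine_gap_distance(s1, s2, gap_extend=1))  # 3: same as plain edit distance
```

With `gap_extend` equal to `gap_open` every gap character costs the same and the result reduces to ordinary edit distance, which is the case edlib handles. With a cheaper extension the score is a different quantity, and that is why it falls outside an edit distance library.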