Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
505 stars 165 forks source link

Inconsistency with API on empty strings #228

Open nkkarpov opened 3 weeks ago

nkkarpov commented 3 weeks ago

For example, edlib.align("ACGT", "", k=1) returns 4 instead of -1. Additionally, it appears that the CIGAR string and locations are unreliable when the query string is empty, as they contain None even when k is larger than the edit distance.

edlib.align("ACGT", "", k=5, task="path")
{'editDistance': 4,
 'alphabetLength': 4,
 'locations': [(None, -1)],
 'cigar': None}
Martinsos commented 1 week ago

Thanks for reporting this!

Steps I think we should take:

  1. It seems you reported issue with python version of Edlib -> I wonder if this is issue also with C/C++ edlib, so the issue is originating from there, or this is happening only in the python binding in the logic that wraps C/C++ edlib.
  2. "Easy fix" for this is to update docs/comments that empty strings are not acceptable + throw runtime error on empty strings.
  3. Proper fix would be to make sure that edlib correctly handles empty string. We would first need to define what behaviour we expect, and then implement it. Also add tests for these cases.

I am short with time at the moment + this is easy to handle on the user side so I won't be prioritizing it at the moment, but I would be happy to accept PRs, as long as they follow the steps above.