Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Cannot identify alignments with more than 0 mismatches. #229

Closed. Anjan-Purkayastha closed this issue 1 week ago

Anjan-Purkayastha commented 1 week ago

Describe the bug: I am running the edlib.align function to locate a test primer sequence within a longer template sequence. The test_primer sequence occurs at two locations in test_template: positions 9-29 with 0 mismatches, and positions 140-160 with 1 mismatch at position 145. When I run edlib.align allowing at most 3 mismatches, edlib reports only the location with 0 mismatches. The location with 1 mismatch is not reported.

To Reproduce: Code to run: left_alignment = edlib.align(test_primer, test_template, mode='HW', task='locations', k=3). Please use the attached files.

Expected behavior: Since the maximum number of mismatches is set to 3, I expect both alignments to be reported.


Anjan-Purkayastha commented 1 week ago

Here is another example: edlib.align('ATGC', 'ATGTATGC', mode = 'HW', task = 'locations', k = 1)

Expected result: the following locations are identified: (0, 3) with 1 mismatch and (4, 7) with 0 mismatches.

Actual output: {'editDistance': 0, 'alphabetLength': 4, 'locations': [(4, 7)], 'cigar': None}

Only the perfect match is reported; the location with 1 mismatch is not.

Another error: if a match starts at the first position and ends at, say, position 8, the location is reported as (None, 8). This should be corrected to (0, 8).

Martinsos commented 1 week ago

Hey @Anjan-Purkayastha -> this is not a bug, it is how edlib works, I am sorry if this was not obvious from the docs.

Check the comments here https://github.com/Martinsos/edlib/blob/master/edlib/include/edlib.h -> what k means is that you don't care about the solution if its edit distance is larger than k. But that doesn't mean edlib will return all solutions with edit distance up to k -> it will always return only one solution, the best one. You can see that in the result object -> it returns a single edit distance and a single end location. It can return multiple possible start locations though, but that is it.
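To make the "only the best solution" behaviour concrete, here is a rough pure-Python sketch of infix (HW-style) edit distance. It is not edlib's implementation or API; the function name end_distances is made up for illustration. It computes, for every end position in the target, the edit distance of the best alignment ending there. Reporting only the minimum is what the explanation above describes, and filtering by k instead is one way to see every candidate end position.

```python
def end_distances(query, target):
    """Best edit distance of `query` against any infix of `target` ending at each index.

    Hypothetical helper for illustration only; plain O(len(query) * len(target)) DP.
    """
    m = len(query)
    # Column for the empty target prefix: aligning query[:i] to nothing costs i.
    prev = list(range(m + 1))
    result = []
    for ch in target:
        cur = [0] * (m + 1)  # cur[0] = 0 because a match may start anywhere in target
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in target
                         cur[i - 1] + 1)      # extra character in query
        result.append(cur[m])
        prev = cur
    return result


dists = end_distances("ATGC", "ATGTATGC")
best = min(dists)

# Only the best distance and where it ends, as in the output reported in this issue.
best_ends = [j for j, d in enumerate(dists) if d == best]

# Every end position whose best alignment has edit distance <= k (here k = 1).
within_k = [(j, d) for j, d in enumerate(dists) if d <= 1]

print(best, best_ends)  # 0 [7]
print(within_k)         # [(2, 1), (3, 1), (6, 1), (7, 0)]
```

Note that with edit distance (insertions and deletions allowed) several neighbouring end positions fall within k, so the "all solutions within k" list is usually longer than the number of distinct sites.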

If you think this is unclear in the docs, I would appreciate a PR that would clear it up!
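If only substitutions (mismatches) matter, as in the primer use case from this issue, a possible workaround outside edlib is a plain sliding-window mismatch count. This is a minimal sketch under that assumption; hamming_hits is a hypothetical helper, not part of edlib, and it ignores insertions and deletions entirely.

```python
def hamming_hits(query, target, k):
    """Return (start, end, mismatches) for every window of target within k mismatches.

    Hypothetical helper: substitutions only, positions are zero-based and inclusive.
    """
    m = len(query)
    hits = []
    for start in range(len(target) - m + 1):
        window = target[start:start + m]
        mismatches = sum(a != b for a, b in zip(query, window))
        if mismatches <= k:
            hits.append((start, start + m - 1, mismatches))
    return hits


hits = hamming_hits("ATGC", "ATGTATGC", k=1)
print(hits)  # [(0, 3, 1), (4, 7, 0)]
```

For the second example in this thread it reports both (0, 3) with 1 mismatch and (4, 7) with 0 mismatches. It runs in time proportional to len(query) * len(target), so for long templates it will be much slower than edlib.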