Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
515 stars 167 forks source link

How to report all possibale positions not only the best one ? #216

Closed cherryamme closed 1 week ago

cherryamme commented 1 year ago

Hello I try to get multiple possible positions at which the alignment ends, but it seems like edlib.align only report the most possible site. the k arg is just a cut steps to filter. can I get all possible positions ? for example : when I set a arg k=2 , it will report sites combine editDistance=0 editDistance=1 and editDistance=2. @Martinsos Thanks

Martinsos commented 1 year ago

Hi @cherryamme ! Sorry, I am not sure I understand the question. Did you say you would like to get the solution for distance = 0, distance = 1 and so on? That is not how edlib works -> it looks for the smallest edit distance. It can report multiple end locations, but they will all be optimal solutions.

Why would you be interested into finding suboptimal solutions?

Anjan-Purkayastha commented 1 week ago

@Martinsos : I have posted issue # 229 on this same problem. If what you say is true then the meaning of the parameter k: maximum number of mismatches is misleading. I understand this to be the following: if I specify a k of 2, then edlib.align will report all positions that have a mismatch of 0,1 or 2. Otherwise, providing an upper bound for mismatches, only to have the best match reported is meaningless. To answer your question as to why- I am trying to determine all possible primer binding sites for a given template- primer being the query and template being the target. It is important to identify not only the perfectly match sites but also the ones with a few mismatches, as these could also serve as priming sites.

Martinsos commented 1 week ago

@Martinsos : I have posted issue # 229 on this same problem. If what you say is true then the meaning of the parameter k: maximum number of mismatches is misleading. I understand this to be the following: if I specify a k of 2, then edlib.align will report all positions that have a mismatch of 0,1 or 2. Otherwise, providing an upper bound for mismatches, only to have the best match reported is meaningless. To answer your question as to why- I am trying to determine all possible primer binding sites for a given template- primer being the query and template being the target. It is important to identify not only the perfectly match sites but also the ones with a few mismatches, as these could also serve as priming sites.

I understand what you woud like to have and why you find it important for your use case, but that is not how edlib works, at its core, and is out of its feature scope. I am pretty sure docs never state that edlib will report all positions, quite the opposite, you can see in the type of result that only one match is reported. k does have a meaningful impact that is described in the docs -> it significantly speeds up the computation because you tell it "I know the solution I am looking for is not >k".

As I said in another issue, I am sorry that this wasn't clear from the docs / README -> you will know the best what was unclear, so if you want to make a PR making this clearer, I would appreciate it!