Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
492 stars 162 forks source link

Choosing longest or shortest alignment #213

Open brendanf opened 1 year ago

brendanf commented 1 year ago

I am testing edlib for the purpose of calculating pairwise identites between sequences, using the BLAST/USEARCH definition of sequence identity; i.e. $\text{id} = \frac{\text{edit distance}}{\text{alignment length}}$. One difficulty I am finding is that there can be multiple alignments which have the same edit distance, but different lengths, even in the fully global case, because it may be possible to trade, e.g., 2 mismatches for 1 insertion and 1 deletion. This means that the pairwise identity is not uniquely defined. Is there any way for edlib to give me a maximally long and/or maximally short CIGAR, or even just calculate what the maximum and minimum lengths are?

If this is possible but would require some coding, I'd be willing to give it a shot if you give me some pointers.

Martinsos commented 1 year ago

@brendanf I am not 100% sure as it has been some time since I have implemented the core part of Edlib, but off the bat I would guess that is probably not easy to do. Edlib is fast because it is optimized to use edit distance as scoring mechanism, and adding any additional criteria, like for example to try and find minimal size of cigar, will almost certainly be hard to fit into that. One thing you can do is tell it to return multiple possible positions at which the alignment ends -> is that helpful? You can't make sure those are the min and max though if I am correct.

I can't give much better answer right now without spending significant amount of time on figure this all out. If you are intersted, there is a link to an article for Edlib that I wrote, that explains inner workings in detail, so that could shine some light on your question also.