Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

left-align? #96

Closed brentp closed 6 years ago

brentp commented 6 years ago

with a pair of sequences, aligned by hand as:

TTCATTAACAACGAGGCGAAACTGGGCTACTCCATGACCAGGGGCAAAATAGGCTTTTAGCCGCTGCGTTCTGGGAGCTCCTCCCCCTTCTGGGAGCTCCTCCCCCTCCCCAGAAGGCCAAGGGATGTGGGGGCTGGGGGACTGGGAGGCCTGGCAGTCTT
                CGAAACTGGGCTACTCCATGACCAGGGGCAAAATAGGCTTTTAGCCGCTGCGTTCTGGGAGCTCCTCCCCCT                   CCCCAGAAGGCCAAGGGATGTTGGGG

I put the top line into target.fa and the bottom line into query.fa and then align as:

edlib-aligner  -m HW -p -f CIG_STD query.fa target.fa

the reported cigar is:75M2I1M2I2M2I1M1I2M2I4M4I

Is there a way I can get the expected of 75M19I26M

I guess this is because it's not left/right aligning indels?

thanks, -Brent

Martinsos commented 6 years ago

Hi @brentp, sorry for late response! Maybe the most important thing to say here is that Edlib is implementing edit distance, meaning that gaps are equally bad as mismatches and there is no way for it to prefer one over another. Every insertion, deletion or mismatch is penalized with 1 point, and edlib will find such alignment that minimizes the total sum of these points.

What you are expecting aligner to do is to find a solution that has 75 matches, 19 insertions and 26 matches after that. Such solution would have edit distance of 19. The thing is, it turns out that when using edit distance, that solution is not the best one! Best solution has edit distance of 18, and is the one that Edlib reported. CIG_STD may be a little bit misleading, because it shows and groups both matches and mismatches as the same operation (M), that is how it is defined. If you use CIG_EXT instead, or just remove the -f CIG_STD so it uses default "pretty" output, you will see that there are both insertions and mismatches in optimal solutions, and there is total 18 of them all together, which is better than 19 insertions.

Below you can see 3 different representations of same alignment/solution:

Standard cigar (CIG_STD): image

Extended cigar (CIG_EXT): image

Pretty output: image

So this solves the mistery of why Edlib did not return the solution that you expected -> because solution you expected is not an optimal solution!

I guess this still does not solve your problem, and that is that you want aligner to behave differently and to prefer solution with big gaps, because that makes more sense from the biological point of the view. If you prefer gaps (long gaps) over mismatches, you should use an algorithm that implements Gotoh or accepts custom penalties for mismatches and gaps, so you can put lower penalty for gaps -> there are many libraries out there implementing such algorithms, with reasonable defaults. Unfortunately, edit distance is not an algorithm that can accommodate this.

I hope this helps, I am closing the issue but please let me know if something is not clear!

brentp commented 6 years ago

wow, thanks for the careful reply. I understand completely. I ended up using ksw2 because it allows setting gap-open, extend but I did write a nim wrapper for edlib with (IMO) a rather nice syntax here that I'll surely revisit later.

Martinsos commented 6 years ago

@brentp I am glad it helped! Edlib is mostly limited by being only edit distance library, but it is also what gives it its speed and keeps the interface relatively simple. Although it was made with bioinformatics in mind, I also make sure to keep it a general solution for edit distance in any use case where it might be needed. ksw2 seems like a great match for what you need! Oh wow, I actually never heard about nim, but it is really cool that you wrote an edlib wrapper for it. Seems like a cool language, although I guess it has hard time competing with Rust and Go? If you decide to make nim wrapper a thing, you could make it into a repository and I will link to it from my repository. If I knew Nim I would probably even add it to bindings in my repo, but I don't so I think it is better to keep it under your control as a separate project.