Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
492 stars 162 forks source link

Infix (HW) alignment mode does not produce different result than standard NW on long sequences #181

Closed rlorigro closed 3 years ago

rlorigro commented 3 years ago

Describe the bug Infix alignment mode (EDLIB_MODE_HW) does not appear to produce any different result than standard NW on long sequences.

To Reproduce Run this executable: https://github.com/rlorigro/overlap_analysis/blob/main/src/test/test_edlib.cpp

Using this dataset: https://github.com/rlorigro/overlap_analysis/blob/main/data/shasta_TinyTest.fastq

Expected behavior I would expect that the alignment does not start from 0,0 in this case, but it still does.

Environment (please complete the following information):

image In the above example, the exact matches as computed with Mummer4 are shown in purple, and the edlib alignment using HW mode is shown in orange. Would you expect that this happens? The relationship between these 2 sequences is that they have partial overlap:

aaaaaaa
   ||||
   bbbbbbbbbbb
Martinsos commented 3 years ago

Hi @rlorigro, thanks for using edlib :)!

Just to make sure that it is clear what HW does -> it is best suited when query is substring of target sequence, when it is cointained in it -> that is why we also call HW and "infix" method. If you need to deal with the overlap like you have above, where end of one sequence matches the start of another, then edit distance (and therefore edlib) is not a good solution, because for such case, there is always a trivial solution where edit distance is 0 -> just put them one after each other. There are other, better scoring mechanisms that can work with this. That said, let's investigate this case because there are more things here that are confusing me.

I am not sure if I understand the graph: I am guessing the y axis is query and x is the target sequence. What about the incline of the slope -> I get that 45 incline is a series of matches, but what is that narrower slope? It seems that Edlib decided to do a sequence of insertion/deletion/insertion/deletion/insertion/deletion/... and that is why it looks like that? I didn't expect that I have to admit, I would expect a series of just insertions or just deletions, similar to the purple line. That might be interesting to investigate. But ok, that in-self doesn't affect the correctness. So the problem is that mummer4 seems to find the bigger area of match, right? You are right that that seems weird, it should be same. But, this poses question, which scoring system does mummer4 use? Edlib uses edit distance (0s and 1s) -> does mummer use something else? If so, that might be the cause of this behaviour.

One thing I am confused about is, if the two sequence are overlaping, why does mummer give those results? This looks like query is subsequence/infix of the target, not an overlap. Or maybe my understanding of the graph is wrong?

rlorigro commented 3 years ago

Hi Martin,

Sorry for not being more explicit in my explanation. Mummer is an implementation of a suffix array, so it is just an efficient way to find matching substrings between any two sequences. In this case it shows all the query substrings (y axis) greater than length 10 that match to a unique sequence in the target (x axis).

If there are no purple pixels then it is safe to assume that there is no "real" alignment between the sequences in that region. The plot seems to show that edlib is taking the path of a global alignment, starting at 0,0 and finding some partial reward during the middle portion in which it overlaps with the matches highlighted in purple, and finally branching off and terminating at (x_size,y_size).

I guess this defies my expectation because I would think that infix mode would allow at least one of the ends to terminate without reaching the corner of the matrix. Though I am probably misunderstanding its definition.

rlorigro commented 3 years ago

Actually I just noticed that you are correct, in this matrix, the matches extend through all of the y coordinates but only part of the x coordinates, so this really should be the ideal case for infix mode.

Martinsos commented 3 years ago

@rlorigro from the chart, this does indeed look like a perfect task for infix (HW) mode -> it allows gaps on both sides of query (but does not allow any gaps for target). If I am correct, for HW it is important which sequence is query and which one is target. Maybe you provided longer sequence as a query and shorter one as a target? That would explain why HW is not applying any gaps. In that case, just swap the sequences. Looking at your code, that seems to be it! I believe if you just swap the order of sequences when providing them as arguments for edlib.align(), you will get correct results.

Hmm, I wonder why this mistake happened -> maybe documentation is not clear enough that order is important? Probably right?

rlorigro commented 3 years ago

Ah I see now. The first argument for edlibAlign is specifically the query, and that determines which sequence is contained by the other.

Well I wouldn't consider it a problem. Just my own hasty coding. This is also the first time I have come across this particular "infix" method, so I was expecting different behavior I think. I need to be able to support overlaps as well for my use case.

Martinsos commented 3 years ago

Ok, great that it works now! I will be closing the issue then.

Overlap -> unfortunately, edit distance can't theoretically work on overlaps. You will need to use some other scoring system, one that adds points instead of penalties and therefore aims for maximal score instead of minimal score. Local alignment algorithms like SW can be a good fit for this.