Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Question about time complexity #108

Closed. corwinjoy closed this issue 6 years ago.

corwinjoy commented 6 years ago

Thanks for writing such a great library! I have a question about the time complexity of this method. I am trying an HW alignment for DNA sequences with k mismatches. Suppose I have a query of length Q and a search text of length T. What is the time complexity of the run? In the README you say it is O(Q\*T), but that does not seem to match what I observe. For example, with k = max mismatches = 6000 I see:

| Q | T | time |
|---|---|---|
| 20,000 | 20,000 | 0.073s |
| 10,000 | 40,000 | 0.024s |
| 40,000 | 10,000 | 0.024s |

I think the method is maybe something like O(k \* min(Q, T)^2 + |Q - T|), but I really don't know. It can actually scan and align large blocks of text quite fast, as long as Q is not too large. Thanks in advance for your help! I tried looking through your thesis for a discussion of this but didn't find one.

Martinsos commented 6 years ago

Hi @corwinjoy, that is an interesting question! For HW, let's assume Q < T, since that makes the observation a little simpler and is usually the case. Also, let's say k is fixed.

I wrote that the complexity is O(Q\*T) because we are calculating a matrix with Q rows and T columns. If there were no band shrinking (driven by k) and no stopping when the optimal result is found (both of which Edlib does), this would certainly be correct. However, with band shrinking and early stopping, I found it much harder to express the time complexity correctly as a mathematical formula. It is also important to note that I am using O here for worst-case complexity, not average-case complexity.
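To make the band idea concrete, here is a minimal pure-Python sketch of a classic Ukkonen-style band over the edit-distance matrix: only cells within k of the diagonal are computed, so each row costs O(k) instead of O(T). This is an illustration only, not Edlib's implementation (Edlib uses Myers's bit-parallel algorithm and additionally shrinks the band dynamically), and the function name is hypothetical:

```python
def banded_edit_distance(q, t, k):
    """Global edit distance computed only inside a band of half-width k
    around the diagonal. Returns the distance if it is <= k, else None.
    Plain DP sketch, not Edlib's bit-parallel Myers algorithm."""
    INF = float("inf")
    n, m = len(q), len(t)
    if abs(n - m) > k:
        return None  # the band never reaches the bottom-right cell
    # Row 0: D[0][j] = j, but only cells inside the band are kept.
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        cur[0] = i if i <= k else INF
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):  # only ~2k+1 cells per row
            cost = 0 if q[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # substitution / match
                         prev[j] + 1,         # deletion from q
                         cur[j - 1] + 1)      # insertion into q
        prev = cur
    return prev[m] if prev[m] <= k else None
```

Any cell outside the band has distance at least |i - j| > k, so no alignment with cost <= k can pass through it; that is why skipping those cells is safe.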

So let's look at the worst case: say Q is AAA and T is AAGAAGAAGAAA, so Q's optimal alignment is at the very end of T. In that case we have to calculate the whole dynamic-programming matrix to get that far: there is no early stopping, and the band (of width k) does not shrink much. Therefore this takes O(Q\*T) time! One thing I am not completely sure about, though, is exactly how the band behaves in this case; I think it will not shrink much, but maybe the actual complexity is somewhere between O(k\*T) and O(Q\*T)? Hm, I am not completely sure. O(Q\*T) is certainly an upper bound.
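The worst case above can be seen with a plain infix (HW) dynamic program. This sketch is the textbook O(Q\*T) semi-global DP, not Edlib's optimized code: row 0 is all zeros because the match may start anywhere in T, and the answer is the minimum of the last row because it may end anywhere. Since the exact match sits at the very end of T, every column must be processed before the 0 appears:

```python
def hw_edit_distance(q, t):
    """Infix (HW-style) edit distance: gaps in t before and after the
    match are free. Full O(len(q) * len(t)) DP, no band, no early stop."""
    n, m = len(q), len(t)
    prev = [0] * (m + 1)  # row 0 is zero: the match may start anywhere in t
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if q[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return min(prev)  # the match may end anywhere in t

hw_edit_distance("AAA", "AAGAAGAAGAAA")  # == 0, exact match at the end of t
```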

The truth is, in most cases this will not happen, and I am sure the average-case complexity is better than the worst case. Still, from this the worst-case complexity is O(Q\*T), and I don't know how to quantify the average-case one.

Btw., when k is not fixed, things get even more complicated, because Edlib first tries the calculation with k=64, then with k=128, and so on (k = k \* 2), until it finds the optimal solution. Since each attempt costs roughly twice the previous one, it spends at most about 2 times more on calculation than if it immediately knew a k close to the optimal solution.
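The doubling strategy can be sketched like this. All names here are illustrative, not Edlib's API, and the bounded stand-in simply computes the full distance and checks the threshold, whereas Edlib's bounded pass really does only band-limited work; the point is the loop structure and the geometric cost series (k + 2k + 4k + ... ≤ 2 \* final k):

```python
def edit_distance(q, t):
    """Plain O(len(q) * len(t)) global edit distance."""
    prev = list(range(len(t) + 1))
    for i, qc in enumerate(q, 1):
        cur = [i]
        for j, tc in enumerate(t, 1):
            cost = 0 if qc == tc else 1
            cur.append(min(prev[j - 1] + cost, prev[j] + 1, cur[-1] + 1))
        prev = cur
    return prev[-1]

def bounded(q, t, k):
    """Stand-in for a banded O(k * Q) pass: distance if <= k, else None."""
    d = edit_distance(q, t)
    return d if d <= k else None

def align_with_doubling(q, t, start_k=1):
    """Retry with k doubled until the optimal distance fits under k.
    Returns (distance, final k). Edlib starts at k=64, not 1."""
    k = start_k
    while True:
        d = bounded(q, t, k)
        if d is not None:
            return d, k
        k *= 2
```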

Please correct me if I got anything wrong here, or if you have any suggestions. I admit I am not exactly sure how the band behaves in different situations and therefore how it affects complexity, but I still think O(Q\*T) is a pretty good description.

corwinjoy commented 6 years ago

Thanks for getting back to me - that is very helpful! I need to read up a bit more on the theory behind the algorithm and take a closer look at your code to understand the average and worst case complexity here. What you are saying makes sense and I think this is a good starting point for me!

Martinsos commented 6 years ago

Awesome, I am glad it helps :). I wonder, why are you interested in time complexity at all?