Closed adam-spencer closed 2 years ago
Haha, you're definitely right about that awful time complexity on the levenshtein distance Unfortunately it looks like O(n * m) is O(N^2) in our case and it can't be any better algorithm-wise: https://en.wikipedia.org/wiki/Levenshtein_distance#Computational_complexity
Maybe there is some room in improving raw performance though!
Also, rust-bio
has some fancy SIMD versions of the distance functions that speed things up quite a lot! For hamming distance I get:
And for levenshtein:
Less dramatic, but still nice to have!
Also sounds like performance is better when the sequences are more similar, and I think you chose a near worse-case for that with the IUPAC dna (lots of characters and the reverse sequence is likely a very large distance away from the first!)
I think it's best to benchmark that worst-case, but more realistic applications should hopefully run even faster!
I've implemented, tested and benchmarked wrappers for Hamming and Levenshtein Distance. The Levenshtein distance wrapper is slow because the bio implementation is slow - would be worth someone more knowledgeable having a look.
Closes #4