Completed Hamming + Levenshtein Distance wrappers

adam-spencer commented 2 years ago

I've implemented, tested and benchmarked wrappers for Hamming and Levenshtein Distance. The Levenshtein distance wrapper is slow because the bio implementation is slow - would be worth someone more knowledgeable having a look.

Closes #4

TheLostLambda commented 2 years ago

Haha, you're definitely right about that awful time complexity on the levenshtein distance Unfortunately it looks like O(n * m) is O(N^2) in our case and it can't be any better algorithm-wise: https://en.wikipedia.org/wiki/Levenshtein_distance#Computational_complexity

Maybe there is some room in improving raw performance though!

TheLostLambda commented 2 years ago

Also, rust-bio has some fancy SIMD versions of the distance functions that speed things up quite a lot! For hamming distance I get: And for levenshtein: Less dramatic, but still nice to have!

Also sounds like performance is better when the sequences are more similar, and I think you chose a near worse-case for that with the IUPAC dna (lots of characters and the reverse sequence is likely a very large distance away from the first!)

I think it's best to benchmark that worst-case, but more realistic applications should hopefully run even faster!

Sheffield-iGEM / syn-zeug

Completed Hamming + Levenshtein Distance wrappers #45