Sheffield-iGEM / syn-zeug

A modern toolbox for synthetic biology
https://sheffield-igem.github.io/syn-zeug/
GNU Affero General Public License v3.0

Completed Hamming + Levenshtein Distance wrappers #45

Closed by adam-spencer 1 year ago

adam-spencer commented 1 year ago

I've implemented, tested, and benchmarked wrappers for Hamming and Levenshtein distance. The Levenshtein distance wrapper is slow because the underlying bio implementation is slow; it would be worth someone more knowledgeable having a look.
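For readers unfamiliar with the first of the two metrics, here is a minimal plain-Rust sketch of Hamming distance. It is not the wrapper code from this PR and not the bio implementation; the function name `hamming` and the choice to return `None` for unequal lengths are assumptions made for illustration.

```rust
/// Hamming distance: the number of positions at which two equal-length
/// sequences differ. Returns `None` when the lengths differ, because the
/// metric is undefined in that case (an illustrative choice, not the PR's API).
pub fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    // Walk both sequences in lockstep and count mismatching positions.
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}
```

This is a single linear pass over the input, which is why Hamming distance benchmarks so much better than Levenshtein distance on the same sequences.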

Closes #4

TheLostLambda commented 1 year ago

Haha, you're definitely right about that awful time complexity on the Levenshtein distance! Unfortunately, it looks like O(n * m) is O(N^2) in our case, since both sequences are the same length, and it can't be any better algorithm-wise: https://en.wikipedia.org/wiki/Levenshtein_distance#Computational_complexity

Maybe there is some room in improving raw performance though!
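To make the O(n * m) cost concrete, here is a minimal sketch of the textbook dynamic-programming algorithm in plain Rust. It is not the bio implementation; the name `levenshtein` and the two-row layout are choices made for this example.

```rust
/// Textbook Levenshtein distance using two rows of the DP table.
/// The nested loops visit every (i, j) pair once, so the running time is
/// O(n * m) for inputs of length n and m; memory is O(m).
pub fn levenshtein(a: &[u8], b: &[u8]) -> usize {
    // Row 0: turning the empty prefix of `a` into b[..j] takes j insertions.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, &x) in a.iter().enumerate() {
        // Turning a[..=i] into the empty string takes i + 1 deletions.
        curr[0] = i + 1;
        for (j, &y) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(x != y);
            let delete = prev[j + 1] + 1;
            let insert = curr[j] + 1;
            curr[j + 1] = substitute.min(delete).min(insert);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

When both inputs have length N, the inner statement runs N * N times, so any speed-up has to come from doing each cell faster or from skipping cells, not from a better asymptotic bound.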

TheLostLambda commented 1 year ago

Also, rust-bio has some fancy SIMD versions of the distance functions that speed things up quite a lot! For Hamming distance I get:

[image]

And for Levenshtein:

[image]

Less dramatic, but still nice to have!

It also sounds like performance is better when the sequences are more similar, and I think you chose a near worst-case for that with the IUPAC DNA (lots of characters, and the reverse sequence is likely a very large distance away from the first!)

I think it's best to benchmark that worst-case, but more realistic applications should hopefully run even faster!
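One general reason similarity can help is that a distance routine may only need to fill a narrow band of the table when the answer is small. The sketch below shows that idea (a banded search with a doubling threshold) in plain Rust. Whether rust-bio's SIMD functions work this way is an assumption, not something confirmed in this thread, and the names `bounded_levenshtein` and `levenshtein_doubling` are hypothetical.

```rust
/// Levenshtein distance if it is at most `k`, otherwise `None`.
/// Only cells with |i - j| <= k are filled in, so the work is O(k * n)
/// instead of O(n * m).
pub fn bounded_levenshtein(a: &[u8], b: &[u8], k: usize) -> Option<usize> {
    let (n, m) = (a.len(), b.len());
    // The distance never exceeds the longer length, so a larger k adds nothing.
    let k = k.min(n.max(m));
    // The length difference is a lower bound on the distance.
    if n.abs_diff(m) > k {
        return None;
    }
    let inf = k + 1; // stands for "more than k"
    let mut prev = vec![inf; m + 1];
    let mut curr = vec![inf; m + 1];
    for j in 0..=k.min(m) {
        prev[j] = j;
    }
    for i in 1..=n {
        let lo = i.saturating_sub(k);
        let hi = (i + k).min(m);
        if lo == 0 {
            curr[0] = i; // first column is still inside the band
        } else {
            curr[lo - 1] = inf; // cell just left of the band is out of reach
        }
        for j in lo.max(1)..=hi {
            let substitute = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let delete = prev[j] + 1;
            let insert = curr[j - 1] + 1;
            // Clamp to `inf` so values outside the threshold never grow.
            curr[j] = substitute.min(delete).min(insert).min(inf);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    if prev[m] <= k { Some(prev[m]) } else { None }
}

/// Exact distance by doubling the threshold until the bounded search succeeds.
/// Similar sequences finish after a few narrow bands; very different ones
/// (such as a sequence and its reverse) end up filling most of the table.
pub fn levenshtein_doubling(a: &[u8], b: &[u8]) -> usize {
    let mut k = 1;
    loop {
        if let Some(d) = bounded_levenshtein(a, b, k) {
            return d;
        }
        k *= 2;
    }
}
```

Under this scheme the total work is roughly proportional to the distance times the sequence length, which would be consistent with the reversed IUPAC benchmark being close to the slowest case.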