Sheffield-iGEM / syn-zeug

A modern toolbox for synthetic biology
https://sheffield-igem.github.io/syn-zeug/
GNU Affero General Public License v3.0
6 stars 3 forks source link

Implement New Tool: "Hamming Distance" #4

Closed TheLostLambda closed 2 years ago

TheLostLambda commented 2 years ago

What should this tool do? This tool should count the number of differences or "point mutations" when two equal-length sequences are aligned and compared.

Is there an existing reference implementation? See this Rosalind Implementation.

What are the tool's inputs? Two equal-length sequences (of DNA, RNA, or Protein)

What is the tool's output? A usize count of the bases / residues that differ

Other Implementation Details See these Examples for a better idea of how hamming distance is calculated!

TheLostLambda commented 2 years ago

I've just found this in my adventuring about: https://docs.rs/bio/latest/bio/alignment/distance/fn.hamming.html

Not quite sure how I missed that before... Since you've already written your own version, I say you open a PR with that one and we can compare to the RustBio version in the future :)

adam-spencer commented 2 years ago

From having a look at the RustBio version, this tool is incredibly simple and has a minimum time complexity of O(n). They are using the fastest possible method AFAIK - would it be worth just making a call to bio::alignment::distance::hamming rather than re-writing their function? I'd understand rewriting if you'd like to remove the bio/ folder for 'production', though.

I suppose the only work otherwise would be to allow a user to input two sequences simultaneously; from this the hamming_distance() function could just pass the content of the sequences into the RustBio implementation. Is this something you'd like me to have a shot at or are you working on this independently? (If it's already in there somewhere please forgive my lack of reading)

TheLostLambda commented 2 years ago

Looking at the source code, it does look to be about as simple as it gets! And I think we'll always build on top of Rust bio, even after I upstream the changes and move the fork out of this repository, so I'm fully on board with just calling the rust bio version and wrapping some things in our nicer method interface + creating a new error type for reporting the sequence length mismatch error (it would be ideal if we never triggered that assert_eq! here: https://docs.rs/bio/latest/src/bio/alignment/distance.rs.html#27-42

While you're at it, it would be great to add the levenshtein distance as a second method — again just wrapping the rust-bio version: https://docs.rs/bio/latest/bio/alignment/distance/fn.levenshtein.html

It might be worth benchmarking things if you think there might be a faster way to write something. The GC content code that we ended up with is quite a bit faster than the code in rust-bio. Same time-complexity obviously, but just quicker.

Overall though, wrappers for both of those distance functions would be great!

adam-spencer commented 2 years ago

Got it!

Do you already have a method for inputting multiple sequences at the same time? Without this interface there's no way to compute hamming distance between two independent sequences.

TheLostLambda commented 2 years ago

In terms of UI, that's something that still needs to be worked on, but the library half of things should be fine to write now. I'd make the type of the method something like pub fn hamming_distance(&self, other: &Self) -> Result<usize, Error> or something like that where you can call it like seq1.hamming_distance(&seq2) or something similar. I haven't actually tested that type of fully thought through it though, so feel free to change it some if need be!

Does that answer your question alright?

adam-spencer commented 2 years ago

That's perfect, was just asking to ensure I hadn't missed something. Thanks!

adam-spencer commented 2 years ago

Another question; should it be possible for a user to computer either distance measure on sequences of differing kinds? (i.e. compare DNA and RNA)

TheLostLambda commented 2 years ago

Another question; should it be possible for a user to computer either distance measure on sequences of differing kinds? (i.e. compare DNA and RNA)

After doing some thinking on this, I don't think that would really make too much sense. I could imagine some people wanting to perhaps in some quite edge cases, but I think that the majority of the time it would be an accidental comparison between similar things like DNA and RNA that give unhelpful results.

I'd allow comparing between different alphabets (IUPAC vs N or Base should be fine) but I think that comparing different kinds should throw an error personally!

adam-spencer commented 2 years ago

Hey! I've been at home this week thus not been doing much work - just restarted on the train.

I've implemented wrappers for Hamming and Levenshtein distances (yet to write tests) and from running some benchmarks have found that bio::alignment::distance::levenshtein is very slow. Levenshtein runs in O(n*m) [where n,m are the lengths of the sequences being compared] so I'll have a look at creating a more optimised version myself.

This may be worth your attention because I'm by no means a knowledgable Rust programmer :-)