Sheffield-iGEM / syn-zeug

A modern toolbox for synthetic biology
https://sheffield-igem.github.io/syn-zeug/
GNU Affero General Public License v3.0

feat(seq): add support for different alphabets #30

Closed TheLostLambda closed 2 years ago

TheLostLambda commented 2 years ago

Closes #21

TheLostLambda commented 2 years ago

This new verification is a bit of a nightmare. I've fixed the O(n * m) time complexity of the find and is_word based approach, where a failed validation meant rescanning the whole sequence for every alternative alphabet (running is_word over the full input again for each one – up to 8 times), but that fix slows down the best-case algorithm. Despite my best efforts, I can't quite get this iterate-over-seq approach to be as fast as the iterate-over-ALPHABET approach (perhaps because my loop contains hard-to-optimize branching?).
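
For context, the "iterate over ALPHABET" shape is basically this (a sketch – the alphabet constants and function name are stand-ins, not the real tables in the crate):

```rust
// Hypothetical stand-ins for the real alphabet tables.
const DNA: &[u8] = b"ACGT";
const DNA_N: &[u8] = b"ACGTN";

/// Naive "iterate over ALPHABET" shape: each failed alphabet rescans
/// the whole sequence from the start, so a sequence matching only the
/// last of m alphabets costs O(n * m).
fn first_matching_alphabet<'a>(seq: &[u8], alphabets: &[&'a [u8]]) -> Option<&'a [u8]> {
    alphabets
        .iter()
        .copied()
        .find(|a| seq.iter().all(|c| a.contains(c)))
}

fn main() {
    // 'N' fails DNA near the end, forcing a full rescan with DNA_N.
    assert_eq!(first_matching_alphabet(b"ACGTNACGT", &[DNA, DNA_N]), Some(DNA_N));
}
```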

Anyways, if I can't crack that (and I really should give up on it soon), then I should try a hybrid approach: when the function is called with only one legal alphabet, it switches to the faster algorithm. This is a bit messy because something like dna_iupac would have 3 alphabets to test in the worst case...

TheLostLambda commented 2 years ago

On second thought, I'm actually a god (both when it comes to programming and being humble). I tried a final rewrite on a whim, with the outer loop again being the list of alphabets and the inner loop now using position instead of find. It saves the position of the first mismatch, then tries to find the next mismatch with the new alphabet, but it only scans from where it left off!
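
In sketch form, the idea was something like this (alphabet tables and function name are placeholders, and note the subset caveat that comes back to bite below):

```rust
// Illustrative alphabets; the real tables live elsewhere in the crate.
const DNA: &[u8] = b"ACGT";
const DNA_N: &[u8] = b"ACGTN";

/// "Resume from the last mismatch": the outer loop walks the alphabet
/// list, `position` finds the next invalid byte, and each new alphabet
/// picks up scanning where the previous one failed instead of
/// restarting at index 0. (Only correct when every alphabet is a
/// subset of the next one tried – see the later comments!)
fn first_matching_alphabet<'a>(seq: &[u8], alphabets: &[&'a [u8]]) -> Option<&'a [u8]> {
    let mut start = 0;
    for &alphabet in alphabets {
        match seq[start..].iter().position(|c| !alphabet.contains(c)) {
            // No mismatch in the unscanned tail: this alphabet fits.
            None => return Some(alphabet),
            // Save the mismatch and try the next alphabet from here.
            Some(offset) => start += offset,
        }
    }
    None
}

fn main() {
    assert_eq!(first_matching_alphabet(b"ACGTN", &[DNA, DNA_N]), Some(DNA_N));
}
```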

It's not only the cleanest (i.e. the nicest to read / most canonical), but the worst and best cases are now also indistinguishable from the old best-case algorithm.

Elegant and fucking fast – checking all possible alphabets with no measurable performance difference from checking just one.

This PR still needs a lot of new tests and code cleanup, but the optimization and core functionality are immaculate...

new.zip

TheLostLambda commented 2 years ago

My hubris has cost me dearly... My super-fast algorithm only works when every alphabet in the list is a subset of the next one tried. If I'm looping over alphabets, I would need to rescan the already-scanned prefix to avoid this bug.

Back to iterating over the sequence, maybe using Vec::retain() so that every char has the chance to disqualify an alphabet.
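
Roughly what that looks like (a sketch; the constants and function name are made up for illustration):

```rust
const DNA: &[u8] = b"ACGT";
const RNA: &[u8] = b"ACGU";

/// "Iterate over the sequence" shape: every byte gets one chance to
/// disqualify each surviving alphabet, so the sequence is only ever
/// scanned once, at the cost of a `retain` pass per byte.
fn matching_alphabets<'a>(seq: &[u8], alphabets: &[&'a [u8]]) -> Vec<&'a [u8]> {
    let mut candidates = alphabets.to_vec();
    for c in seq {
        candidates.retain(|a| a.contains(c));
        if candidates.is_empty() {
            break; // No alphabet can match; stop scanning early.
        }
    }
    candidates
}

fn main() {
    assert_eq!(matching_alphabets(b"ACGU", &[DNA, RNA]), vec![RNA]);
}
```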

Need to rebenchmark against the old worst and best cases!

TheLostLambda commented 2 years ago

To be fair, I should try to work out an ordering where each alphabet is a subset of the next. If the one I find is nice enough, I could just roll with that. I'll have to see whether that ordering (there should be only one possible, if one exists at all) is actually sane. If it's too silly, then we'll have to go back to the drawing board.

TheLostLambda commented 2 years ago

Nope, this optimization is doomed. It requires that everything already scanned is also valid in the next alphabet, which is only possible if the next alphabet is a superset of the current one, but that isn't satisfiable:

assert!(dna.is_superset(&rna) || rna.is_superset(&dna));
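
Spelled out with std's HashSet so that assert is runnable (the character sets here are just the obvious illustrative ones): DNA has T but not U, and RNA has U but not T, so neither side can ever hold.

```rust
use std::collections::HashSet;

fn main() {
    let dna: HashSet<char> = "ACGT".chars().collect();
    let rna: HashSet<char> = "ACGU".chars().collect();

    // DNA has 'T' but not 'U'; RNA has 'U' but not 'T'.
    // Neither set can contain the other, so this always panics.
    assert!(dna.is_superset(&rna) || rna.is_superset(&dna));
}
```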

That's quite literally a set-theory proof that I'm full of bullshit and need to change the algorithm. Perhaps I'll have to restore that single-possibility optimization I discussed before my doomed revelation.

TheLostLambda commented 2 years ago

Just to document this: I tried looping over the elements of the sequence and calling Vec::retain() on the list of alphabets, and it was better than the current worst case, but only barely.

Parallelism with Rayon makes the current worst case faster, but it's still slower than the current best case. I think I'll need to roll with the literal original solution and test it against some real data. But I'm not doing that for a while – and "a while" means the 11th at the soonest.

TL;DR: I need to test with real sequences to see if the worst-case actually comes up often – I'll probably keep the original solution if it doesn't
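
For reference, one plausible shape for the Rayon version – just fanning the naive per-alphabet scan out over threads (alphabet tables and function name are illustrative, and it assumes rayon as a dependency):

```rust
use rayon::prelude::*;

const DNA: &[u8] = b"ACGT";
const RNA: &[u8] = b"ACGU";

/// Check the candidate alphabets in parallel: each Rayon task runs the
/// simple whole-sequence scan, so the worst case shrinks with the core
/// count, but the best case still pays the scheduling overhead.
fn first_matching_alphabet<'a>(seq: &[u8], alphabets: &[&'a [u8]]) -> Option<&'a [u8]> {
    alphabets
        .par_iter()
        .copied()
        .find_first(|a| seq.iter().all(|c| a.contains(c)))
}

fn main() {
    assert_eq!(first_matching_alphabet(b"ACGU", &[DNA, RNA]), Some(RNA));
}
```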

TheLostLambda commented 2 years ago

Maybe also try a hybrid approach where the offending character is used to filter the alphabets before restarting the scan from the beginning. For rare odd characters, this would skip rescans that are doomed to fail on the same character that disqualified the last alphabet.

Ultimately, I need real data to test on...

EDIT: Real data looks hard to come by, so this is an optimization worth trying...

TheLostLambda commented 2 years ago

Okay, I did the optimization. I'm sorry. But here it is:

It's an optimistic algorithm, optimized for speed on its first try, but it makes use of the information gleaned from a mismatch – ensuring the mismatched character is part of the next alphabet tried.
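
In sketch form (again with stand-in alphabet tables, not the literal committed code):

```rust
const DNA: &[u8] = b"ACGT";
const DNA_N: &[u8] = b"ACGTN";
const RNA: &[u8] = b"ACGU";

/// Optimistic scan: validate against the first candidate alphabet, and
/// on a mismatch use the offending byte to throw out every alphabet
/// that would fail on that same byte before restarting from the start.
fn first_matching_alphabet<'a>(seq: &[u8], alphabets: &[&'a [u8]]) -> Option<&'a [u8]> {
    let mut candidates = alphabets.to_vec();
    while let Some(&alphabet) = candidates.first() {
        match seq.iter().find(|&c| !alphabet.contains(c)) {
            // The whole sequence is valid in this alphabet.
            None => return Some(alphabet),
            // Drop every candidate that also lacks the offending byte,
            // so a rescan can't die on the same character again.
            Some(bad) => candidates.retain(|a| a.contains(bad)),
        }
    }
    None
}

fn main() {
    // 'U' disqualifies DNA and DNA_N in one step; only RNA is rescanned.
    assert_eq!(first_matching_alphabet(b"ACGU", &[DNA, DNA_N, RNA]), Some(RNA));
}
```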

This is probably what I want: these long rescans are likely rare, and I'm still doing something clever with the mismatch info.