TheLostLambda closed this pull request 2 years ago
This new verification is a bit of a nightmare. I've fixed the O(n * m) time complexity, where a `find`- and `is_word`-based approach would rescan the whole sequence if it failed to validate and needed to try a different alphabet at the end (running `is_word` again for every alphabet – up to 8 times – over the whole input), but that fix slows down the best-case algorithm. Despite my best efforts, I can't quite manage to get this iterate-over-`seq` approach to be as fast as the iterate-over-`ALPHABET` approach (perhaps because my loop contains hard-to-optimize branching?).
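For reference, a minimal sketch of the `ALPHABET`-outer shape, with byte-slice alphabets and made-up names (not this crate's real API). Each candidate alphabet triggers a fresh `is_word`-style scan from the start, which is exactly the repeated-rescan cost described above:

```rust
// Illustrative only: alphabets as plain byte slices, not the real types.
const DNA: &[u8] = b"ACGT";
const RNA: &[u8] = b"ACGU";

// "Iterate over ALPHABET": one full `is_word`-style scan per alphabet.
// A sequence that fails near its end is rescanned from scratch for
// every candidate alphabet, hence the O(n * m) worst case.
pub fn first_match_outer(seq: &[u8], alphabets: &[&[u8]]) -> Option<usize> {
    alphabets
        .iter()
        .position(|a| seq.iter().all(|c| a.contains(c)))
}
```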
Anyways, if I can't crack that (and I really should give up on it soon), then I should try a hybrid approach, so that when the function is called with only one legal alphabet, it switches to the faster approach. This is a bit messy because something like `dna_iupac` would have 3 alphabets to test in the worst case...
On second thought, I'm actually a god (both when it comes to programming and being humble). I tried a final rewrite on a whim, with the outer loop again being the list of alphabets and the inside now using `position` instead of `find`. It saves the position of the first mismatch, then tries to find the next mismatch with the new alphabet, but it only scans starting from where it left off!
It's not only the cleanest version (i.e. nicest to read / most canonical), but its worst and best cases are now indistinguishable from the old best-case algorithm.
Elegant and fucking fast – checking all possible alphabets with no measurable performance difference from checking just one.
This PR still needs a lot of new tests and code cleanup, but the optimization and core functionality are immaculate...
My hubris has cost me dearly... My super fast algorithm only works when every alphabet in the list is a subset of the last. If I'm looping over alphabets, I would need to rescan the already scanned sequence to avoid this bug.
Back to iterating over the sequence, maybe using `Vec::retain()` so that every char has the chance to disqualify an alphabet.
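A sketch of that idea (byte-slice alphabets and an illustrative name again): one pass over the sequence, with `Vec::retain()` letting every character knock alphabets out of the running:

```rust
// Illustrative only: alphabets as plain byte slices.
const DNA: &[u8] = b"ACGT";
const RNA: &[u8] = b"ACGU";

// Single pass over `seq`; returns the indices of the alphabets that
// survive every character. Bails out early once nothing is left.
pub fn surviving_alphabets(seq: &[u8], alphabets: &[&[u8]]) -> Vec<usize> {
    let mut live: Vec<usize> = (0..alphabets.len()).collect();
    for c in seq {
        // Each character gets the chance to disqualify alphabets.
        live.retain(|&i| alphabets[i].contains(c));
        if live.is_empty() {
            break;
        }
    }
    live
}
```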
Need to re-benchmark against the old worst and best cases!
To be fair, I should try to work out an ordering where they are always subsets. If the one I find is nice enough I could just roll with that. I'll have to see if that ordering (should only be one possible, if there is one at all) is actually sane. If it's too silly, then we'll have to go back to the drawing board.
Nope, this optimization is doomed. It requires that everything already scanned is also valid in the next alphabet. That's only possible if the next alphabet is a superset of the previous one, but that isn't satisfiable:
```rust
assert!(dna.is_superset(&rna) || rna.is_superset(&dna));
```
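Concretely: DNA has `T` but no `U`, and RNA has `U` but no `T`, so neither direction of that check can hold (sketched here with `HashSet`s standing in for the real alphabet types):

```rust
use std::collections::HashSet;

// Stand-in for the real alphabet types: just a set of allowed bytes.
pub fn byte_set(chars: &[u8]) -> HashSet<u8> {
    chars.iter().copied().collect()
}

// Neither DNA ⊇ RNA nor RNA ⊇ DNA, because of 'T' vs 'U', so no
// superset ordering over both alphabets can exist.
pub fn superset_ordering_exists() -> bool {
    let dna = byte_set(b"ACGT");
    let rna = byte_set(b"ACGU");
    dna.is_superset(&rna) || rna.is_superset(&dna)
}
```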
That's quite literally a set-theory proof that I'm full of bullshit and need to change the algorithm. Perhaps I'll have to restore that single-possibility optimization I discussed before my doomed revelation.
Just to document this: I tried looping over the elements of the sequence and calling `Vec::retain()` on the list of alphabets, and it was better than the current worst case, but only barely.
Parallelism with Rayon makes the current worst case faster, but it's still slower than the current best case. I think I'll need to roll with the literal original solution and test it against some real data. But I'm not doing that for a while. And "a while" means the 11th at the soonest.
TL;DR: I need to test with real sequences to see if the worst-case actually comes up often – I'll probably keep the original solution if it doesn't
Maybe also try a hybrid approach where the offending character is used to filter the alphabets, then the scan restarts from the beginning. For rare odd characters, this skips rescanning the beginning when it's doomed to fail on the same character as the last alphabet did.
Ultimately, I need real data to test on...
EDIT: Real data looks hard to come by, so this is an optimization worth trying...
Okay, I did the optimization. I'm sorry. But here it is:
It's an optimistic algorithm optimized for speed on its first try, but it makes use of the information gleaned by a mismatch – ensuring the mismatched character is part of the next alphabet that's tried.
This is probably what I want: these long rescans are likely rare, and I'm still doing something clever with that mismatch info.
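As a sketch (byte-slice alphabets and a made-up name once more, not the real signature): scan optimistically, and on a mismatch let the offending character prune every remaining alphabet that would also reject it, before restarting from the top:

```rust
// Illustrative only: alphabets as plain byte slices.
const DNA: &[u8] = b"ACGT";
const RNA: &[u8] = b"ACGU";

// Optimistic scan: try the first surviving alphabet from the start of
// `seq`; on a mismatch, the offending byte disqualifies every alphabet
// missing it (including the current one), then the scan restarts.
pub fn find_alphabet(seq: &[u8], alphabets: &[&[u8]]) -> Option<usize> {
    let mut live: Vec<usize> = (0..alphabets.len()).collect();
    while let Some(&i) = live.first() {
        match seq.iter().find(|&c| !alphabets[i].contains(c)) {
            // No mismatch at all: this alphabet validates the sequence.
            None => return Some(i),
            // Keep only the alphabets that accept the offending byte.
            Some(bad) => live.retain(|&j| alphabets[j].contains(bad)),
        }
    }
    None
}
```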
Closes #21