Sheffield-iGEM / syn-zeug

A modern toolbox for synthetic biology
https://sheffield-igem.github.io/syn-zeug/
GNU Affero General Public License v3.0
6 stars 3 forks source link

Consider Performance Improvements from an Iterator API #19

Closed TheLostLambda closed 2 years ago

TheLostLambda commented 2 years ago

Currently, if you're chaining tools together, every tool needs to pass over the sequence once. This can result in a lot of repeated passes over the same sequence. Rust solves this problem using lazy iterators that can be built up to carry out complex transformations in a single pass, but I'm not sure how this API would look in syn-zeug.

I need to figure out how to implement methods on iterators, how they are chained, and how things look when exported to WebAssembly. It's not much of a problem yet, but it could massively speed up longer pipelines in the future. I'd need to carry out some real-world benchmarks to decide if it's worth the trouble.

Food for thought?

TheLostLambda commented 2 years ago

I think the proper way to do things here is definitely to use some sort of "iterator-like" type. I think I should create a SeqIter and make use of the delegate to share methods between these types so that you don't need to bookend every operation with an iter() and collect().

I don't really know if this is a stellar idea just yet — I think I'd like to essentially keep the user-facing API identical to how it is at the moment, but avoid multiple passes using an intermediate type.

Perhaps also importantly, this is a second place to use the delegate crate, with the first being proper no-copy SubSeqs. I think after this is implemented using delegate, I should revisit the SubSeq as slices with metadata idea.

TheLostLambda commented 2 years ago

Okay, new plan because that one seemed a bit messy! I'm writing a new type, LazyVec, which is essentially the same as the unstable std::lazy::Lazy, but with the ability to chain functions on the initialiser.

Perhaps I should make the type a little more general overall, it could be something like LazyChain and just add a method to Lazy for taking the initialiser function and chaining it with another

TheLostLambda commented 2 years ago

Well, I've done a bit more thinking on this, and it seems like the only two ways to store an iterator in a struct are to use generics or a trait object (as iterator chains generate unique, nested types: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9c7d8ed6d1b7d4adf7cd03441b61503b)

Because I certainly don't want to leak any internal implementation details in the type of Seq, I think my hand is forced to use trait objects. I'm not thrilled about the performance implications of this, but with some benchmarking and enough trait constraints (still allowing for optimizations), it might be alright.

I should actually start by setting up some benchmarks, then I'll work from there!

TheLostLambda commented 2 years ago

Well, this is embarrassing (and predictable). This is a massive premature optimization. Trying to convert the translation function to take only iterators (something that would be required to move to this iterator API) nearly halves performance when compared to working on the Vec directly.

The code for this single-pass chaining is complex and, though I've not written it all yet, there doesn't seem to be too much to gain (DNA -> RNA takes only nanoseconds, while translation itself crawls on the order of microseconds, so the additional nanosecond pass is invisible).

I learned a lot in planning this, but I think I'll spin off a new branch (in case I ever want to revisit this) and just move on without this iterator approach for now.

Rest in peaces, premature optimization...

TheLostLambda commented 2 years ago

Here is where things have ended up: https://github.com/Sheffield-iGEM/syn-zeug/tree/iterator_optimisation

TheLostLambda commented 2 years ago

I'll close this for the time being, but might revisit it in the future!