broadinstitute / poolq

The Genetic Perturbation Platform's tool for deconvoluting and quantifying the results of pooled screens

Split barcode policy optimization #13

Closed mtomko closed 1 year ago

mtomko commented 1 year ago

When we match barcodes using a template that contains exactly two barcode regions, each preceded by a prefix, it may be faster to skip matching the whole template. Instead, we can match the first prefix, skip ahead to the expected location of the second prefix, match the second prefix there, and then extract the barcodes if necessary.

This is particularly useful for templates such as

cggtgNNNNNNNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngttccNNNNNNNNNNNNN

which has a very long fixed region whose contents we don't care about.
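As a rough illustration of the idea (not PoolQ's actual code), the shortcut could be sketched in Java as below. The class name `SplitPrefixSketch`, its fields, and its methods are all hypothetical, and the sketch assumes reads and prefixes use the same case and that the gap between the first barcode and the second prefix has a fixed length.

```java
/** Hypothetical sketch of two-prefix matching; not PoolQ's implementation. */
final class SplitPrefixSketch {
    private final String prefix1;   // fixed bases before the first barcode
    private final int barcode1Len;  // length of the first barcode region
    private final int gapLen;       // fixed region whose contents we ignore
    private final String prefix2;   // fixed bases before the second barcode
    private final int barcode2Len;  // length of the second barcode region

    SplitPrefixSketch(String prefix1, int barcode1Len, int gapLen,
                      String prefix2, int barcode2Len) {
        this.prefix1 = prefix1;
        this.barcode1Len = barcode1Len;
        this.gapLen = gapLen;
        this.prefix2 = prefix2;
        this.barcode2Len = barcode2Len;
    }

    /** Returns {barcode1, barcode2} if both prefixes match at this offset, else null. */
    String[] match(String read, int offset) {
        int p2Start = offset + prefix1.length() + barcode1Len + gapLen;
        int end = p2Start + prefix2.length() + barcode2Len;
        if (offset < 0 || end > read.length()) return null;
        // Only the two short prefixes are compared; the gap is never inspected.
        if (!read.regionMatches(offset, prefix1, 0, prefix1.length())) return null;
        if (!read.regionMatches(p2Start, prefix2, 0, prefix2.length())) return null;
        int b1Start = offset + prefix1.length();
        int b2Start = p2Start + prefix2.length();
        return new String[] {
            read.substring(b1Start, b1Start + barcode1Len),
            read.substring(b2Start, b2Start + barcode2Len)
        };
    }

    /** Scans for the first offset at which both prefixes match; null if none. */
    String[] find(String read) {
        int from = 0;
        while (true) {
            int i = read.indexOf(prefix1, from);
            if (i < 0) return null;
            String[] hit = match(read, i);
            if (hit != null) return hit;
            from = i + 1; // first prefix matched but second did not; keep scanning
        }
    }
}
```

The number of character comparisons per candidate offset depends only on the two prefix lengths, not on the length of the ignored region, which is where any speedup over whole-template matching would come from.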

In practice, the improvement has been more modest than I had hoped, but I think the work is worth keeping.

Also, technically this should probably be released as PoolQ 3.7.0, but since we don't have many downstream consumers and the renamed class is not really intended for public consumption, perhaps we can get away with calling it 3.6.1?

mtomko commented 1 year ago

I have removed TemplatePolicy.copy in favor of String#getChars.
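For readers unfamiliar with `String#getChars`, here is a minimal Java sketch of how it copies a range of characters into an existing `char[]` without allocating an intermediate substring. The helper `copyBarcode` and the class `GetCharsSketch` are hypothetical; what `TemplatePolicy.copy` actually did is not shown in this thread, so this only assumes it filled a reusable buffer from a read.

```java
/** Hypothetical helper showing String#getChars; not taken from PoolQ. */
final class GetCharsSketch {
    /**
     * Copies read[start, start + len) into dest beginning at destPos.
     * Returns false instead of throwing when the range does not fit.
     */
    static boolean copyBarcode(String read, int start, int len, char[] dest, int destPos) {
        if (start < 0 || len < 0 || destPos < 0) return false;
        if (start + len > read.length() || destPos + len > dest.length) return false;
        // getChars(srcBegin, srcEnd, dst, dstBegin): srcEnd is exclusive.
        read.getChars(start, start + len, dest, destPos);
        return true;
    }
}
```

Two barcodes from two reads can be written side by side into one preallocated buffer by calling the helper twice with different `destPos` values.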

Regarding benchmarking, PoolQ used to have a JMH benchmark module dating from the original PoolQ2 to PoolQ3 migration, but it wasn't well maintained and never revealed much that was useful during the PoolQ3 implementation process. I ended up removing it while releasing PoolQ 3.5 because it was out of date, unused, and made some aspects of the build harder to work with. I agree that having it would have been helpful for this project and probably others. I'll think about putting it back in some form.