choderalab / assaytools

Modeling and Bayesian analysis of fluorescence and absorbance assays.
http://assaytools.readthedocs.org
GNU Lesser General Public License v2.1

Optimize GeneralBindingModel #90

Open jchodera opened 7 years ago

jchodera commented 7 years ago

From some profiling tests, it looks like root solving in the GeneralBindingModel is the dominant cost of the new autoprotocol-based assaytools changes: roughly 76% of total sampling time is spent in calls to scipy.optimize.root here.

I imagine we could greatly speed this up by rewriting the target function using some optimization scheme.
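(For orientation, the hot path looks roughly like the sketch below: a Python-level target function, returning residuals and an analytic Jacobian, handed to scipy.optimize.root and evaluated many times per solve. The function body and shapes here are placeholders, not the actual GeneralBindingModel code.)

```python
import numpy as np
from scipy.optimize import root

def ftarget(x):
    # Placeholder residuals and Jacobian; the real target encodes the
    # log-equilibrium and conservation equations of the binding model.
    f = np.array([x[0] + x[1] - 1.0,
                  x[0] - 2.0 * x[1]])
    J = np.array([[1.0,  1.0],
                  [1.0, -2.0]])
    return f, J

# jac=True tells scipy that the target returns (residuals, Jacobian) as a pair
solution = root(ftarget, x0=np.zeros(2), jac=True, method='hybr')
print(solution.x)
```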

jchodera commented 7 years ago

@pgrinaway @gregoryross @maxentile : If you had any advice on how to optimize the target function with minimal effort, I'd be very appreciative of suggestions!

pgrinaway commented 7 years ago

Taking a look at this now.

jchodera commented 7 years ago

Thanks!

I realized that even just encoding the quantities as numpy arrays outside of the target function will likely lead to a big speedup, but I was wondering if there was some sort of numba or other thing that might trivially speed this up.

pgrinaway commented 7 years ago

Ok, the most obvious way of optimizing this would be to implement it in Cython, but there are a few tricky points.

It's not substantially easier with numba: with both numba and Cython you'll be able to take this code and compile it directly (I think; numba might not be happy with the closure, I've never tried it), but you probably won't see much of a performance increase.

So it would require a bit of rethinking how the function works (which, even in pure Python/numpy, might make the code faster). I can take a stab at this, if you want.

pgrinaway commented 7 years ago

I don't think getting a speedup from numba would be trivial, but it's easy to try.

jchodera commented 7 years ago

I think the key to speeding things up is to speed up just the target function that gets called many times by the root finder. It accepts a numpy array and returns two numpy arrays. I should be able to move some logic outside the target function and do something to speed up the computation inside the target.

We can still have the binding model function accept and return dicts, but it can do some preprocessing so that the target function runs as fast as possible on numpy arrays.
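(A sketch of the kind of preprocessing meant here, assuming the reactions come in as dicts mapping species names to stoichiometries; the names and dict layout are illustrative, not the exact GeneralBindingModel API. The dicts are flattened into dense arrays once, outside the target, so the inner function only touches numpy arrays.)

```python
import numpy as np

def build_stoichiometry_matrix(reactions, species):
    """Flatten reaction dicts into dense arrays.

    reactions : list of (log_K, {species_name: stoichiometry}) pairs
    species   : ordered list of species names
    (illustrative layout; the real model may organize these differently)
    """
    index = {name: j for j, name in enumerate(species)}
    S = np.zeros((len(reactions), len(species)))
    log_K = np.zeros(len(reactions))
    for i, (logK_i, stoich) in enumerate(reactions):
        log_K[i] = logK_i
        for name, nu in stoich.items():
            S[i, index[name]] = nu
    return S, log_K

# Example: R + L <-> RL with dissociation constant ~1 uM
species = ['R', 'L', 'RL']
reactions = [(np.log(1e-6), {'RL': -1, 'R': +1, 'L': +1})]
S, log_K = build_stoichiometry_matrix(reactions, species)
```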

maxentile commented 7 years ago

Also took a look!

Yeah, numba doesn't know what to do with dicts (code using dictionaries will run in "object mode" / as slow as interpreted code), so those will need to be flattened out into arrays beforehand for numba to have any effect...

@jchodera : Could you point to a few input instances to optimize for? On the input in the doctest, the PyCharm profiler says the most time-consuming functions are in BLAS, so I think loop-jitting might not produce a substantial speed-up on that instance. However, if there are a lot more reactions, species, or conservation_equations, the interpreter overhead would be more substantial.

jchodera commented 7 years ago

Here is the description of the calculation in equations, for reference:

http://assaytools.readthedocs.io/en/latest/theory.html#general-binding-model
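(Roughly, and up to the sign and naming conventions spelled out on that page, the solver looks for log concentrations that satisfy one residual per reaction and one per conservation law:)

```latex
% one residual per reaction i, with stoichiometries \nu_{ij} and equilibrium constant K_i
\sum_j \nu_{ij} \ln c_j - \ln K_i = 0
% one residual per conservation law k: the species in group k must sum to the known total
\ln\Big(\sum_{j \in \mathrm{group}\ k} c_j\Big) - \ln c_k^{\mathrm{tot}} = 0
```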

jchodera commented 7 years ago

Whoops, let me post the branch name and info on how to run the example shortly. Sorry about that!

I do realize we don't want to have any dict processing inside the target function. I can do preprocessing outside the target and feed numpy arrays in. But is that the best I can do?

pgrinaway commented 7 years ago

> I think the key to speeding things up is to speed up just the target function that gets called many times by the root finder. It accepts a numpy array and returns two numpy arrays. I should be able to move some logic outside the target function and do something to speed up the computation inside the target.

Right, this is what I meant. If we deal with dicts and resizing lists inside the target function, it's going to be pretty difficult to optimize.

pgrinaway commented 7 years ago

> I do realize we don't want to have any dict processing inside the target function. I can do preprocessing outside the target and feed numpy arrays in. But is that the best I can do?

Once you've done that, you can give numba a try. But my experience with numba is that you end up having to write essentially Cythonic/C-style code to get a substantial speedup: see what we ended up with in perses. All types are fully specified, and the numba compiler is even instructed to fail if it can't avoid using the interpreter. I spent a bit of time gradually optimizing those routines, and ended up with what you see in that link.

Same with Cython--you can see what I did in yank. I also gradually made the code more and more C-like to get maximum performance.
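(For context, "fully specified types / fail rather than fall back to the interpreter" corresponds to numba's nopython mode with an explicit signature. A toy sketch of that style for a reaction-residual loop, not the actual perses or yank code:)

```python
import numpy as np
from numba import njit

# Illustrative only: explicit loops, fully specified types, and nopython mode,
# so compilation fails instead of silently falling back to the interpreter.
@njit("float64[:](float64[:], float64[:,:], float64[:])", nogil=True, cache=True)
def reaction_residuals(x, S, log_K):
    nrxn, nspec = S.shape
    out = np.empty(nrxn)
    for i in range(nrxn):
        acc = 0.0
        for j in range(nspec):
            acc += S[i, j] * x[j]
        out[i] = acc - log_K[i]
    return out

# quick smoke test on made-up numbers
S = np.array([[1.0, 1.0, -1.0]])
x = np.array([-13.8, -13.8, -20.7])
log_K = np.array([-18.4])
print(reaction_residuals(x, S, log_K))
```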

pgrinaway commented 7 years ago

At first glance it might not fit, but you might be able to use numexpr somewhere. I suspect that might take as much effort as refactoring for Cython, though.
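(In case it's useful, the numexpr pattern being suggested is roughly this, on toy arrays; whether any hot expression in the target is large enough to benefit is exactly the open question:)

```python
import numpy as np
import numexpr as ne

x = np.random.rand(1000000)
y = np.random.rand(1000000)

# numexpr compiles the expression string to its own bytecode and evaluates it
# in a multithreaded VM, avoiding the temporaries plain numpy would allocate
result = ne.evaluate("exp(x) - y**2")
```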

jchodera commented 7 years ago

Thanks! This is super helpful!

jchodera commented 7 years ago

No need to spend time on this, but since someone asked: check out the optimizer-speedup branch (which prints some output about how much time is spent in the fzero() call) and run examples/autoprotocol/single-wavelength-probe-assay.py if you want to try it out.

@maxentile may be totally right that the time-consuming part may not be the target function computation. I hadn't even considered that!

I had tried adding an LRU cache to store the most recent 128 or 1024 evaluations, but that did not seem to improve performance.
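(The caching attempt presumably looked something like the sketch below; numpy arrays are unhashable, so the key has to be a tuple. A plausible reason it didn't help, though this is a guess: a root finder almost never revisits a bit-for-bit identical point, so the cache rarely hits.)

```python
from functools import lru_cache
import numpy as np

def _residuals(x):
    # stand-in for the real pure-numpy residual computation
    return x - 1.0

@lru_cache(maxsize=128)
def _cached_residuals(x_key):
    # numpy arrays are unhashable, so the cache key is a tuple of floats
    return _residuals(np.array(x_key))

def ftarget(x):
    return _cached_residuals(tuple(x))
```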

pgrinaway commented 7 years ago

Ok, I've profiled that script, and the top two functions that are causing slowness are:

1) values() from WeakValueDictionary

2) ftarget (this is actually 47% of the total time, though its "own" time is 8.5%). One of the big consumers there is new_method, which is something deep in PyMC.

I will post the profiling results in full later, but this does suggest that ftarget could stand to be optimized.
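(For anyone reproducing these numbers, the cumulative-vs-"own"-time distinction above maps onto cProfile's "cumulative" and "tottime" sort keys; something like the following, where run_analysis() is just a stand-in for executing examples/autoprotocol/single-wavelength-probe-assay.py:)

```python
import cProfile
import pstats

def run_analysis():
    # stand-in workload for the actual example script
    sum(i * i for i in range(100000))

cProfile.run("run_analysis()", "assay.prof")

stats = pstats.Stats("assay.prof")
stats.sort_stats("cumulative").print_stats(20)  # time including callees (the ~47% figure)
stats.sort_stats("tottime").print_stats(20)     # time in the function body itself (the ~8.5% figure)
```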

pgrinaway commented 7 years ago

Ah, new_method is a function inside the function create_uni_method in CommonDeterministics. Not sure why it gets called every time ftarget does.

pgrinaway commented 7 years ago

Sorry for the spam, but it seems to me like it is bits of PyMC that are quite slow, and that the ftarget function itself might not be so bad.

jchodera commented 7 years ago

That's really odd. The ftarget function should be called many times for each call to equilibrium_concentrations (which is called once inside a couple of deterministics). pymc may be doing some weird wrapping of things, or perhaps a weird unwrapping of things where pymc variables are being substituted for floats deep inside the ftarget function.

I wish I knew more about how pymc worked, since this suggests that we could speed things up by unwrapping the input pymc variables once and feeding them to a pre-constructed function that only dealt in numpy floats to do the optimization with pre-computed matrices.
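(A sketch of that "unwrap once" idea under pymc 2 conventions; all the names here are hypothetical, not assaytools code. The deterministic coerces its parents to plain floats immediately and hands everything to a solver that never sees pymc objects.)

```python
import numpy as np
import pymc

# Hypothetical model pieces, just to make the sketch self-contained:
DeltaG = pymc.Uniform('DeltaG', lower=-20.0, upper=0.0)          # log K of a single reaction
log_total = pymc.Uniform('log_total', lower=-12.0, upper=-3.0)   # log total concentration

def solve_equilibrium(log_K, log_total):
    # stand-in for a pure-numpy root solve on precomputed matrices
    return np.array([log_K + log_total, log_total])

@pymc.deterministic
def log_concentrations(log_K=DeltaG, log_total=log_total):
    # pymc passes parent *values* here; coerce to plain floats right away
    # so nothing downstream ever touches a pymc object
    return solve_equilibrium(float(log_K), float(log_total))
```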

In other news, trying to figure out how to recode this all for tensorflow or edward is also not proving easy, since tensorflow compute graphs can't have nodes that are iterative operations like function solvers.

pgrinaway commented 7 years ago

> That's really odd. The ftarget function should be called many times for each call to equilibrium_concentrations

It is called several times more than equilibrium_concentrations, you're correct.

> pymc may be doing some weird wrapping of things, or perhaps a weird unwrapping of things where pymc variables are being substituted for floats deep inside the ftarget function.

I suspect this is the case. It seems like some sort of pymc variable is being instantiated each call, which, according to profiling, is extremely expensive.

> In other news, trying to figure out how to recode this all for tensorflow or edward is also not proving easy, since tensorflow compute graphs can't have nodes that are iterative operations like function solvers.

Right, that's unfortunate. What about just coding it up in regular Python with numpy? It would probably be relatively fast, pythonic, nicely extensible using just Python, etc., and if some further optimization is necessary, numpy plays very nicely with Cython.
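(A sketch of what that pure-Python/numpy version could look like, assuming a preprocessing step has already produced a stoichiometry matrix, log equilibrium constants, a 0/1 conservation matrix, and log totals; the names and shapes are illustrative, not the library's API.)

```python
import numpy as np

def make_target(S, log_K, C, log_totals):
    """Return a pure-numpy target suitable for scipy.optimize.root.

    S          : (n_reactions, n_species) stoichiometry matrix
    log_K      : (n_reactions,) log equilibrium constants
    C          : (n_conservation, n_species) 0/1 membership matrix
    log_totals : (n_conservation,) log total concentrations
    """
    def ftarget(x):
        # x holds log concentrations; everything below is vectorized numpy
        c = np.exp(x)
        f_reactions = S.dot(x) - log_K
        f_conservation = np.log(C.dot(c)) - log_totals
        return np.concatenate([f_reactions, f_conservation])
    return ftarget
```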