Open jchodera opened 7 years ago
@pgrinaway @gregoryross @maxentile : If you had any advice on how to optimize the target function with minimal effort, I'd be very appreciative of suggestions!
Taking a look at this now.
Thanks!
I realized that even just encoding the quantities as numpy arrays outside of the target function will likely lead to a big speedup, but I was wondering if there was some sort of numba or other thing that might trivially speed this up.
Ok, the most obvious way of optimizing this would be to implement it in Cython, but there are a few tricky points:

- The code appends to lists--we wouldn't want to do that in Cython, since that would require manipulating Python objects (as opposed to numpy arrays, which are well-represented in C).
- The code uses dictionaries heavily. This would also diminish performance under Cython, because again it would just be calling the Python interpreter.
- The function would have to be extracted from its closure, and accept all of the data it needs (with types fully specified) as parameters. This is not very difficult.
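To illustrate that last point, here is a sketch (with made-up names and a made-up two-reaction model, not the actual assaytools code) of going from a dict/list-appending closure to a flat function over typed numpy arrays:

```python
import numpy as np

def make_target_closure(stoichiometry, log_K):
    # Original style: a closure over dicts that appends to a list --
    # hard for Cython/numba to compile without calling back into Python.
    def target(log_c):
        residuals = []
        for name, coeffs in stoichiometry.items():
            residuals.append(np.dot(coeffs, log_c) - log_K[name])
        return np.array(residuals)
    return target

def target_arrays(log_c, S, log_K):
    # Refactored style: all data arrives as typed numpy arrays, so the
    # body is one vectorized expression that compiles cleanly.
    return S @ log_c - log_K

# The two forms agree on the same (made-up) inputs:
stoich = {'r1': np.array([1.0, 1.0]), 'r2': np.array([1.0, -1.0])}
logK = {'r1': -5.0, 'r2': 2.0}
S = np.vstack([stoich['r1'], stoich['r2']])
K = np.array([logK['r1'], logK['r2']])
x = np.array([0.5, 0.25])
assert np.allclose(make_target_closure(stoich, logK)(x), target_arrays(x, S, K))
```

The dict-to-array flattening happens once, outside the hot loop, which is exactly the preprocessing discussed below.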
It's not substantially easier with numba--with both numba and Cython you'll be able to take this code and directly compile it (I think--numba might not be happy with the closure, I've never tried it), but you probably won't see much performance increase.
So, it would require a bit of re-thinking how the function works (which might even in pure python/numpy make the code faster). I can take a stab at this, if you want.
I don't think numba would be trivial (to get a speedup), but it's easy to try.
I think the key to speeding things up is to speed up just the target function that gets called many times by the root finder. It accepts a numpy array and returns two numpy arrays. I should be able to move some logic outside the target function and do something to speed up the computation inside the target.
We can still have the binding model function accept and return dicts, but it can do some preprocessing so that the target function runs as fast as possible on numpy arrays.
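A minimal sketch of that split, with hypothetical names and a made-up one-reaction model (the real `GeneralBindingModel` data layout isn't assumed here): the dict-facing function flattens everything to arrays once, and the target the root finder calls repeatedly touches only arrays.

```python
import numpy as np

def make_fast_target(reactions, species):
    # One-time preprocessing: fix a species ordering and build a
    # stoichiometry matrix and log-equilibrium-constant vector.
    names = sorted(species)
    index = {name: i for i, name in enumerate(names)}
    S = np.zeros((len(reactions), len(names)))
    log_K = np.zeros(len(reactions))
    for row, (logk, stoich) in enumerate(reactions):
        log_K[row] = logk
        for name, coeff in stoich.items():
            S[row, index[name]] = coeff

    def target(log_c):
        # Hot path: pure array arithmetic, no dict lookups.
        return S @ log_c - log_K

    return target, names

# Made-up P + L <-> PL association reaction:
reactions = [(-10.0, {'L': -1, 'P': -1, 'PL': 1})]
target, order = make_fast_target(reactions, {'L', 'P', 'PL'})
residuals = target(np.zeros(3))
```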
Also took a look!
Yeah, numba doesn't know what to do with dicts (code using dictionaries will run in "object mode" / as slow as interpreted code), so those will need to be flattened out into arrays beforehand for numba to have any effect...
@jchodera : Could you point to a few input instances to optimize for? On the input in the doctest, the PyCharm profiler says the most time-consuming functions are in BLAS, so I think loop-jitting might not produce a substantial speed-up on that instance. However, if there are a lot more `reactions`, `species`, or `conservation_equations`, the interpreter overhead would be more substantial.
Here is the description of the calculation in equations, for reference:
http://assaytools.readthedocs.io/en/latest/theory.html#general-binding-model
Whoops, let me post the branch name and info on how to run the example shortly. Sorry about that!
I do realize we don't want to have any dict processing inside the target function. I can do preprocessing outside the target and feed numpy arrays in. But is that the best I can do?
> I think the key to speeding things up is to speed up just the target function that gets called many times by the root finder. It accepts a numpy array and returns two numpy arrays. I should be able to move some logic outside the target function and do something to speed up the computation inside the target.
Right, this is what I meant. If we deal with dicts and resizing lists inside the target function, it's going to be pretty difficult to optimize.
> I do realize we don't want to have any dict processing inside the target function. I can do preprocessing outside the target and feed numpy arrays in. But is that the best I can do?
Once you've done that, you can give numba a try. But my experience with numba is that you end up having to write essentially Cythonic/C-style code to get the substantial speed-up: see what we ended up with in perses. All types are fully specified, and the numba compiler is even instructed to fail if it can't avoid using the interpreter. I spent a bit of time gradually optimizing those routines, and ended up with what you see in that link.
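For concreteness, a sketch of that style in nopython mode. The residual form and all names here are hypothetical (an equilibrium block plus a conservation-of-mass block), and the numba import is guarded so the snippet still runs where numba isn't installed:

```python
import numpy as np

try:
    from numba import njit  # njit compiles in nopython mode: it raises
                             # rather than silently using the interpreter
except ImportError:
    def njit(func=None, **kwargs):  # no-op stand-in so the sketch runs anywhere
        return func if func is not None else (lambda f: f)

@njit
def residual(log_c, S, log_K, C, c_tot):
    # Fully typed array arithmetic only: equilibrium residuals plus
    # conservation-of-mass residuals (made-up model structure).
    r_eq = S @ log_c - log_K
    r_mass = C @ np.exp(log_c) - c_tot
    return np.concatenate((r_eq, r_mass))

# Tiny made-up instance: one reaction, one conservation law.
S = np.array([[1.0, -1.0]])
log_K = np.array([0.0])
C = np.array([[1.0, 1.0]])
c_tot = np.array([2.0])
r = residual(np.zeros(2), S, log_K, C, c_tot)
```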
Same with Cython--you can see what I did in yank. I also gradually made the code more and more C-like to get maximum performance.
At first glance it might not work, but you might be able to use `numexpr` somewhere. I suspect that might take as much effort as refactoring for Cython, though.
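If it helps, the numexpr pattern would look something like this (hypothetical function, with a plain-numpy fallback when numexpr isn't installed):

```python
import numpy as np

try:
    import numexpr as ne

    def concentrations(log_c):
        # numexpr compiles the elementwise expression and evaluates it
        # in one (possibly multithreaded) pass over the array
        return ne.evaluate('exp(log_c)')
except ImportError:
    def concentrations(log_c):
        return np.exp(log_c)
```

One caveat: numexpr pays off mainly on large arrays, and the concentration vectors in a typical binding model are small, which fits the suspicion above that it may not be worth the effort here.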
Thanks! This is super helpful!
No need to spend time with this, but since someone asked, you want to check out the `optimizer-speedup` branch (which prints some output about how much time is spent in the `fzero()` call), and try running `examples/autoprotocol/single-wavelength-probe-assay.py` if you want to try it out.
@maxentile may be totally right that the time-consuming part may not be the target function computation. I hadn't even considered that!
I had tried adding an LRU cache to store the most recent 128 or 1024 evaluations, but that did not seem to improve performance.
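For reference, a sketch of that kind of cache (all names hypothetical): numpy arrays aren't hashable, so the key has to be the array's raw bytes. One plausible reason it didn't help is that a root finder almost never re-evaluates exactly the same floating-point point, so the cache never hits.

```python
import numpy as np
from functools import lru_cache

def cached_target(target, maxsize=128):
    # Wrap a target function with an LRU cache keyed on the raw bytes of
    # the input array (arrays themselves are unhashable).
    @lru_cache(maxsize=maxsize)
    def _cached(key, shape):
        x = np.frombuffer(key).reshape(shape)
        return target(x)

    def wrapper(x):
        return _cached(x.tobytes(), x.shape)

    return wrapper

calls = []
def f(x):
    calls.append(1)   # count how often the real target runs
    return x ** 2

g = cached_target(f)
g(np.array([1.0, 2.0]))
g(np.array([1.0, 2.0]))  # bit-identical input: served from the cache
```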
Ok, I've profiled that script, and the top two functions that are causing slowness are:

1. `values` from `WeakValueDictionary`
2. `ftarget` (this is actually 47% of the total time, though its "own" time is 8.5%). One of the big consumers there is `new_method`, which is something deep in PyMC.

I will post the profiling results in full later, but this does suggest that `ftarget` could stand to be optimized.
Ah, `new_method` is a function inside the function `create_uni_method` in `CommonDeterministics`. Not sure why it gets called every time `ftarget` does.
Sorry for the spam, but it seems to me like it is bits of PyMC that are quite slow, and that the `ftarget` function itself might not be so bad.
That's really odd. The `ftarget` function should be called many times each call to `equilibrium_concentrations` (which is called once inside a couple of deterministics). pymc may be doing some weird wrapping of things, or perhaps a weird unwrapping of things where pymc variables are being substituted for floats deep inside the `ftarget` function.
I wish I knew more about how pymc worked, since this suggests that we could speed things up by unwrapping the input pymc variables once and feeding them to a pre-constructed function that only dealt in numpy floats to do the optimization with pre-computed matrices.
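Something like the following "unwrap once" pattern is what that would look like. Everything here is hypothetical: `value` is just a stand-in for however a wrapped sampler variable exposes its float (no specific pymc API is assumed), and the model is a made-up linear one.

```python
import numpy as np

def unwrap(params):
    # Touch the wrapper objects exactly once, extracting plain floats;
    # bare numbers pass through unchanged.
    return np.array([getattr(p, 'value', p) for p in params], dtype=float)

def make_solver_input(params, S):
    log_K = unwrap(params)          # one-time unwrapping
    def target(log_c):              # hot loop: plain numpy only
        return S @ log_c - log_K
    return target

class FakeVariable:                 # stand-in for a wrapped sampler variable
    def __init__(self, value):
        self.value = value

target = make_solver_input([FakeVariable(2.0), 3.0], np.eye(2))
```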
In other news, trying to figure out how to recode this all for tensorflow or edward is also not proving easy, since tensorflow compute graphs can't have nodes that are iterative operations like function solvers.
> That's really odd. The `ftarget` function should be called many times each call to `equilibrium_concentrations`

It is called several times more than `equilibrium_concentrations`, you're correct.
> pymc may be doing some weird wrapping of things, or perhaps a weird unwrapping of things where pymc variables are being substituted for floats deep inside the `ftarget` function.
I suspect this is the case. It seems like some sort of pymc variable is being instantiated each call, which, according to profiling, is extremely expensive.
> In other news, trying to figure out how to recode this all for tensorflow or edward is also not proving easy, since tensorflow compute graphs can't have nodes that are iterative operations like function solvers.
Right, that's unfortunate. What about just coding it up in regular Python with `numpy`? It would probably be relatively fast, pythonic, nicely extensible using just Python, etc., and if some further optimization is necessary, `numpy` plays very nicely with Cython.
From some profiling tests, it looks like root solving in the `GeneralBindingModel` is the dominant cost of the new autoprotocol-based assaytools changes, with somewhere around 76% of total sampling time spent in calls to `scipy.optimize.root` here. I imagine we could greatly speed this up by rewriting the target function using some optimization scheme.
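For what it's worth, one common speedup at exactly this level is handing `scipy.optimize.root` an analytic Jacobian, so the solver doesn't spend extra target evaluations on finite differences. A sketch with made-up stoichiometry and totals (not the actual `GeneralBindingModel` equations):

```python
import numpy as np
from scipy.optimize import root

# Made-up P + L <-> PL system in log-concentration variables.
S = np.array([[1.0, 1.0, -1.0]])      # stoichiometry of the one reaction
log_K = np.array([-5.0])              # log equilibrium constant
C = np.array([[1.0, 0.0, 1.0],        # total protein conservation
              [0.0, 1.0, 1.0]])       # total ligand conservation
c_tot = np.array([1e-3, 1e-3])        # made-up totals (M)

def target(log_c):
    # Equilibrium residuals stacked with conservation-of-mass residuals.
    return np.concatenate((S @ log_c - log_K, C @ np.exp(log_c) - c_tot))

def jacobian(log_c):
    # d(S @ log_c)/d log_c = S; d(C @ exp(log_c))/d log_c_j = C[:, j] * exp(log_c_j)
    return np.vstack((S, C * np.exp(log_c)))

sol = root(target, np.log(np.full(3, 1e-4)), jac=jacobian, method='hybr')
```

Whether this helps in practice depends on how much of the 76% is spent inside the target versus inside the solver's own finite-difference and linear-algebra machinery.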