bd-j / forcepho

Generative modeling galaxy photometry for JWST
https://forcepho.readthedocs.io

numba optimizations for gaussians #22

Closed by lgarrison 5 years ago

lgarrison commented 5 years ago

This PR implements a number of exploratory optimizations for convert_to_gaussians(), get_gaussian_gradients(), and compute_gaussian() using Numba.

First, there are a lot of NumPy linear-algebra operations (inverse, trace, determinant, etc.) on 2x2 matrices; these can be optimized pretty effectively by writing out all the terms that need to be computed and compiling that with Numba. Those are implemented as the "fast_*_2x2" functions in gaussmodel.py; the fast matrix inverse has the biggest impact.
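To illustrate the idea (a minimal sketch, not the PR's actual fast_*_2x2 code): for a 2x2 matrix the inverse can be written in closed form and jitted, which sidesteps the per-call overhead of np.linalg.inv on tiny arrays.

```python
import numpy as np
from numba import njit

@njit
def inv_2x2(A):
    """Closed-form inverse of a 2x2 matrix (illustrative sketch)."""
    a, b = A[0, 0], A[0, 1]
    c, d = A[1, 0], A[1, 1]
    det = a * d - b * c
    out = np.empty((2, 2))
    out[0, 0] = d / det
    out[0, 1] = -b / det
    out[1, 0] = -c / det
    out[1, 1] = a / det
    return out
```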

In convert_to_gaussians() and get_gaussian_gradients(), there's some preamble that heavily uses the forcepho object model, followed by the inner loops. The object-oriented code is difficult to handle in Numba, so in an effort to at least compile the loops, I've separated those inner parts into their own functions (called _convert_to_gaussians() and _get_gaussian_gradients()). It does require passing an annoying number of parameters, though. The performance is good: in both routines, the "loop functions" now take less than 10% of the total runtime. The majority of the remaining runtime in both functions comes from SciPy's SmoothBivariateSpline interpolation. We could think about trying to re-implement that in Numba, although I think SciPy is already calling a pre-compiled routine internally.
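Schematically, the split looks something like the sketch below (names, attributes, and arguments are made up for illustration, not the real forcepho signatures): the object-model access stays in plain Python, and only an array-level inner loop is handed to Numba.

```python
import numpy as np
from numba import njit

@njit
def _inner_loop(amps, flux):
    # purely numerical work on plain arrays, which Numba can compile
    for i in range(amps.shape[0]):
        amps[i] *= flux
    return amps

def convert_sketch(source):
    # untyped attribute access on the object model stays here, outside the jit
    amps = np.asarray(source.amplitudes, dtype=np.float64).copy()
    return _inner_loop(amps, source.flux)
```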

To avoid having to pass back lots of return arrays from the "loop functions", I've changed the ImageGaussian container class to a jitclass that we can manipulate with Numba (it works like a C struct). This doesn't seem to break the non-Numba parts of the code I've tested, but @bd-j will have to say whether this is an allowable change to the data model.
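For reference, a jitclass looks roughly like this (the field names here are placeholders, not the actual ImageGaussian members); each attribute gets an explicit Numba type, so jitted code can read and write instances directly:

```python
from numba import float64
from numba.experimental import jitclass  # `from numba import jitclass` on older releases

# placeholder field spec, not the real ImageGaussian layout
spec = [
    ("amp", float64),
    ("xcen", float64),
    ("ycen", float64),
    ("fxx", float64),
    ("fxy", float64),
    ("fyy", float64),
]

@jitclass(spec)
class ImageGaussianSketch:
    def __init__(self):
        self.amp = 0.0
        self.xcen = 0.0
        self.ycen = 0.0
        self.fxx = 0.0
        self.fxy = 0.0
        self.fyy = 0.0
```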

Another benefit of using a jitclass is that compute_gaussian() can be jitted without too much trouble. I've turned off oversampling for now since that doesn't seem to be a complete feature and complicates the jitting.
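The payoff is that a jitted pixel loop can take the jitclass instance directly. A toy version, reusing the ImageGaussianSketch above (the quadratic form here is illustrative, not the PR's actual kernel):

```python
import numpy as np
from numba import njit

@njit
def compute_gaussian_sketch(g, xpix, ypix):
    # g is a jitclass instance, so Numba knows its attribute types at compile time
    dx = xpix - g.xcen
    dy = ypix - g.ycen
    q = g.fxx * dx * dx + 2.0 * g.fxy * dx * dy + g.fyy * dy * dy
    return g.amp * np.exp(-0.5 * q)
```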

You'll definitely want to use the latest version of Numba when testing this! Every Numba release brings loads of bugfixes and performance improvements; I'm using version 0.43 from pip.

Some performance numbers, using some of the testing harness from @bd-j's timing.py script (but the %timeit IPython magic timer instead of time.time()):

Old:

convert_to_gaussians(): 266 µs ± 2.09 µs
get_gaussian_gradients(): 911 µs ± 9.95 µs
compute_gaussian(): 74.1 µs ± 1.68 µs

New:

convert_to_gaussians(): 104 µs ± 1.23 µs
get_gaussian_gradients(): 209 µs ± 1.17 µs 
compute_gaussian(): 20.6 µs ± 90.1 ns
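For reference, the numbers above were collected in an IPython session roughly like this (the arguments are placeholders; the real setup comes from timing.py):

```
In [1]: %timeit convert_to_gaussians(...)
In [2]: %timeit get_gaussian_gradients(...)
In [3]: %timeit compute_gaussian(...)
```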

There are definitely more potential gains, e.g. from parallelization, but I thought I'd stop here and see whether the interface/data model changes are acceptable at this point! It would also be good to know more about typical values of ngauss, npix, ngal, nstamp, etc., to know which of the un-compiled loops might need compiling.

bd-j commented 5 years ago

Awesome, thanks @lgarrison!! I'll take a closer look, but the changes to the interface and data model seem minor and acceptable, though I don't know what making ImageGaussian a jitclass might affect. The factor ~5 speedup in get_gaussian_gradients is very nice indeed. I'm also pleasantly surprised that compute_gaussian could be sped up so much.

My experience with SciPy interpolation methods is that, with care, they can be sped up quite a bit, but I haven't looked at BivariateSpline in particular.

I'll add a requirements.txt for the numba version info.
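Something like the following, assuming 0.43 is taken as the minimum that works (exact pin TBD):

```
# requirements.txt
numba>=0.43
```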