cedadev / cis

Home of the Community Intercomparison Suite.
www.cistools.net
GNU Lesser General Public License v3.0

Ungridded to ungridded performance #9

Open duncanwp opened 7 years ago

duncanwp commented 7 years ago

This originated here: https://jira.ceh.ac.uk/browse/JASCIS-305

One way to speed up collocation (and perhaps other operations) is to output the set of values as a ragged array, then perform the kernel as an array operation.

So, for example, output a numpy array whose length is the number of output values, but whose elements are themselves numpy arrays of the values to operate on. The kernel could then be applied to those values as an array-wise numpy operation, as sketched below. I'm not sure how much time this would save, but it could be much quicker in some situations.
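For illustration, a minimal sketch of the idea (the mean kernel and the flattened layout are just one way of doing it, not existing CIS code):

```python
import numpy as np

# Hypothetical per-sample value sets, i.e. what the constraint step would
# return for each output (sample) point.
values_per_sample = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0, 6.0])]

# Flatten into one array plus the start index of each sample's segment.
flat = np.concatenate(values_per_sample)
counts = np.array([len(v) for v in values_per_sample])
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))

# A mean kernel then becomes a couple of array operations over all samples at
# once, instead of a Python-level loop calling the kernel per sample point.
means = np.add.reduceat(flat, starts) / counts
print(means)  # [1.5 3.  5. ]
```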

duncanwp commented 7 years ago

My current thinking is that we would be better off converting the sample-point loop to compiled code, either through Cython or PyPy (or Numba?), rather than trying to vectorise it, because I don't think vectorising will give us the flexibility we want. A rough Numba sketch is below.
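This is only an illustration of the shape of the compiled loop - the real constraint and kernel interfaces are more involved than this radius/mean stand-in:

```python
import numpy as np
from numba import njit

@njit
def radius_mean(sample_x, data_x, data_vals, radius):
    # JIT-compiled loop over sample points: for each one, average every data
    # value within `radius` (a stand-in for the real constraint + kernel).
    out = np.full(sample_x.shape[0], np.nan)
    for i in range(sample_x.shape[0]):
        total = 0.0
        count = 0
        for j in range(data_x.shape[0]):
            if abs(data_x[j] - sample_x[i]) <= radius:
                total += data_vals[j]
                count += 1
        if count > 0:
            out[i] = total / count
    return out
```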

duncanwp commented 7 years ago

Profiling an example CloudSat onto aircraft collocation shows that the biggest bottleneck is actually the constraint step within the loop. Underneath we use the query_ball_point method to find the nearby points for each sample point, but the query_ball_tree method should be much faster because we pass it all of the sample points in one go. This moves the constraint step outside the sample-point loop.

I've created a branch for this work here: https://github.com/cedadev/cis/tree/optimize_ug_col
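For reference, the shape of the change is roughly this (the positions, radius and mean kernel are made up for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

data_points = np.random.rand(100000, 2)   # hypothetical data positions
sample_points = np.random.rand(5000, 2)   # hypothetical sample positions
data_vals = np.random.rand(100000)

data_tree = cKDTree(data_points)
sample_tree = cKDTree(sample_points)

# One call returns, for every sample point, the indices of all data points
# within the constraint radius - the constraint is no longer inside the loop.
neighbours = sample_tree.query_ball_tree(data_tree, r=0.01)

# The remaining loop only applies the kernel to each (small) constrained set.
means = np.array([data_vals[idx].mean() if idx else np.nan for idx in neighbours])
```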

duncanwp commented 7 years ago

I've had a bit more of a play with this today. Using query_ball_tree does help (a speed-up of about 50%) but not as much as I'd hoped - the bottleneck just gets moved to the loop over the constrained points.

I've had a look at ways of dealing with the staggered (jagged) array you end up with. I think a scipy sparse array is probably the way to handle it, but how to construct it, and then actually use it to calculate the kernel values, needs a bit more thought - perhaps something like the sketch below.
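One possible way to go from the jagged neighbour lists to a sparse representation, again assuming a plain mean kernel just for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Jagged result as returned by query_ball_tree: one list of data indices per sample point.
neighbours = [[0, 3], [2], [1, 3, 4]]
data_vals = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Build a sparse (n_samples x n_data) membership matrix straight from the jagged lists.
counts = np.array([len(idx) for idx in neighbours])
indptr = np.concatenate(([0], np.cumsum(counts)))
indices = np.concatenate(neighbours)
weights = csr_matrix((np.ones(counts.sum()), indices, indptr),
                     shape=(len(neighbours), len(data_vals)))

# A mean kernel is then a single sparse matrix-vector product.
means = (weights @ data_vals) / counts
print(means)  # [25.         30.         36.66666667]
```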

The other option is to calculate the whole sparse_distance_matrix in the first place. I've created an implementation of this, but it's very slow once there are many constrained points to calculate distances for. We'll need it when calculating weighted averages anyway (see the sketch below), so I'll probably concentrate on making it faster - using Cython for the main loop, I think.
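A rough sketch of how the sparse distance matrix could feed a weighted average (inverse-distance weights are just an example - the actual weighting we need may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

sample_points = np.random.rand(1000, 2)   # hypothetical positions
data_points = np.random.rand(20000, 2)
data_vals = np.random.rand(20000)

sample_tree = cKDTree(sample_points)
data_tree = cKDTree(data_points)

# Sparse (n_samples x n_data) matrix of distances for every pair within the radius.
dists = sample_tree.sparse_distance_matrix(data_tree, max_distance=0.05).tocsr()

# Reuse the sparsity structure for inverse-distance weights.
weights = dists.copy()
weights.data = 1.0 / np.maximum(weights.data, 1e-12)  # guard against zero distances

# Weighted average per sample point from two sparse reductions.
weighted_sums = np.asarray(weights @ data_vals).ravel()
weight_totals = np.asarray(weights.sum(axis=1)).ravel()
with np.errstate(invalid='ignore'):
    means = weighted_sums / weight_totals  # NaN where a sample point has no neighbours
```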