GeoStat-Framework / PyKrige

Kriging Toolkit for Python
https://pykrige.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Working with a large grid size #2

Closed rth closed 9 years ago

rth commented 9 years ago

Hello,

thank you so much for working on this and making it available on GitHub!

I would like to use this module with a moderately large grid size, of the order of nx=1000, ny=1000, and with a number of points n=100. Given that UniversalKriging solves the linear system A x = b, where A is, I believe, an array of shape (nx, ny, n, n), this currently requires roughly nx*ny*n**2*8/1e9 = 80 GB of RAM in 64-bit floats simply to define that matrix, and consequently I'm running out of memory on my laptop.
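For reference, the arithmetic above as a quick check (just the estimate, assuming float64 storage and that (nx, ny, n, n) shape, which is my guess at the internals):

```python
# Back-of-the-envelope memory estimate for a dense float64 array of shape
# (nx, ny, n, n); the shape itself is an assumption about the internals.
nx, ny, n = 1000, 1000, 100
mem_gb = nx * ny * n**2 * 8 / 1e9   # 8 bytes per float64
print(mem_gb)                        # 80.0 GB
```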

I was wondering if it would be possible to optimise this code so it is more memory efficient? For instance, one option would be to port the execute method in uk.py to Cython, so it uses fewer temporary numpy arrays for indexing, etc., which would save some memory. However, at the end of the day, it is still necessary to define that same A matrix (which is not sparse as far as I can tell) for the linear system. I suppose it's not possible to do a domain decomposition on the grid (decompose it into regions and do the calculation region by region), right? Is there anything else that could be attempted? I can do some optimisation, I would just need some advice about where to start since I don't know much about kriging.

Thanks,

bsmurphy commented 9 years ago

Interesting, I hadn't actually anticipated such an issue when I redesigned the code to solve the whole system with a single massive array. But of course, such a large array would indeed take up too much memory. The previous version of the code (v 1.0.1ish) would probably have worked for you, as it actually loops through each grid point and sets up a different Ax=b system to solve each time. The problem of course is that you have to loop through each point in the grid, which in your case is quite large.

Regarding the kriging business, you're right about the A matrix, it isn't sparse. Working through subsections of the grid would be reasonable to cut down on the massive-array issue, as long as you of course incorporate all of the data each time. So you'd maybe have to loop through solving a couple of giant matrix equations with array A of shape (nx_sub, ny_sub, n, n). (Also, I think that's the right shape, although I can't recall exactly what I did right off the top of my head right now... The idea is that you have an n x n matrix defined at each of the grid points and each of those matrices is the A matrix in an Ax=b matrix equation at each grid point.) This could possibly be a good solution, as it may be faster than actually having to loop through each point in the (large) grid and still avoid using too much memory.

I'll take a look at the code again later this week and see what changes (i.e., provide the option to loop through grid points, enable sub-grid workarounds, or something else) I can make to alleviate this problem. If you need to get to it soon, feeding in sub-grids might be a good way to go. As I said, as long as you feed in all the data to the sub-domains, you should be good. The code shouldn't care if your data are inside or outside of your defined grid.
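Something along these lines, for example (just a sketch of the sub-grid workaround; UK is a UniversalKriging instance built from the full data set, the data here are placeholders, and the chunk size is arbitrary):

```python
import numpy as np
from pykrige.uk import UniversalKriging

# placeholder data; in practice these are the ~100 real observations
x, y, z_obs = np.random.rand(3, 100)
UK = UniversalKriging(x, y, z_obs, variogram_model='linear')

grid_x = np.linspace(0.0, 1.0, 1000)
grid_y = np.linspace(0.0, 1.0, 1000)
chunk = 10  # grid_y rows per sub-grid; tune so nx * chunk * n**2 * 8 bytes fits in RAM

z = np.empty((grid_y.size, grid_x.size))
ss = np.empty_like(z)
for start in range(0, grid_y.size, chunk):
    stop = min(start + chunk, grid_y.size)
    # all of the data stay in UK; only the target grid is chunked
    z_sub, ss_sub = UK.execute('grid', grid_x, grid_y[start:stop])
    z[start:stop, :] = z_sub
    ss[start:stop, :] = ss_sub
```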

And thanks for your interest in the code! Really glad to see that people are actually using this.

bsmurphy commented 9 years ago

Actually it would've been v 0.2.0 I think, which you might be able to get directly from the PyPI website if you want to try the loop-through version.

rth commented 9 years ago

Thanks a lot for the explanations, it makes more sense now! I just submitted a PR with a fix that reuses the previous version with a loop and speeds it up a bit. Here are some results for different optimizations of your code:

| Version | RAM usage | nx=150, ny=150, npt=100 | nx=50, ny=50, npt=300 |
| --- | --- | --- | --- |
| Original | `O(nx*ny*npt**2)` | 7.303 s | 8.714 s |
| 1. Pure Python optimization (in PR) | `O(nx*ny)` | 0.45 s | 0.1 s |
| 2. Cython optimized (mem_opt branch) | `O(nx*ny)` | 0.079 s | 0.068 s |
| 3. Vectorized Python version (in PR) | `O(nx*ny*npt)` | 0.086 s | 0.05 s |
  1. is the PR above (pure Python with a loop over grid elements).
  2. is a Cython-optimized version in the mem_opt branch. Essentially, it is the same version with a loop, rewritten in Cython and with low-level calls to BLAS through scipy.linalg for the matrix-vector product A^(-1) b. This gives a speedup of 2x to 6x compared to the optimized pure-Python version.

    This version can be called with

           z, ss = UK.cexecute('grid', grid_x, grid_y, backend='C')

    I can do a PR for this one too, although it is only a partial implementation that needs to be extended/tested more. Furthermore, it might make things more complex to install, since setup.py would then need Cython and a C compiler, so at this point I don't think that a PR to master is a good idea. If you are interested, though, I can do a pull request to a new branch if you create one. Or any other way you want to handle this.

  3. Vectorized Python version (in PR): the same as 1., but using numpy functions without the loop over grid points. Similar performance to 2., but uses much more memory (a rough sketch of the idea follows below).
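The idea behind the vectorized version is roughly the following (a simplified ordinary-kriging sketch with placeholder names such as `variogram`, `xy_data`, `values`, `xy_grid`, not the actual PR code): since A depends only on the data locations, it is identical at every grid point, and only the right-hand side changes, so all grid points can be handled with a single solve against a stacked b.

```python
import numpy as np
import scipy.linalg
from scipy.spatial.distance import cdist

def ok_all_data_vectorized(xy_data, values, xy_grid, variogram):
    """Ordinary kriging of all grid points at once, using all of the data."""
    n = xy_data.shape[0]
    # A depends only on the data locations, so it is built (and factored) once
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(cdist(xy_data, xy_data))
    A[n, n] = 0.0
    # one right-hand-side column per grid point -> memory ~ O(n * n_grid)
    b = np.ones((n + 1, xy_grid.shape[0]))
    b[:n, :] = variogram(cdist(xy_data, xy_grid))
    w = scipy.linalg.solve(A, b)       # kriging weights + Lagrange multiplier
    return values @ w[:n, :]           # estimate at every grid point
```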

Still, even with version 3, it would take roughly 2 min to process a case with nx=3000, ny=3000, npt=200, which is quite a lot, and the time to solution seems to be quadratic in npt.

Edit: added case 3 above.

rth commented 9 years ago

I have been thinking about this, and while the above optimisations do speed things up a bit, they do not solve the fundamental problem: the scaling as the problem size increases just doesn't seem right.

For instance, say we have n data points and want to use Ordinary Kriging to compute the value at a single grid point. We then define the matrix A of shape (n+1, n+1) and solve the system A*x = b for this grid point (whether with scipy.linalg.solve or by inverting the A matrix beforehand), which has a time complexity of O(n^2), and no matter how you optimise it, this is going to be slow if, say, n=1000. At the same time, intuitively, to calculate the value at a given grid point it shouldn't be necessary to take all the data points into account; building a matrix A with only the k closest data points should be sufficient, no? The problem would then become local rather than global as it is now, and the scaling/execution time would be much improved.

The logic behind this argument is that a number of software products are able to do Ordinary Kriging of large grids in a matter of seconds, while with the current implementation in this module, even if it were written in a lower-level language such as C, that won't be possible, since the bottleneck is the solution of the system A*x=b, as mentioned above. For instance, in HPGPL (see the GSLIB Benchmark section) they define Search Radius = (20, 20, 20) and Max Neighbourhoods = 12 parameters, which seems to confirm this logic. I would use HPGPL except that they ship a reference BLAS/LAPACK implementation which you have to compile, while relying on an optimised BLAS/LAPACK (e.g. OpenBLAS, etc.) through the scipy interface is much simpler to use and can potentially be faster.

Do you think it is fine to define the A matrix only for the k nearest points to the current grid point? If so, I can look into the implementation, which should be only a minor modification of the current approach...
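As an illustration of what I mean, a rough ordinary-kriging sketch using only the k nearest data points per grid point (placeholder names, not the current module code):

```python
import numpy as np
import scipy.linalg
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

def ok_k_nearest(xy_data, values, xy_grid, variogram, k=12):
    """Local ordinary kriging: only the k nearest data points per grid point."""
    tree = cKDTree(xy_data)
    dist, idx = tree.query(xy_grid, k=k)       # (n_grid, k) neighbours per grid point
    z = np.empty(xy_grid.shape[0])
    for i in range(xy_grid.shape[0]):
        pts = xy_data[idx[i]]
        A = np.ones((k + 1, k + 1))
        A[:k, :k] = variogram(cdist(pts, pts))
        A[k, k] = 0.0
        b = np.ones(k + 1)
        b[:k] = variogram(dist[i])
        w = scipy.linalg.solve(A, b)           # small (k+1) x (k+1) system
        z[i] = values[idx[i]] @ w[:k]
    return z
```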

bsmurphy commented 9 years ago

Kriging with a moving window/neighborhood is certainly a thing that can be done, although there are a few subtleties that would need to be taken into account. I haven't implemented it yet because I personally have never used that approach and it has never seemed to be an urgent need for other people I talked to in developing the code. Also, I have always used kriging for statistically contouring data; as Kitanidis (see the reference in any of the main *.py files) notes, using a moving neighborhood for contouring data is not necessarily a good idea. Of course, N=1000 points is certainly on the upper extreme of a typical kriging problem, so this may be a case in which a moving neighborhood is warranted.

For kriging with a moving window to make computational sense, you'd first have to carefully select your variogram model. We'd be assuming that the variogram is spatially stationary (not a bad assumption, and we're already doing that when kriging with all of the data), and we'd have to make sure that the fit of the model variogram to the experimental variogram is better at shorter lags (shorter distances). We'd also want to choose the variogram such that the calculated weights decrease somewhat rapidly with distance and are quite small at the edges of the window. Maintaining somewhat large weights for points on the very edge of the window could possibly lead to weird discontinuities appearing in the final kriged grid (see Kitanidis; if you have a point that's just within the window and a point that's just outside the window, you can probably imagine that a jump in how they're weighted would cause some oddities in the kriging answer). This all means that you'd probably want to use a spherical, gaussian, or exponential variogram model with a relatively short range and a large sill (i.e., the semivariance plateaus to a large value relatively quickly, or equivalently the spatial covariance drops off quickly) or a linear variogram with a relatively large slope (again, semivariance grows quickly/spatial covariance drops off quickly). In real life, the details of the variogram model don't make that big a difference, but I think the choice of variogram model and parameters might be more important in this case.
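To illustrate the point about the variogram shape, here is a quick sketch with one common gaussian parameterization (the exact form used in the code may differ slightly): with a short range, the semivariance has essentially reached the sill well inside a window a few ranges wide, so points near the window edge carry very little weight.

```python
import numpy as np

def gaussian_variogram(h, sill=1.0, range_=2.0, nugget=0.0):
    # one common gaussian parameterization; the exact form in the code may differ
    return nugget + sill * (1.0 - np.exp(-(h / range_) ** 2))

h = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
print(gaussian_variogram(h))
# -> roughly [0.06, 0.22, 0.63, 1.00, 1.00]: the semivariance has essentially
#    reached the sill well before h = 10
```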

Also, there would be no guarantees that the kriging equations at each grid point would have anything in common when kriging with a moving window. When kriging with all the data, at least the A matrix is the same everywhere, so we can just invert that once. When kriging with a moving window, there's no guarantee that A is the same, or even the same size, at any two grid points. We would then have to set up a separate kriging system at each grid point, and I'm not sure how that would work as a vectorized operation in Python. I also don't know how looping through a large grid and setting up/solving the entirety of a smaller kriging system at each point would compare computationally to solving for everything at once with a single direct call to underlying C/Fortran code. I'm sure you have more thoughts on how to make the actual computations work than I do...

Anyways, adding the capability to krige with a moving neighborhood wouldn't necessarily be a bad thing. Ideally, we'd have the option to either use the nearest N points for kriging or use all points within a distance of X from the current grid point (or use the nearest points within a distance of X, up to a total number N). The former would at least keep the size of the kriging system consistent from grid point to grid point, but both features would be desirable.

rth commented 9 years ago

Thank you for these very helpful comments. I am not very familiar with the mathematical background for kriging so it is good to know. I completely agree that this should be an optional feature that can be activated so it does not break the current functionality.

In terms of the implementation, I was thinking of doing the following for Ordinary Kriging on a moving window:

1. Start from the version with the loop (as it's the simplest to work with), so we have the n data points and some grid.
2. Calculate the full kriging matrix A for all the data points.
3. In every loop iteration, select only the rows/columns of A that correspond to data points within a radius X of the grid point (let's start with that, and add "the nearest points within a distance of X, up to a total number N" afterwards).
4. Solve the small system A_selection*x = b_selection with scipy.linalg.solve.

It is indeed annoying that the matrix A_selection would change from step to step, so we couldn't reuse a precomputed inverse, but if the system is small this would still be much faster than computations with the full matrix.

Let me know whether you think this makes sense for kriging with a moving window.
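Roughly, steps 2 to 4 would look something like this (an ordinary-kriging sketch with placeholder names; distances to the grid point are computed inside the loop rather than stored as a full data-to-grid matrix):

```python
import numpy as np
import scipy.linalg
from scipy.spatial.distance import cdist

def ok_moving_window(xy_data, values, xy_grid, variogram, radius):
    n = xy_data.shape[0]
    A_full = variogram(cdist(xy_data, xy_data))   # step 2: full kriging matrix, built once
    z = np.full(xy_grid.shape[0], np.nan)
    for i in range(xy_grid.shape[0]):
        d = np.hypot(xy_data[:, 0] - xy_grid[i, 0],
                     xy_data[:, 1] - xy_grid[i, 1])
        sel = np.flatnonzero(d <= radius)         # step 3: data within radius X
        if sel.size == 0:
            continue                              # no neighbours -> leave NaN
        m = sel.size
        A = np.ones((m + 1, m + 1))
        A[:m, :m] = A_full[np.ix_(sel, sel)]      # reuse rows/columns of the full A
        A[m, m] = 0.0
        b = np.ones(m + 1)
        b[:m] = variogram(d[sel])
        w = scipy.linalg.solve(A, b)              # step 4: small local solve
        z[i] = values[sel] @ w[:m]
    return z
```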

The next step would be to port this version with a loop to Cython for speed. Having looked at various optimizations for this, if we want something that is fast, not too difficult to read, and doesn't use lots of memory, porting the loop to C with Cython seems to be the optimal solution. It does not change the installation procedure with python setup.py install, but merely adds Cython as a dependency for this module (which should be included by default with most scientific Python distributions...). Would you be OK with accepting Cython extensions for this module in a pull request? In any case, it would just be an alternative backend alongside the Python version with a loop.