GeoStat-Framework / PyKrige

Kriging Toolkit for Python
https://pykrige.readthedocs.io
BSD 3-Clause "New" or "Revised" License

MemoryError: Unable to allocate array with shape (114671, 114671) and data type float64 #166

Closed: robo-warrior closed this issue 4 years ago

robo-warrior commented 4 years ago

I get the following error: MemoryError: Unable to allocate array with shape (114671, 114671) and data type float64

Defining Ordinary Kriging as:

import numpy as np
from pykrige.ok import OrdinaryKriging

gridx = np.arange(min_x, max_x, 1)
gridy = np.arange(min_y, max_y, 1)

# Ordinary Kriging
OK = OrdinaryKriging(x, y, z, variogram_model='exponential',
                     verbose=False, enable_plotting=True,
                     coordinates_type='geographic')

z1, ss = OK.execute('grid', gridx, gridy)

where min_x = 8084396, min_y = 12073405, max_x = 8084864, and max_y = 12073894.

I understand that the gridx and gridy arrays are too big. What can I do in this case to make this work?

hoax-killer commented 4 years ago

I have also faced this issue in the past. Is this a limitation of OK/PyKrige?

MuellerSeb commented 4 years ago

I guess your input arrays x, y, z have the shape (114671,), correct? This is much too big for most RAM, so you get a MemoryError.

What could help in your case is to sample from this big amount of data and reduce it to about 10,000 to 20,000 data points:

import numpy as np

# Sample without replacement, so no data point is picked twice
# (duplicate coordinates can make the kriging matrix singular).
sample_size = 10000
choice = np.random.choice(np.arange(x.size), sample_size, replace=False)
x_smpl = x[choice]
y_smpl = y[choice]
z_smpl = z[choice]

Now you can use x_smpl, y_smpl and z_smpl instead of x, y and z for the OrdinaryKriging class.
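For example, a minimal sketch reusing the call from the original post:

# Fit the kriging model on the subsample instead of the full dataset.
OK = OrdinaryKriging(x_smpl, y_smpl, z_smpl, variogram_model='exponential',
                     verbose=False, enable_plotting=True,
                     coordinates_type='geographic')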

hoax-killer commented 4 years ago

Thanks @MuellerSeb for getting back.

> This is much too big for most RAM, so you get a MemoryError.

At least in my case it's not a RAM limitation. The system has about 396 GB of RAM. Moreover, I can read the entire file into RAM; the file isn't big either. numpy can also load the data into memory, and I am able to run almost all other methods (e.g. KNN, RF, SVC). The issue only arose when running OrdinaryKriging().

MuellerSeb commented 4 years ago

When using float64 for an array of size 114671 × 114671 (the cdist matrix), you end up with roughly 105 GB for that single array, which is the same order of magnitude as your available RAM. With numpy's overhead and a few additional arrays of this kind, RAM can become a problem, and the MemoryError you got states exactly that: the allocation failed. You could try setting n_closest_points in the execute call, so the full cdist matrix is not created.
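A minimal sketch of that suggestion (as far as I know, the moving-window mode requires backend='loop'; whether it combines with coordinates_type='geographic' is worth verifying against the current docs):

# Moving-window kriging: each estimate is solved from its nearest
# neighbors only, so the full cdist matrix is never allocated.
z1, ss = OK.execute('grid', gridx, gridy,
                    backend='loop', n_closest_points=50)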

hoax-killer commented 4 years ago

The memory error is raised on this line: https://github.com/GeoStat-Framework/PyKrige/blob/48bc43362218b283bcba51b2c0990b72f7e29fbb/pykrige/core.py#L75

More specifically, it occurs when executing lon1 - lon2.

I am not aware of the specifics of this operation, or why we calculate the difference between lon1 and lon2. The shapes of the two numpy arrays are:

lon1.shape
Out[2]: (1, 114671)
lon2.shape
Out[3]: (114671, 1)
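For context, broadcasting a (1, N) array against an (N, 1) array yields the full (N, N) pairwise difference matrix, which is exactly the 114671 × 114671 allocation from the traceback. A small demonstration with toy shapes:

import numpy as np

lon1 = np.zeros((1, 5))
lon2 = np.zeros((5, 1))
print((lon1 - lon2).shape)  # (5, 5): pairwise differences via broadcasting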

mjziebarth commented 4 years ago

To chime in, having written that part of the code:

That part of the code is a fairly simple implementation of the third equation in the section "Computational formulas" of the Wikipedia article on great-circle distance. It was written using a simple vectorized version of the equation, which creates a number of temporary arrays corresponding to the terms of the rather large equation.

If you are working so tightly at your RAM limit, these additional temporary arrays could be the icing on the cake. Apart from random subsampling, you could try to work in Euclidean space (see #149) if you don't explicitly need the great-circle distance at large distances. Specifically, this would mean computing Euclidean coordinates x, y, z from your latitudes and longitudes, and then kriging without the coordinates_type='geographic' option, as sketched below. Maybe that saves just enough temporary arrays to fit into your RAM.
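A minimal sketch of that embedding, assuming lon and lat hold the original coordinates in degrees; using PyKrige's 3D ordinary kriging class for the Euclidean step is my illustration, not necessarily the exact route discussed in #149:

import numpy as np
from pykrige.ok3d import OrdinaryKriging3D

# Embed the points on the unit sphere in R^3. Euclidean (chordal)
# distances between embedded points approximate great-circle
# distances well at short range.
lon_r, lat_r = np.radians(lon), np.radians(lat)
x3 = np.cos(lat_r) * np.cos(lon_r)
y3 = np.cos(lat_r) * np.sin(lon_r)
z3 = np.sin(lat_r)

# Krige with purely Euclidean distances (no coordinates_type='geographic').
OK3 = OrdinaryKriging3D(x3, y3, z3, z, variogram_model='exponential')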

Hope that helps!

hoax-killer commented 4 years ago

@mjziebarth Thanks for getting back.

> If you are working so tightly at your RAM limit

I have about 396 GB of RAM, much more than typical. Hence it was a bit surprising to hit the limit here, since we rarely do even with much larger datasets.

> It was written using a simple vectorized version of the equation, which creates a number of temporary arrays corresponding to the terms of the rather large equation.

TBH, I don't see anything wrong in your code; it's just a bit surprising that it's hitting the memory limits.

Since we are trying to benchmark, I was hesitant to change the parameters, but worst case we can switch to Euclidean space for all the benchmarking tests. We are using Google's pixel coordinate system anyway, so Euclidean distances might make more sense.

robo-warrior commented 4 years ago

Hi, using Euclidean coordinates seemed to work for us. Thank you all for the super quick responses!

MuellerSeb commented 4 years ago

That is really interesting! Thanks for sharing.