ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

Memory Allocation Error when Exporting Dataset to Numpy Array #248

Open lassiterdc opened 3 years ago

lassiterdc commented 3 years ago

The sliced data array is 0.002 megabytes but when I try to export it to a numpy array, it loads the entire unsliced data array into memory which causes a memory allocation error (which is in itself odd since the dataset is only 318MB). I'm not sure if this is a cfgrib issue or an xarray issue so let me know if this is the wrong forum.

MRMS precipitation data in grib2 format (318MB): https://www.dropbox.com/s/xq0vvamcydu3qn2/PrecipRate_20210524.grib2?dl=0

#%% Data processing
import xarray as xr

f_in_grib = "PrecipRate_20210524.grib2"

# engine="cfgrib" is an xarray.open_dataset argument, and chunked
# (lazy) loading requires xarray + dask
mrms = xr.open_dataset(f_in_grib, engine="cfgrib",
                       chunks={"latitude": 10, "longitude": 10})

# positional (isel-style) indexing at a single grid point
idx = dict(latitude=1808, longitude=5374)

data_at_gages = mrms[idx]

#%% inspection
print("The dataset 'data_at_gages' is {} megabytes.".format(data_at_gages.unknown.nbytes/1e6))
# The dataset 'data_at_gages' is 0.0021 megabytes.

print("The dimensions of the sliced data array are: {}".format(data_at_gages.unknown.shape))
# The dimensions of the sliced data array are: (525,)

#%% error
data_at_gages.unknown.to_numpy()  # (.values.to_numpy was a typo; either spelling triggers the error below)

# MemoryError: Unable to allocate 47.9 GiB for an array with shape (525, 3500, 7000) and data type float32
iainrussell commented 3 years ago

Hello @lassiterdc ,

I've had a look at this in the debugger, and it looks to me like cfgrib is not properly handling this case. It appears to always allocate enough space to hold all the data points, not just the lat/lon point you selected. And since you are accessing all the time steps in one go, this multiplies out to the 47.9 GiB it tries to allocate.

This is something we will need to fix, but it likely won't be done in the next few weeks. In the meantime you may be able to get the result you want by slicing or iterating over the time steps so that less memory is allocated at once. That is the key point: don't access all the time steps at once.
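The per-time-step workaround can be sketched as follows. This is a minimal illustration with a small synthetic cube standing in for the MRMS file (the real array is 525 x 3500 x 7000); the variable name `unknown` and the point indices are taken from the snippets above, everything else is assumed.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the MRMS dataset: (time, latitude, longitude)
ds = xr.Dataset(
    {"unknown": (("time", "latitude", "longitude"),
                 np.arange(24, dtype=np.float32).reshape(4, 3, 2))},
)

# Extract the point value one time step at a time, so only a single
# 2-D field needs to be materialised per iteration instead of the
# whole (time, lat, lon) cube.
values = [
    float(ds["unknown"].isel(time=t, latitude=1, longitude=0))
    for t in range(ds.sizes["time"])
]
print(values)
```

With the real file, the same loop over `time` keeps the peak allocation at one 3500 x 7000 field rather than 47.9 GiB.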

If you just want to get a time series at a point, you can also consider using Metview as another package. Here's the equivalent code in Metview's Python interface:

import metview as mv

g = mv.read("PrecipRate_20210524.grib2")

near = mv.nearest_gridpoint(g, 40, -100) # lat,lon
print(near)

vt = mv.valid_date(g)
print(vt)

Disclaimer: it's ok for me to suggest Metview, as it is also an ECMWF package!

I hope this helps a little. Best regards, Iain

lassiterdc commented 3 years ago

Thanks for the insight and the suggested workaround. I ended up being able to run `data_at_gages.to_dataframe()`, which maxed out my RAM but, for some reason, didn't trigger the memory allocation error. I'll spend more time with the Metview library, but I'm still keeping an eye out for a cfgrib update that'll let me slice and dice my grib2 data efficiently :)
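For reference, the `to_dataframe` route looks roughly like the sketch below. The data values and time coordinate are made up here; only the variable name `unknown` and the 1-D (time-only) shape of the sliced series come from the thread.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the sliced single-point series
ds = xr.Dataset(
    {"unknown": (("time",), np.array([0.5, 1.25, 0.0], dtype=np.float32))},
    coords={"time": np.arange(3)},
)

# Round-trip through pandas, then take the column as a NumPy array
df = ds.to_dataframe()
arr = df["unknown"].to_numpy()
print(arr)
```

Note that `to_dataframe()` on the full, unsliced dataset would still need to realise every value, which is consistent with it capping out RAM.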

derfasthirnlosenick commented 1 year ago

Hi! Has there been any update on this in the meantime? I'm running into exactly the same issue. Given that Metview is macOS and Linux only, I can't use it on this device....