CCI-Tools / cate

ESA CCI Toolbox (Cate)
MIT License

MemoryError when opening larger 0..360 datasets #716

Open JanisGailis opened 6 years ago

JanisGailis commented 6 years ago

When opening larger datasets that are defined on a 0..360 longitude grid, such as esa_msla_ext provided by Prosper, a MemoryError is raised on some machines (mine has 8 GB).

This is due to these lines:

https://github.com/CCI-Tools/cate/blob/a1e31c4673399a99d58786d1208c2ba13de138f2/cate/core/opimpl.py#L185-L193

Calling var.values tries to load the entire variable into memory as an np.ndarray. A more memory-safe approach would be to apply groupby to the variable recursively until only lat/lon, or even just lon, remains, and then do the value swapping. This is similar to how it is done in coregistration:

https://github.com/CCI-Tools/cate/blob/a1e31c4673399a99d58786d1208c2ba13de138f2/cate/ops/coregistration.py#L258

EDIT: Even when grouping down to lat/lon, don't call .values on the data variable, as this will result in the new dataset slowly being converted into an xarray dataset made up of many in-memory numpy arrays. Instead, do the conversion using some tricky indexing magic.
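
For illustration, one way the "indexing" idea could look is sketched below. This is not the fix that went into cate; the function name `shift_lon_lazily`, the variable name `sla`, and the 1-degree test grid are assumptions made up for the example. The sketch only reads the 1-D longitude coordinate and leaves the reordering of the data variables to xarray's own indexing.

```python
import numpy as np
import xarray as xr


def shift_lon_lazily(ds: xr.Dataset, lon_name: str = "lon") -> xr.Dataset:
    """Relabel a 0..360 longitude axis as -180..180 and reorder along it.

    Hypothetical helper: .values is only called on the 1-D coordinate,
    never on a data variable.
    """
    # The coordinate holds one value per grid column, so it is cheap to load.
    old_lon = ds[lon_name].values
    new_lon = ((old_lon + 180.0) % 360.0) - 180.0
    relabelled = ds.assign_coords({lon_name: (lon_name, new_lon)})
    # sortby reorders every variable along lon by integer indexing; for
    # dask-backed variables this should only extend the task graph.
    return relabelled.sortby(lon_name)


# Small made-up dataset on a 0..360 grid (names and sizes are assumptions).
lon = np.arange(0.5, 360.0, 1.0)
lat = np.arange(-89.5, 90.0, 1.0)
data = np.random.rand(2, lat.size, lon.size)
ds = xr.Dataset(
    {"sla": (("time", "lat", "lon"), data)},
    coords={"time": [0, 1], "lat": lat, "lon": lon},
)

out = shift_lon_lazily(ds)
```

In this example the column that was at lon=190.5 ends up at lon=-169.5. Whether the operation stays lazy in practice depends on the dataset being opened with dask chunks, which the sketch does not set up.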

JanisGailis commented 6 years ago

Assigning @forman to figure out how to approach this and to make sure it doesn't get lost in the noise.

forman commented 6 years ago

We came across this issue recently when trying to open CCI Land Cover with dim size (lon=120000, lat=60000) prepared for the Copernicus Climate Data Store. Their convention is 0 <= lon < 360! OMG.
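
To put that grid size in perspective, here is a rough back-of-the-envelope calculation of what a full in-memory load of a single 2-D variable would need. The dtypes are assumptions, since the thread does not say how the variable is stored, and `full_load_gib` is a hypothetical helper for this example only.

```python
# Grid dimensions quoted in the comment above.
LON_SIZE = 120_000
LAT_SIZE = 60_000

cells = LON_SIZE * LAT_SIZE  # number of grid cells in one 2-D slice


def full_load_gib(bytes_per_cell: int) -> float:
    """Memory in GiB needed to hold one full 2-D slice as a dense array."""
    return cells * bytes_per_cell / 2**30


# Assumed dtypes, for comparison only.
for name, size in {"uint8": 1, "float32": 4, "float64": 8}.items():
    print(f"{name}: {full_load_gib(size):.1f} GiB")
```

Even at one byte per cell, a single slice comes close to the 8 GB of the machine mentioned in the original report.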

forman commented 6 years ago

@JanisGailis happy to discuss a solution with you early next week.

JanisGailis commented 6 years ago

Sure, I'm quite sure what the problem is.