Closed HelixPiano closed 1 year ago
Hello @HelixPiano,
Thanks for the report. I could reproduce the high memory usage of the `df.max()` call using a large GRIB file that I have here. However, I then converted the GRIB file to NetCDF format and tried the same thing with this NetCDF file, which uses plain xarray and not cfgrib, and the memory profile was similar (in fact the NetCDF version used more memory than the GRIB version).
I used ecCodes to perform the conversion:
grib_to_netcdf global_wind_2020_12.grib -o global_wind_2020_12.nc
So from this I'd have to conclude that cfgrib is not the culprit here, but xarray itself might be loading all the values arrays into memory at once in order to compute the maximum. Are you able to confirm this? If so, we should close this issue, and maybe you can raise one in xarray itself.
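If xarray is indeed materialising everything to compute the maximum, one user-side workaround might be to open the dataset with dask chunks (e.g. `xr.open_dataset("129.grb", engine="cfgrib", chunks={"time": 1000})`; the chunk size is just a guess), so the reduction runs block by block. A minimal sketch of that idea, using plain NumPy rather than xarray or dask internals:

```python
import numpy as np

def blockwise_max(values, block=1000):
    """Compute the maximum block-by-block along the first axis, so peak
    memory is bounded by one block. This is roughly what a chunked
    (dask-backed) xarray reduction does, sketched without dask."""
    current = -np.inf
    for start in range(0, values.shape[0], block):
        current = max(current, float(values[start:start + block].max()))
    return current

data = np.random.rand(30316).astype("float32")  # stand-in for the GRIB values
assert blockwise_max(data) == float(data.max())
```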
Cheers, Iain
Thanks for reporting. I'm closing this now, and we can re-open it, or open a new one if we have a case where we can confirm that NetCDF does not show the same issue.
I think this problem is the same as #70. I also hit it: when t > 8.00, the file offset becomes -5.
I encountered the same issue when indexing t > 1100; the offset in `FileStreamItems` becomes -5.
After some troubleshooting, I think the size of the `long*` type on Windows might be the root cause. When a large GRIB file is read for the first time, a 4-byte `long` pointer `value_p` is created in `gribapi.grib_get_long(msgid, key)` and, once the offset exceeds 2**31, it wraps around and ends up as -5. This value becomes the offset into the large GRIB file and is returned to `messages.Message.message_get(self, item, key_type=None, default=_MARKER)`. The offsets continue to come back as -5 in `messages.FileStreamItems.__iter__()` and are stored in the index (whether in files or in RAM). When the GRIB file is actually read to fetch values, an `OSError: [Errno 22] Invalid argument` is raised.
However, -5 is already what `lib.grib_get_long()` returns inside `gribapi.grib_get_long(msgid, key)`, and I am not able to troubleshoot further than that. A potential fix might be an explicit `long long*` declaration, or some other way to upgrade to a 64-bit integer pointer.
For now, using a smaller GRIB file or switching to Linux works around the problem.
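For illustration, the wrap-around described above can be reproduced in pure Python: reinterpreting a file offset past 2**31 as a signed 32-bit integer (the size of a C `long` on Windows, even in 64-bit builds) yields a negative number, and an offset of 2**32 - 5 comes out as exactly -5. This is a sketch of the arithmetic only, not cfgrib's actual code:

```python
import struct

def as_signed_32bit(offset):
    """Reinterpret a non-negative file offset as a signed 32-bit integer,
    mimicking storage in a 4-byte C `long` on Windows (LLP64 model)."""
    return struct.unpack("<i", struct.pack("<I", offset & 0xFFFFFFFF))[0]

print(as_signed_32bit(2**31 - 1))  # 2147483647: last offset that still fits
print(as_signed_32bit(2**32 - 5))  # -5: an offset near 4 GiB wraps negative
```

So an index entry of -5 corresponds to a message sitting close to the 4 GiB mark, which matches the symptom appearing only with large GRIB files.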
What happened?
Hello everyone, I am not sure if this is a bug in xarray or a bug in cfgrib, so I will cross-post it.
I have a GRIB file with dimensions 30316x160x392, dtype float32, and a file size of around 3.7 GB.
df= xr.open_dataset("129.grb", engine="cfgrib")
works initially. The problem is that when I call df.max() it maxes out the RAM of my PC and fails to return any result. RAM usage before the df.max() call: 3.5/16 GB. If I run
df = xr.load_dataset("129.grb", engine="cfgrib")
instead, I get an error message.
What are the steps to reproduce the bug?
-
Version
0.9.10.3
Platform (OS and architecture)
Windows 10 Pro
Relevant log output
Accompanying data
No response
Organisation
No response