SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Iris cannot read NetCDF4 strings stored in NetCDF variables #4101

Open alastair-gemmell opened 3 years ago

alastair-gemmell commented 3 years ago

On behalf of an Iris User:

I'm having some trouble reading/writing NetCDF files with Iris 2.4 for datasets with string type variables. I thought I should bring what I've found to your attention and also in the hope that you might have solutions (the solution that I've found would require a small tweak to Iris' NetCDF reading - so not a viable solution for me currently). I’ve added code to demonstrate at the end of this email.

We need to read/write NetCDF files for meteorological station data. These include a list of station names, stored in a NetCDF variable of length equal to the number of stations. This corresponds to a data dimension for observations, structured in a 2D array, station index by time (not included in the example code). There are two possible approaches: using old-style NetCDF character/byte arrays (undesirable, as this does not support special characters in our international station database) or NetCDF4-style Unicode strings.
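For illustration only, the difference between the two storage styles can be sketched in NumPy alone. This is not taken from the reporter's code: it assumes the names are held as a NumPy unicode array, and it uses UTF-8 encoding as one possible way to fit non-ASCII names into a byte array. Whether that would suit the files described here is an open question.

```python
import numpy as np

# Station names as a fixed-width unicode array (dtype kind 'U').
names = np.array(["Exeter", "London", "Düsseldorf"])

# A plain cast to a byte-string dtype uses ASCII, so the umlaut is
# expected to raise UnicodeEncodeError.
try:
    names.astype("S")
    ascii_cast_ok = True
except UnicodeEncodeError:
    ascii_cast_ok = False

# Assumption: explicit UTF-8 encoding as a possible byte-array route.
encoded = np.char.encode(names, "utf-8")    # dtype kind 'S'
decoded = np.char.decode(encoded, "utf-8")  # back to dtype kind 'U'

print(encoded.dtype, decoded.dtype, ascii_cast_ok)
```

The round trip preserves the names, but every reader of such a file would need to know that the bytes are UTF-8, which is part of why native string variables are attractive.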

We can create an example cube as follows (you might need Python 3.7 for the ü in Düsseldorf):

    station_names = np.array([u'Exeter', u'London', u'Düsseldorf'])
    station_cube = iris.cube.Cube(station_names, long_name='station_names')

To save this cube I need to give iris.save a fill_value to use. The save also doesn’t work if the string data are stored in a masked array.

The fill value is again a problem when loading from the saved NetCDF file (see traceback at the end of this email).

The failing line in iris.fileformats.netcdf._get_cf_var_data reads:

    fill_value = getattr(cf_var.cf_data, '_FillValue',
                         netCDF4.default_fillvals[cf_var.dtype.str[1:]])

Problems arise in two places:

  1. netCDF4.default_fillvals contains no default entry for string types other than the S1 (non-unicode) dtype. This is the same problem that we had when saving without a fill_value argument set.
  2. cf_var.dtype.str[1:] fails because the cf_var.dtype for the loaded string data is of type str, which does not have an str attribute.
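Both failure modes can be reproduced without Iris or a NetCDF file. The sketch below is only an illustration: `FILLVALS` is a hand-copied stand-in for a few entries of `netCDF4.default_fillvals`, and `lookup_default_fill` is a hypothetical helper that mimics the lookup on the failing line.

```python
import numpy as np

# Assumption: a small stand-in for netCDF4.default_fillvals, keyed by the
# dtype code with the byte-order character stripped.
FILLVALS = {"S1": "\x00", "i4": -2147483647, "f8": 9.969209968386869e36}


def lookup_default_fill(dtype):
    """Mimic the failing lookup: table[dtype.str[1:]]."""
    return FILLVALS[dtype.str[1:]]


# A numeric dtype resolves normally.
numeric_fill = lookup_default_fill(np.dtype("f8"))

# Problem 1: a unicode dtype has no entry in the table.
try:
    lookup_default_fill(np.dtype("U10"))
except KeyError as err:
    problem_1 = repr(err)

# Problem 2: the Python type `str` is not a numpy dtype and has no `.str`.
try:
    lookup_default_fill(str)
except AttributeError as err:
    problem_2 = str(err)

print(problem_1)
print(problem_2)
```

The second message matches the final line of the traceback reported in this issue.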

I tried a nasty hack to stop Iris from looking for a default fill_value at the failing line. This works around the problem and the cube loads without issue. This clearly isn't a viable solution for me to implement, and I’m sure that I’m missing other complexities.
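The hack itself is not shown in the report. One way such a guard might look, purely as a sketch, is a lookup that returns `None` instead of raising when no default exists. `safe_default_fill` and `FILLVALS` are hypothetical names, and the table is a stand-in for `netCDF4.default_fillvals`, not the real object.

```python
import numpy as np

# Assumption: stand-in for a few entries of netCDF4.default_fillvals.
FILLVALS = {"S1": "\x00", "i4": -2147483647, "f8": 9.969209968386869e36}


def safe_default_fill(dtype, table=FILLVALS):
    """Return a default fill value, or None if the dtype has none.

    Handles both a numpy dtype with no table entry and the plain Python
    type `str`, which has no `.str` attribute at all.
    """
    code = getattr(dtype, "str", None)
    if not isinstance(code, str):
        return None
    return table.get(code[1:])


print(safe_default_fill(np.dtype("i4")))   # numeric default from the table
print(safe_default_fill(np.dtype("U10")))  # None: no unicode entry
print(safe_default_fill(str))              # None: not a numpy dtype
```

Returning `None` only avoids the exception. Any code that later relies on the fill value, for example to build a mask, would still need to cope with its absence.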

I hope that this makes sense and is of some use to you. Our current solution involves over 10,000 individual NetCDF files, one for each station, as we can store Unicode strings in NetCDF attributes with no problem. The large overhead for I/O of lots of small NetCDF files is rather cumbersome in our application and for end users of the dataset.

Example code for station name I/O

    import numpy as np
    import iris

    filename = 'string_test.nc'

    # Set up a numpy array of station names to be saved. Umlaut may not work prior to Python 3.7.
    #station_names = np.array([u'Exeter', u'London', u'Düsseldorf'])
    station_names = np.array([u'Exeter', u'London', u'Dusseldorf'])

    # Make our cube to save - station_names cannot be a masked array or iris.save falls over
    station_cube = iris.cube.Cube(station_names, long_name='station_names')

    # Save and load to test. fill_value must be set or iris.save will fall over
    # (no corresponding data type in netCDF4.default_fillvals).
    iris.save(station_cube, filename, fill_value='N/A')

    # Reload data. This fails
    loaded_station_cube = iris.load_cube(filename)

This returns the following traceback:

    Traceback (most recent call last):
      File "", line 1, in
      File "/[path]/lib/python3.7/site-packages/iris/__init__.py", line 387, in load_cube
        cubes = _load_collection(uris, constraints, callback).cubes()
      File "/[path]/lib/python3.7/site-packages/iris/__init__.py", line 325, in _load_collection
        result = iris.cube._CubeFilterCollection.from_cubes(cubes, constraints)
      File "/[path]/lib/python3.7/site-packages/iris/cube.py", line 157, in from_cubes
        for cube in cubes:
      File "/[path]/lib/python3.7/site-packages/iris/__init__.py", line 312, in _generate_cubes
        for cube in iris.io.load_files(part_names, callback, constraints):
      File "/[path]/lib/python3.7/site-packages/iris/io/__init__.py", line 210, in load_files
        for cube in handling_format_spec.handler(fnames, callback):
      File "/[path]/lib/python3.7/site-packages/iris/fileformats/netcdf.py", line 714, in load_cubes
        cube = _load_cube(engine, cf, cf_var, filename)
      File "/[path]/lib/python3.7/site-packages/iris/fileformats/netcdf.py", line 524, in _load_cube
        data = _get_cf_var_data(cf_var, filename)
      File "/[path]/lib/python3.7/site-packages/iris/fileformats/netcdf.py", line 510, in _get_cf_var_data
        netCDF4.default_fillvals[cf_var.dtype.str[1:]])
    AttributeError: type object 'str' has no attribute 'str'

wjbenfold commented 2 years ago

This happens because doing cf_var.dtype gives us str rather than a numpy datatype. This causes issues in the save code, when it's determining the default fill value and when it's checking against itemsize of the dtype. It also causes issues in the load code once you've fixed the save code because the lookup fails similarly and if a naïve fix is applied it gives the dtype of the loaded cube as object rather than <U9 or whatever the original dtype was.
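The object-versus-unicode dtype point can be illustrated with NumPy alone. This is a sketch of one possible conversion, not the fix adopted in Iris: it assumes the loaded data arrives as an object array of Python strings, and that casting back to a fixed-width unicode dtype is acceptable.

```python
import numpy as np

# Assumption: a naive load yields Python str objects in an object array.
loaded = np.array(["Exeter", "London", "Düsseldorf"], dtype=object)

# Casting to str gives a fixed-width unicode dtype, sized to the longest
# element (10 characters here).
restored = loaded.astype(str)

print(loaded.dtype)    # object
print(restored.dtype)  # fixed-width unicode
```

The width comes from the values actually present, so it matches the original dtype only if the longest string survived the round trip unchanged.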

github-actions[bot] commented 11 months ago

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

github-actions[bot] commented 10 months ago

This stale issue has been automatically closed due to a lack of community activity.

If you still care about this issue, then please either:

christosvlahos commented 1 month ago

I am having similar issues trying to read some weather data from ecmwf. One of the variables is stored as string and even using the latest iris version gives me the same error. Is it something that might be included in a future version maybe?

pp-mo commented 1 month ago

Thanks for sharing, @christosvlahos. An external query is enough to make me think that this is worth looking into after all.

pp-mo commented 1 month ago

See updated observations here

christosvlahos commented 1 month ago

Thank you for opening this again!