SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Cannot save non-ASCII characters to NetCDF #5125

Open trexfeathers opened 1 year ago

trexfeathers commented 1 year ago

🐛 Bug Report

From @gavinevans

Attempting to save a Cube including a string AuxCoord with non-ASCII characters (i.e. Unicode characters) raises the following exception:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 0: ordinal not in range(128)

How To Reproduce

Steps to reproduce the behaviour:

import iris
from iris.coords import AuxCoord, DimCoord
from iris.cube import Cube

spot_index = DimCoord([0, 1], long_name='site_index', units=1)

station_name = AuxCoord(["Robièi", "Mühleberg"], long_name="station_name")
# This one works:
# station_name = AuxCoord(["Robiei", "Muhleberg"], long_name="station_name")

cube = Cube(
    [3, 4],
    dim_coords_and_dims=[(spot_index, 0)],
    aux_coords_and_dims=[(station_name, 0)]
)

iris.save(cube, "tmp.nc")

Expected behaviour

Should save with no exception (as happens when using the commented line above).

Environment

Additional context

Related:

I think the fix will hinge on allowing for the extra bytes needed to store encoded Unicode characters. We currently divide the length by 4, which I think means we always assume that a Unicode string can be converted to an ASCII one:

https://github.com/SciTools/iris/blob/fc302c9c08c292cb2075d2dd249bcbdfacf08da8/lib/iris/fileformats/netcdf/saver.py#L1881-L1883

Changing this could have loading consequences too?
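
To illustrate the byte-count concern above, here is a minimal NumPy-only sketch. It assumes the division by 4 relates to NumPy storing each Unicode character in 4 bytes, and it assumes UTF-8 as the target encoding; neither is confirmed by the saver code linked above. It only shows that the character count can be smaller than the encoded byte count.

```python
import numpy as np

# The station names from the reproducer, written with escapes so the
# code points are unambiguous (\u00e8 is "è", \u00fc is "ü").
names = np.array(["Robi\u00e8i", "M\u00fchleberg"])

# NumPy's Unicode dtype stores every character in 4 bytes, so dividing
# the item size by 4 gives the maximum number of *characters* per string.
chars_per_item = names.dtype.itemsize // 4

# UTF-8 needs more than one byte for each non-ASCII character, so the
# encoded length can exceed the character count.
utf8_lengths = [len(name.encode("utf-8")) for name in names]

print(chars_per_item, utf8_lengths)
```

With these names the longest string has 9 characters but needs 10 bytes in UTF-8, so a character array sized by character count alone would be one byte too short.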

Traceback with Iris v3.4:

```
Traceback (most recent call last):
  File ".../iris/lib/2023-01-03_gavin.py", line 17, in <module>
    iris.save(cube, "tmp.nc")
  File ".../iris/lib/iris/io/__init__.py", line 457, in save
    saver(source, target, **kwargs)
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 2754, in save
    sman.write(
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 755, in write
    self._add_aux_coords(cube, cf_var_cube, cube_dimensions)
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1088, in _add_aux_coords
    return self._add_inner_related_vars(
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1053, in _add_inner_related_vars
    cf_name = self._create_generic_cf_array_var(
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1917, in _create_generic_cf_array_var
    new_data[index_slice] = list(
UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 0: ordinal not in range(128)
```

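
The same kind of error can be triggered without Iris. The sketch below assumes that the failing step amounts to converting Unicode characters into a single-byte character dtype (`S1`); that is a guess based on the last traceback frame, not a reproduction of the saver code.

```python
import numpy as np

# ASCII-only characters convert to single bytes without any problem.
ascii_chars = np.array(list("Robiei"))
ascii_bytes = ascii_chars.astype("S1")

# A non-ASCII character (\u00e8 is "è") cannot be encoded as ASCII,
# so NumPy raises UnicodeEncodeError during the conversion.
unicode_chars = np.array(list("Robi\u00e8i"))
try:
    unicode_chars.astype("S1")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(ascii_bytes.tobytes(), type(error).__name__)
```
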
ESadek-MO commented 1 year ago

Hey @gavinevans, we're currently a bit low on resources. Is this something you'd be interested in working on?

github-actions[bot] commented 4 months ago

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.

gavinevans commented 4 months ago

This issue hasn't yet been resolved.

trexfeathers commented 4 months ago

This new activity has prompted a very useful discussion in @SciTools/peloton:

NetCDF only supports ASCII (i.e. every character must be 1 byte). Iris could do something with non-ASCII characters, but it would be Iris specific - no other library would know how to interpret it.

We're quite uncomfortable making an explicit decision here, since the Iris devs are not exposed to all the possible use cases. As there is no official convention, we would prefer individual users/teams to define their own encode/decode rules, because they alone know the specifics (e.g. how many bytes are needed). This would probably take the form of a bytes array (rather than a character array), with user-written functions to write and read correctly. @gavinevans @brhooper how does this sound?
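
As a rough idea of what such user-written functions might look like, here is a minimal NumPy-only sketch. The names `encode_utf8` and `decode_utf8` are hypothetical, UTF-8 is just one possible choice of encoding, and nothing here is part of Iris or the NetCDF conventions. Writing the resulting byte array to a file is left out.

```python
import numpy as np


def encode_utf8(strings):
    """Hypothetical helper: 1D strings -> 2D uint8 array (n_strings, max_bytes)."""
    encoded = np.char.encode(np.asarray(strings, dtype=str), "utf-8")
    width = encoded.dtype.itemsize  # bytes needed by the longest encoded string
    # Shorter strings are padded with trailing null bytes.
    return encoded.view(np.uint8).reshape(encoded.shape[0], width)


def decode_utf8(byte_array):
    """Hypothetical helper: inverse of encode_utf8, returns a 1D string array."""
    width = byte_array.shape[1]
    as_bytes = np.ascontiguousarray(byte_array).view(f"S{width}").ravel()
    # The fixed-width bytes dtype drops the trailing null padding on read.
    return np.char.decode(as_bytes, "utf-8")


# \u00e8 is "è" and \u00fc is "ü"
names = ["Robi\u00e8i", "M\u00fchleberg"]
data = encode_utf8(names)
round_trip = decode_utf8(data)
print(data.shape, list(round_trip))
```

The byte array is purely numeric, so it could be stored like any other integer data, but only code that knows the agreed encoding can turn it back into text.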

If anyone is aware of an 'official' convention that Iris should follow, please speak up 😊

larsbarring commented 4 months ago

I am not sure I understand what you mean with "NetCDF only supports ASCII (i.e. every character must be 1 byte)". Is the problem specific to string/char auxiliary coordinate values?

trexfeathers commented 4 months ago

> I am not sure I understand what you mean with "NetCDF only supports ASCII (i.e. every character must be 1 byte)". Is the problem specific to string/char auxiliary coordinate values?

We believe you can use Unicode in NetCDF names and in string attributes, but NOT in any data arrays.