SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
631 stars 283 forks source link

Iris incorrectly failing valid netcdf names #5929

Open ScottWales opened 5 months ago

ScottWales commented 5 months ago

🐛 Bug Report

Iris is incorrectly validating netcdf variable names, rejecting opening valid files

How To Reproduce

Steps to reproduce the behaviour:

import netCDF4 as nc
import iris

root = nc.Dataset("sample.nc", "w", format="NETCDF4")
# Valid variable name according to netCDF rules
root.createVariable("a(x)", "f4")
root.close()

cubes = iris.load("sample.nc")

Iris fails in iris/common/mixin.py", line 177 with error ValueError: 'a(x)' is not a valid NetCDF variable name.

Expected behaviour

Iris should safely load the variable with a valid NetCDF name.

Iris matches names against a regex at https://github.com/SciTools/iris/blob/main/lib/iris/common/metadata.py#L41-L43, referencing an outdated link to netCDF documentation. If validation is performed on reading NetCDF files it should follow the current format rules:

nc_name_re = re.compile(r"""
    ^([a-zA-Z0-9_]|[^\x00-\x7F]) # Start with ASCII word or multibyte unicode
    ([^\x00-\x1F/\x7F-\xFF]|[^\x00-\x7F])* # Continue with non-control ASCII or multibyte unicode
    (?<!\s)$ # Don't end with whitespace
    """,
    re.VERBOSE)

Alternately validation could be performed only when writing out files, or left to the NetCDF library.

Unidata give this guidance on netcdf names in the file format documentation:

Note on names: Earlier versions of the netCDF C-library reference implementation enforced a more restricted set of characters in creating new names, but permitted reading names containing arbitrary bytes. This specification extends the permitted characters in names to include multi-byte UTF-8 encoded Unicode and additional printing characters from the US-ASCII alphabet. The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_' (reserved for special names with meaning to implementations, such as the “_FillValue” attribute). Subsequent characters may also include printing special characters, except for '/' which is not allowed in names. Names that have trailing space characters are also not permitted.

The netCDF4 library will give an error if you try to create a variable with an invalid name, e.g. one that ends with whitespace:

import netCDF4 as nc
root = nc.Dataset("sample.nc", "w", format="NETCDF4")
# Invalid netCDF variable name, raises exception `RuntimeError: NetCDF: Name contains illegal characters`
root.createVariable("invalid ", "f4")
root.close()

Environment

Additional context

Full stack trace ``` $ python sample_iris.py 2>&1 | sed 's:".*/python:".../python:' Traceback (most recent call last): File "/scratch/hc46/saw562/tmp/pete/sample_iris.py", line 13, in cubes = iris.load("sample.nc") File ".../python3.10/site-packages/iris/__init__.py", line 324, in load return _load_collection(uris, constraints, callback).merged().cubes() File ".../python3.10/site-packages/iris/__init__.py", line 290, in _load_collection result = _CubeFilterCollection.from_cubes(cubes, constraints) File ".../python3.10/site-packages/iris/cube.py", line 106, in from_cubes for cube in cubes: File ".../python3.10/site-packages/iris/__init__.py", line 271, in _generate_cubes for cube in iris.io.load_files(part_names, callback, constraints): File ".../python3.10/site-packages/iris/io/__init__.py", line 236, in load_files for cube in handling_format_spec.handler( File ".../python3.10/site-packages/iris/fileformats/netcdf/loader.py", line 576, in load_cubes cube = _load_cube(engine, cf, cf_var, cf.filename) File ".../python3.10/site-packages/iris/fileformats/netcdf/loader.py", line 285, in _load_cube engine.activate() File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/engine.py", line 97, in activate run_actions(self) File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/actions.py", line 521, in run_actions action_default(engine) # This should run the default rules. File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/actions.py", line 74, in inner rule_name = func(engine, *args, **kwargs) File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/actions.py", line 90, in action_default hh.build_cube_metadata(engine) File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/helpers.py", line 397, in build_cube_metadata cube.var_name = cf_var.cf_name File ".../python3.10/site-packages/iris/common/mixin.py", line 177, in var_name raise ValueError(emsg.format(name)) ValueError: 'a(x)' is not a valid NetCDF variable name. ```
pp-mo commented 5 months ago

Thanks @ScottWales for raising this. I certainly do see what you mean, and I know what it sounds like, but in fact I think this error is more of a mis-speak than a total mistake. What it should really say is "is not a valid NetCDF -CF variable name". I.E. it is applying the rules given here

So, Iris generally follows CF conventions, and in many cases requires them. We consider "CF strictness" as generally quite an important feature of Iris, though we do also have quite a few workarounds for common violations (like invalid "standard_name" and "units" attributes, and dimension variables in "coordinates" attributes). The appropriate level of this has also been a subject of some considerable discussion in recent times.

Unfortunately, we don't have a workaround for this problem.
I think it might even be a bit tricky to devise a solid automatic scheme for replacing non-compliant variable names, since the CF rules are rather highly fussy, not allowing "%" for instance.

Obviously we can fix this error message to give the proper context, but it remains a fact that Iris is rather strict about this + it is hard to work around. If practical, you might consider using "ncdata" to fix this, which at least avoids a need to copy or modify the original file. It's very much a kind of problem which that was designed for. Something like this should work ...

from ncdata.netcdf4 import from_nc4
from ncdata.iris import to_iris
ncds = from_nc4("sample.nc")
ncds.variables.rename("a(x)", "a_x")
cubes = to_iris(ncds)
print(cubes)

I tested that on your minimal example + it does work for me.

Of course, in an actual case it may not be so simple, as you may need to correct references to the variable elsewhere, like e.g. data::coordinates = "a(x) b(y)" -- though I suppose that wouldn't be good CF either, so perhaps it's not a likely case.

Does this help at all @ScottWales ?

pp-mo commented 5 months ago

Ok update on this -- I was a bit wrong...

It seems that the current restrictions in Iris don't actually follow the CF rules, but something slightly similar, which is in fact the rules given for "Classic" NetCDF -- effectively NetCDF3. The statement for that (which I think the Iris regexp you noted is implementing) can now be found here in the current Netcdf User Guide : Permitted Characters in NetCDF Names

However, as also stated in that para, NetCDF4 removes most of those restrictions -- though quite what "as well as other special characters" means is still a bit in question. This is what we see in your experiments creating variables. Following that method, I find -- purely from experiment :

Actually though, I would still say that ideally Iris probably should be enforcing the CF rules -- which would actually be more restrictive than what we currently have.

In general , I'd suggest a reasonable goal that the cf-checker will not object to anything saved from Iris : As we don't currently have any special checks on save, that is just what we allow for setting a "cube.var_name", which is exactly the scope of the regexp and the error which you have noted.

Just to confuse things further, though, the CF community are currently discussing the possibility of allowing basically any valid netcdf names after all (!)

pp-mo commented 5 months ago

Iris probably should be enforcing the CF rules -- which would actually be more restrictive than what we currently have

( But practically, that would be a painful journey involving breaking changes + deprecations )