SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Error comparing cubes with string content. #5362

Open pp-mo opened 1 year ago

pp-mo commented 1 year ago

🐛 Bug Report

When cubes have string content (typically dtype 'S1', with a string-length dimension), cube comparison fails.

How To Reproduce

>>> import numpy as np
>>> from iris.cube import Cube
>>> cube1 = Cube(np.array([list('abc'), list('def')], dtype='S1'))
>>> print(cube1)
unknown / (unknown)                 (-- : 2; -- : 3)
>>> cube1.data
array([[b'a', b'b', b'c'],
       [b'd', b'e', b'f']], dtype='|S1')
>>>
>>> cube2 = cube1.copy()
>>> cube1 == cube2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/h05/itpp/git/iris/iris_main/lib/iris/cube.py", line 3672, in __eq__
    ).compute()
      ^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/base.py", line 314, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/base.py", line 599, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
              ^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 113, in _execute_task
    return [_execute_task(a, cache) for a in arg]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 113, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 113, in _execute_task
    return [_execute_task(a, cache) for a in arg]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 113, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 113, in _execute_task
    return [_execute_task(a, cache) for a in arg]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 113, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/utils.py", line 73, in apply
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/dask/array/core.py", line 4919, in _enforce_dtype
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in isclose
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/numpy/core/numeric.py", line 2374, in isclose
    dt = multiarray.result_type(y, 1.)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in result_type
TypeError: The DType <class 'numpy._FloatAbstractDType'> could not be promoted by <class 'numpy.dtype[bytes_]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[bytes_]'>, <class 'numpy._FloatAbstractDType'>)
>>> 

Expected behaviour

Clearly, this should succeed and return True.

Key info

Although it appears to be a failure of dask.array.allclose, I think this is really a numpy problem:

>>> np.all(cube1.data == cube2.data)
True
>>> np.allclose(cube1.data, cube2.data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 200, in allclose
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/numpy/core/numeric.py", line 2270, in allclose
    res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in isclose
  File "/tmp/persistent/newconda-envs/ncdata/lib/python3.11/site-packages/numpy/core/numeric.py", line 2374, in isclose
    dt = multiarray.result_type(y, 1.)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in result_type
TypeError: The DType <class 'numpy._FloatAbstractDType'> could not be promoted by <class 'numpy.dtype[bytes_]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[bytes_]'>, <class 'numpy._FloatAbstractDType'>)
>>> 

So, perhaps we need to special-case character data so it doesn't use 'allclose' for comparison.
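
For illustration, here is a minimal sketch of that kind of special case. The arrays_equal_lazy helper is hypothetical (not the actual Iris comparison code), and dtype kinds 'S' and 'U' are assumed to cover the character data in question:

import dask.array as da

def arrays_equal_lazy(a, b):
    # Hypothetical helper, not the real Iris code: compare two (possibly lazy)
    # arrays, avoiding 'allclose' for character data.
    a, b = da.asarray(a), da.asarray(b)
    if a.dtype.kind in "SU" or b.dtype.kind in "SU":
        # Exact comparison for byte-string / unicode data, since 'isclose'
        # cannot promote these dtypes against floats.
        result = da.all(a == b)
    else:
        # Tolerance-based comparison for numeric data, as at present.
        result = da.allclose(a, b, equal_nan=True)
    return bool(result.compute())

With the cubes above, arrays_equal_lazy(cube1.core_data(), cube2.core_data()) should then return True rather than raising.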

Environment

Iris: latest 'main' branch; Dask: 2023.5.0; NumPy: 1.24.2

pp-mo commented 1 year ago

I think there are some big questions over whether we really support cubes with string content at all, as this seems not to have worked for a long time: it didn't work for me at Iris 2.4.0 either (though with a different error). There is also: https://github.com/SciTools/iris/issues/4412

We do support string coordinates, of course, and save them to netCDF, because they are used for seasonal categories, e.g. here.
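
For example, a string-valued auxiliary coordinate of the kind the season categorisation produces; this is just a minimal standalone construction, not real data:

import numpy as np
from iris.coords import AuxCoord
from iris.cube import Cube

# A small cube with a string-valued auxiliary coordinate, similar to what
# iris.coord_categorisation.add_season attaches along the time dimension.
cube = Cube(np.arange(4.0), long_name="example_data")
cube.add_aux_coord(AuxCoord(["djf", "mam", "jja", "son"], long_name="season"), 0)
print(cube.coord("season").points)   # ['djf' 'mam' 'jja' 'son']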

pp-mo commented 2 months ago

I think there are some big questions over whether we really support cubes with string content at all, as this seems not to have worked for a long time: it didn't work for me at Iris 2.4.0 either (though with a different error). There is also: #4412

UPDATE (Aug 2024, Iris 3.9): in fact, I find that Iris doesn't handle the numpy fixed-width string dtypes like "S4" or "U7", because netcdf4-python doesn't either.

Instead, you are expected to use character arrays, dtype "S1" or "U1", containing fixed-length (padded) strings with a string-length dimension. That is standard practice for netcdf4-python and backwards-compatible with NetCDF3.
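
For reference, a numpy-only sketch of the conversion between the two layouts (netcdf4-python also provides the netCDF4.stringtochar / chartostring helpers for this); the shapes in the comments are for this example only:

import numpy as np

# Fixed-width strings, dtype 'S3' ...
strings = np.array([b"abc", b"def"], dtype="S3")            # shape (2,)

# ... reinterpreted character-by-character, adding a string-length dimension:
# the dtype 'S1' character-array form that is stored in netCDF files.
chars = strings.view("S1").reshape(strings.shape + (3,))    # shape (2, 3)

# And back again: join the character axis into fixed-width strings.
back = chars.view("S3").reshape(strings.shape)              # shape (2,)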

There are also variable-length strings (NetCDF "string" type). But CF does not seem to support these, and neither does Iris. That goes for all the variable-length types that were added in NetCDF4.
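
For contrast, this is roughly how such a variable-length string variable is created with raw netcdf4-python (the filename and variable name here are made up for illustration):

import numpy as np
import netCDF4

# A NetCDF4 variable-length ("string" type) variable, which Iris does not
# currently handle.
with netCDF4.Dataset("vlen_example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("x", 3)
    # Passing the Python str type as the datatype requests a vlen string variable.
    var = ds.createVariable("labels", str, ("x",))
    var[:] = np.array(["one", "two", "three"], dtype=object)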

pp-mo commented 1 month ago

variable-length strings (NetCDF "string" type) ... CF does not seem to support these, and neither does Iris

Just learnt that this is not so. In fact, CF does support "string" types in variables, since CF-1.8 (added here).

So I think this is definitely a live issue, and we should consider it.