SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

netCDF save of cube with lazy data uses wrong type #4445

Open MartinDix opened 2 years ago

MartinDix commented 2 years ago

🐛 Bug Report

I used regridding to calculate the overlap of an integer mask of a river basin with model grid cell areas. Both cubes are read from netCDF files so have lazy data. The result is calculated correctly but when saved to a netCDF file it has integer type rather than real, so truncates fractional overlap. If the data is forcibly realized before writing it works as expected.

How To Reproduce

This self-contained example creates area and mask cubes, then writes and rereads them so that both have lazy data, avoiding the need for any external files.

import iris, iris.analysis, iris.coords, iris.cube, numpy as np

nlon1 = nlat1 = 10
lon1 = iris.coords.DimCoord(np.linspace(0,360,nlon1,endpoint=False), units='degrees_east', standard_name='longitude')
lat1 = iris.coords.DimCoord(np.linspace(-90,90,nlat1), units='degrees_north',  standard_name='latitude')
lon1.guess_bounds()
lat1.guess_bounds()
area = iris.cube.Cube(np.ones((nlat1,nlon1)), dim_coords_and_dims=[(lat1,0),(lon1,1)])
iris.save(area, 'area.nc')

nlon2 = nlat2 = 8
lon2 = iris.coords.DimCoord(np.linspace(0,360,nlon2,endpoint=False), units='degrees_east', standard_name='longitude')
lat2 = iris.coords.DimCoord(np.linspace(-90,90,nlat2), units='degrees_north',  standard_name='latitude')
lon2.guess_bounds()
lat2.guess_bounds()
mask = iris.cube.Cube(np.ones((nlat2,nlon2), np.int32),
                           dim_coords_and_dims=[(lat2,0),(lon2,1)])
iris.save(mask, 'mask.nc')

area = iris.load_cube('area.nc')
mask = iris.load_cube('mask.nc')

cube = mask.regrid(area, iris.analysis.AreaWeighted(mdtol=0.5))

print('Lazy', cube.has_lazy_data())
tmp = cube.lazy_data()
print(tmp)
iris.save(cube,'test1.nc')

# Force data to be realized
print('DTYPE', cube.data.dtype)
print('Lazy', cube.has_lazy_data())
tmp = cube.lazy_data()
print(tmp)
iris.save(cube,'test2.nc')

Expected behaviour

The variable in test1.nc should be of type double rather than int. Note that the lazy array tmp reports the same incorrect dtype as the netCDF file.

Environment

Additional context

Script output.

Lazy True
dask.array<_regrid_area_weighted_array, shape=(10, 10), dtype=int32, chunksize=(10, 10), chunktype=numpy.ndarray>
DTYPE float64
Lazy False
dask.array<array, shape=(10, 10), dtype=float64, chunksize=(10, 10), chunktype=numpy.ndarray>
wjbenfold commented 2 years ago

Thanks for this, it's incredibly helpful to have a minimal broken example with a bug report.

Diagnosis:

@trexfeathers, @stephenworsley and I have discussed what's happening and are pretty sure it's the following:

Workaround:

Rather than fully realising the data, you can apply an operation to the lazy data that updates the dtype dask believes the lazy array has. I suggest:

cube.data = cube.lazy_data().astype(np.float64)

Introducing this after the regrid in the example resolves the issue (both the reported dtype of tmp and the type of the variable in test1.nc when it is loaded back into iris).
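The failure mode and the workaround can be sketched with dask alone, without iris. This is a minimal illustration, not iris's actual regridder: a hypothetical block function turns an integer mask into fractional overlaps, while the input's dtype is forwarded to map_blocks, as map_complete_blocks does.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for the regrid: ints in, floats out.
mask = da.ones((8, 8), dtype=np.int32)
overlap = da.map_blocks(lambda block: block * 0.25, mask, dtype=mask.dtype)

print(overlap.dtype)            # declared dtype: int32 (wrong)
print(overlap.compute().dtype)  # actual values: float64

# The workaround: casting the lazy array corrects the declared dtype
# without realising the data.
fixed = overlap.astype(np.float64)
print(fixed.dtype)              # float64
```

The cast is cheap because astype only adds a blockwise operation to the graph; nothing is computed until the cube is saved.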

Suggested fix:

The regridding operation should either accept a 0-d array and return it, so that dask can infer the output dtype, or we should pass the meta argument to map_blocks to set the dtype explicitly. Alternatively, we could make the regridding operation return data of the same dtype it was given, which would also resolve the issue (dask's assumption that the dtype is unchanged would then be correct), but that would be a breaking change for users who rely on the current dtype change.

Other:

It's worth noting that in the example script the final print of tmp appears to show a lazy array. However, because cube.data has been called, this is actually a dask representation of an already-realised array; the data has not become lazy again.
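This distinction can be seen with dask alone (a sketch, not iris code): wrapping an already-realised numpy array in a dask array prints a dask.array<array, ...> repr, just like the second tmp in the script output, even though no computation remains deferred.

```python
import numpy as np
import dask.array as da

realised = np.ones((4, 4), dtype=np.float64)

# Wrapping in-memory data still yields a dask.array<array, ...> repr:
# it looks lazy, but every value already exists in memory.
wrapped = da.from_array(realised)
print(wrapped)
```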

wjbenfold commented 2 years ago

After a bit more investigation, it looks like at least sometimes dask does manage to work out that the return type should be float64, but that inference is then overridden by the dtype specified in map_complete_blocks' call to map_blocks: https://github.com/SciTools/iris/blob/4abaa8f5d4be918b2e0b7bd42fbcc1e0a0196dd6/lib/iris/_lazy_data.py#L388

I'm not sure where to fix this. Given that the regridder doesn't let the test array through correctly (and doing so is complicated by dask's compute_meta not always using a size-0 array), it seems like map_complete_blocks should set dtype and meta correctly. But that means inferring them from the function passed to it, which it can't do for the same reasons dask can't. So I think it should be incumbent on functions calling map_complete_blocks to pass arguments that specify dtype and meta where required.

This shouldn't be a breaking change because it's hidden away in _lazy_data.

Does it seem a sensible plan to give map_complete_blocks some extra optional arguments and stop it from assuming the out-type will match the in-type?
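A sketch of what such optional arguments might look like, using a hypothetical wrapper (this is not iris's actual map_complete_blocks signature, just an illustration of forwarding caller-supplied dtype/meta to map_blocks instead of assuming the output dtype matches the input):

```python
import numpy as np
import dask.array as da

def map_complete_blocks_sketch(array, func, dtype=None, meta=None):
    # Hypothetical: if the caller knows the output dtype, build a meta
    # array from it so dask doesn't assume the dtype is unchanged.
    if meta is None and dtype is not None:
        meta = np.empty((0, 0), dtype=dtype)
    return da.map_blocks(func, array, meta=meta)

mask = da.ones((8, 8), dtype=np.int32)
out = map_complete_blocks_sketch(mask, lambda b: b * 0.25, dtype=np.float64)
print(out.dtype)  # float64, as declared by the caller
```

With no dtype or meta supplied, the wrapper falls back to dask's own inference, so existing callers would be unaffected.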

github-actions[bot] commented 11 months ago

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.

trexfeathers commented 11 months ago

This still seems like a genuine problem and should be followed up when time allows.