[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements - Githubissues

ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.

https://www.esmvaltool.org

Apache License 2.0

42 stars 38 forks source link

[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

Closed valeriupredoi closed 1 month ago

valeriupredoi commented 1 month ago

extract_time() is causing an attribute loss that leads to fill_value not being taken into account in the next preprocessor, see example code below:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [extract_time(cube, **time_slice) for cube in cubes]

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}
for c in cubes:
    reg_cub = c.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))

toggle the time slicing and if it's on: one gets some infs in the mean computation with lazy data ie values of 1e+36 for data points, since the dataset looks like:

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

This is happening only with the new Dask; and not with the old 2024.7.1, see #2503

valeriupredoi commented 1 month ago

I can't see where the issue is coming from since if I look at a cube (netCDF4 Dataset) before and after extract time, it looks the same from netCDF4's point of view:

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (62, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

so this means iris is missing something it needs

valeriupredoi commented 1 month ago

the code above, edited for the last cube in the list:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cube = cubes[-1]
cubes = extract_time(cube, **time_slice)

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}

reg_cub = cube.regrid(**regrid_kwargs)
print(np.mean(reg_cub.data))

returns a valid numerical result - my brain is currently fried from trying to understand wth is going on in here, so am just gonna call it a day, for now.

valeriupredoi commented 1 month ago

OK got the bugger (almost)! It proves out if one constructs a list of cubes, then the bug creeps in:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}

for cube in cubes:
    reg_cub = cube.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))

this will make the output from regrid to contain 1e36s instead of masked values for them!

valeriupredoi commented 1 month ago

even simpler:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]

print(cubes[0].data)

that cube will have 1e36s inside its data instead of masked elements!

valeriupredoi commented 1 month ago

OK - pushing the enemy even closer to the town center: it appears that the problem is with the sample data loader - saving the "problem" cube to disk, and loading from there ie:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

# cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
# iris.save(cubes[-1], "problem_cube.nc")
# cubes = [cubes[-1]]
cubes = [iris.load_cube("problem_cube.nc")]
cubes = [extract_time(cube, **time_slice) for cube in cubes]
print(cubes[0].data)

will result in correct behaviour of the masked bit

valeriupredoi commented 1 month ago

OK the core_data() members differ significantly for those two cubes, even though they should be identical:

dask.array<concatenate, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(365, 2, 3, 2), chunktype=numpy.ndarray>
dask.array<array, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(1095, 2, 3, 2), chunktype=numpy.ndarray>

valeriupredoi commented 1 month ago

blithering Hell! I isolated the problem - it's an iris issue, here the MRE/MRC:

import iris
import numpy as np

c1 = iris.load_cube("cubb-1.nc")
c2 = iris.load_cube("cubb-2.nc")

# apply slice to concatenated cube
slicer = (
    np.random.choice(a=[False, True], size=(730,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None)
)

# can use this to slice each of cubes c1 and/or c2
slicer1 = (
    np.random.choice(a=[False, True], size=(365,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None)
)

cube = iris.cube.CubeList([c1, c2]).concatenate_cube()
cubes = cube[slicer]
print("After slicing", cubes.data)

Gonna close this, after I open an issue at iris :beer:

valeriupredoi commented 1 month ago

issue in iris https://github.com/SciTools/iris/issues/6109 - will close this, and open a separate issue that monitors progress on the iris issue