valeriupredoi closed this issue 1 month ago
I can't see where the issue is coming from: if I look at a cube (netCDF4 `Dataset`) before and after `extract_time`, it looks the same from netCDF4's point of view:
Before `extract_time`:

```
<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions:
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used
```

After `extract_time`:

```
<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions:
current shape = (62, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used
```
So iris must be missing something it needs. Here is the code from above, edited to use the last cube in the list:
```python
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {
    'start_year': 2001, 'end_year': 2002,
    'start_month': 12, 'end_month': 2,
    'start_day': 1, 'end_day': 1,
}
# extract time on the last cube, then regrid the extracted cube
cube = cubes[-1]
cube = extract_time(cube, **time_slice)
regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}
reg_cub = cube.regrid(**regrid_kwargs)
print(np.mean(reg_cub.data))
```
This returns a valid numerical result. My brain is currently fried from trying to understand what on earth is going on here, so I am just going to call it a day for now.
OK, got the bugger (almost)! It turns out that if one constructs a list of cubes, the bug creeps in:
```python
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {
    'start_year': 2001, 'end_year': 2002,
    'start_month': 12, 'end_month': 2,
    'start_day': 1, 'end_day': 1,
}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]
regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}
for cube in cubes:
    reg_cub = cube.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))
```
This makes the output from regrid contain 1e36s instead of masked values!
even simpler:
```python
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
time_slice = {
    'start_year': 2001, 'end_year': 2002,
    'start_month': 12, 'end_month': 2,
    'start_day': 1, 'end_day': 1,
}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]
print(cubes[0].data)
```
That cube will have 1e36s inside its data instead of masked elements!
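For clarity, the difference between a healthy masked array and one whose mask has been lost can be sketched with plain NumPy (a standalone illustration, not the actual cube data; `FILL` is the netCDF4 default float fill value quoted in the dumps above):

```python
import numpy as np

FILL = 9.969209968386869e+36  # default netCDF4 _FillValue for floats

# healthy case: the bad point is hidden behind the mask
healthy = np.ma.masked_equal(np.array([1.0, FILL, 3.0]), FILL)
print(np.ma.is_masked(healthy))  # True
print(healthy.mean())            # 2.0 -- the fill value is ignored

# broken case: the mask is gone and the fill value leaks into the data
broken = healthy.filled(FILL)
print(np.ma.is_masked(broken))   # False
print(broken.mean())             # ~3.3e+36 -- garbage
```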
OK, pushing the enemy even closer to the town centre: it appears that the problem is with the sample data loader. Saving the "problem" cube to disk and loading it from there, i.e.:
```python
import iris
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

# cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
time_slice = {
    'start_year': 2001, 'end_year': 2002,
    'start_month': 12, 'end_month': 2,
    'start_day': 1, 'end_day': 1,
}
# iris.save(cubes[-1], "problem_cube.nc")
# cubes = [cubes[-1]]
cubes = [iris.load_cube("problem_cube.nc")]
cubes = [extract_time(cube, **time_slice) for cube in cubes]
print(cubes[0].data)
```
results in the correct masking behaviour.
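A rough intuition for why the round-trip helps, assuming the fill value is written to the file as `_FillValue`: on read, netCDF re-applies the fill value as a mask, which is roughly the NumPy equivalent of the following (a sketch, not the actual iris I/O path):

```python
import numpy as np

FILL = 9.969209968386869e+36  # default netCDF4 _FillValue for floats

# in-memory array whose mask was lost: fill values sit in the raw data
lost = np.array([1.0, FILL, 3.0])

# on load, the reader re-masks anything matching _FillValue,
# rebuilding the mask from the raw values
recovered = np.ma.masked_values(lost, FILL)
print(recovered.mean())  # 2.0 -- masking behaves correctly again
```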
OK, the `core_data()` members differ significantly for those two cubes, even though they should be identical:

```
dask.array<concatenate, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(365, 2, 3, 2), chunktype=numpy.ndarray>
dask.array<array, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(1095, 2, 3, 2), chunktype=numpy.ndarray>
```
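The two reprs can be reproduced with plain Dask: lazily concatenating per-year pieces yields a multi-chunk `concatenate` graph, while wrapping one already-realized array (as obtained after saving and reloading) yields a single-chunk `array` graph. A minimal sketch, using zero-filled stand-ins for the real data:

```python
import dask.array as da
import numpy as np

shape = (365, 2, 3, 2)

# three lazy one-year pieces, as a timeseries loader might produce
years = [da.from_array(np.zeros(shape, dtype=np.float32)) for _ in range(3)]
cat = da.concatenate(years, axis=0)
print(cat)  # dask.array<concatenate, ..., chunksize=(365, 2, 3, 2), ...>

# one realized three-year array, as after a save/reload round-trip
single = da.from_array(np.zeros((1095, 2, 3, 2), dtype=np.float32))
print(single)  # dask.array<array, ..., chunksize=(1095, 2, 3, 2), ...>
```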
Blithering Hell! I have isolated the problem - it is an iris issue. Here is the MRE:
```python
import iris
import numpy as np

c1 = iris.load_cube("cubb-1.nc")
c2 = iris.load_cube("cubb-2.nc")

# apply slice to the concatenated cube
slicer = (
    np.random.choice(a=[False, True], size=(730,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None),
)

# can use this to slice each of cubes c1 and/or c2
slicer1 = (
    np.random.choice(a=[False, True], size=(365,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None),
)

cube = iris.cube.CubeList([c1, c2]).concatenate_cube()
cubes = cube[slicer]
print("After slicing", cubes.data)
```
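The same access pattern can be set up without the `cubb-*.nc` files, using synthetic lazy masked arrays as hypothetical stand-ins for the two cubes' payloads. With a healthy Dask, the boolean-sliced result is still a masked array; under the regression the mask can be silently dropped:

```python
import dask.array as da
import numpy as np

# hypothetical stand-ins for the lazy masked payloads of the two cubes
d1 = da.ma.masked_less(da.random.random((365, 2, 3, 2)), 0.1)
d2 = da.ma.masked_less(da.random.random((365, 2, 3, 2)), 0.1)
cat = da.concatenate([d1, d2], axis=0)

# same slicer as in the MRE: boolean index on time, full slices elsewhere
slicer = (
    np.random.choice(a=[False, True], size=(730,)),
    slice(None),
    slice(None),
    slice(None),
)
sub = cat[slicer].compute()

# number of kept time steps matches the boolean index
print(sub.shape[0] == int(slicer[0].sum()))
# with a healthy Dask this is np.ma.MaskedArray; under the
# regression the mask may be dropped
print(type(sub))
```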
Gonna close this, after I open an issue at iris :beer:
Issue opened in iris: https://github.com/SciTools/iris/issues/6109 - will close this and open a separate issue to monitor progress on the iris issue.
`extract_time()` is causing an attribute loss that leads to `fill_value` not being taken into account in the next preprocessor; see the example code. Toggle the time slicing, and if it's on, one gets `inf`s in the mean computation with lazy data, i.e. values of 1e+36 for data points. This is happening only with the new Dask, and not with the old 2024.7.1; see #2503.
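The `inf`s in the mean are consistent with unmasked 1e+36 fill values in float32 data: summing even a modest number of them overflows float32. A standalone illustration, not the actual dataset:

```python
import numpy as np

FILL = np.float32(9.969209968386869e+36)

# 1000 leaked fill values in float32: the running sum exceeds
# the float32 maximum (~3.4e38), so the mean overflows to inf
data = np.full((1000,), FILL, dtype=np.float32)
print(data.mean())  # inf
```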