Issues with preprocessors that use fx variables

schlunma commented 3 years ago

I found some small issues with the new implementation of the fx preprocessors in #999 (great job btw there @sloosvel!).

Issues

Not directly related to #999, but do we have a list of all preprocessors for which fx_variables can be specified? Maybe I'm blind, but I didn't find it in the documentation. Basically the list given here: https://github.com/ESMValGroup/ESMValCore/blob/fee5d541dc871efe72ab972a6a52195c452d16eb/esmvalcore/_recipe.py#L482-L485
For mask_landsea we had checks for the shape of the fx variables in place. If the shapes did not match, Natural Earth masks were used. Right now, if the shapes do not match, an error is printed as reported by @LisaBock here: ValueError: Dimensions of tas and sftof cubes do not match. Cannot broadcast cubes. This is in particular problematic since for mask_landsea, the tool automatically adds the fx variables sftlf and sftof if none is specified. Since sftlf is always checked first, the tool is most likely to fail if an ocean variable is specified since the shapes of sfltf and ocean variables differ for most models. For example, the first recipe fails with the error ValueError: Dimensions of tos and sftlf cubes do not match. Cannot broadcast cubes., and the second one runs fine:

Recipe 1 (fails)

```yml preprocessors: test: mask_landsea: mask_out: land diagnostics: test: variables: tos: preprocessor: test project: CMIP5 mip: Omon exp: historical start_year: 2000 end_year: 2000 ensemble: r1i1p1 additional_datasets: - {dataset: CanESM2} scripts: null ```

Recipe 2 (works)

```yml preprocessors: test: mask_landsea: mask_out: land fx_variables: sftof: diagnostics: test: variables: tos: preprocessor: test project: CMIP5 mip: Omon exp: historical start_year: 2000 end_year: 2000 ensemble: r1i1p1 additional_datasets: - {dataset: CanESM2} scripts: null ```

For weighting_landsea_fraction, we had similar checks. The tool tried sftlf and sftof, and only if both had different shapes, an error was printed. Right now, due to the prioritization of sftlf, an error is e.g. printed for every ocean variables that is not on an atmospheric grid (thus does not match the shape of sftlf), even though sftof would work just fine.
As far as I can tell from the code (didn't test though!) the remaining preprocessors should behave similarly to the old version (the error messages might differ, though).
The following message is now printed to the log. I think we should demote this to a debug message and maybe only print the short_name of the fx variables as info message.

Log

``` 2021-05-05 08:28:29,557 UTC [14161] INFO Using fx_files: {'sftlf': {'alias': 'CanESM2', 'dataset': 'CanESM2', 'diagnostic': 'test', 'end_year': 2000, 'ensemble': 'r0i0p0', 'exp': 'historical', 'filename': '/pf/b/b309141/work/CMIP5_DKRZ/CCCma/CanESM2/historical/fx/atmos/fx/r0i0p0/v20120410/sftlf/sftlf_fx_CanESM2_historical_r0i0p0.nc', 'frequency': 'fx', 'institute': ['CCCma'], 'long_name': 'Land Area Fraction', 'mip': 'fx', 'modeling_realm': ['atmos'], 'original_short_name': 'sftlf', 'preprocessor': 'test', 'project': 'CMIP5', 'recipe_dataset_index': 0, 'short_name': 'sftlf', 'standard_name': 'land_area_fraction', 'start_year': 2000, 'units': '%', 'variable_group': 'sftlf'}, 'sftof': {'alias': 'CanESM2', 'dataset': 'CanESM2', 'diagnostic': 'test', 'end_year': 2000, 'ensemble': 'r0i0p0', 'exp': 'historical', 'filename': '/pf/b/b309141/work/CMIP5_DKRZ/CCCma/CanESM2/historical/fx/ocean/fx/r0i0p0/v20130119/sftof/sftof_fx_CanESM2_historical_r0i0p0.nc', 'frequency': 'fx', 'institute': ['CCCma'], 'long_name': 'Sea Area Fraction', 'mip': 'fx', 'modeling_realm': ['ocean'], 'original_short_name': 'sftof', 'preprocessor': 'test', 'project': 'CMIP5', 'recipe_dataset_index': 0, 'short_name': 'sftof', 'standard_name': 'sea_area_fraction', 'start_year': 2000, 'units': '%', 'variable_group': 'sftof'}} for variable tos during step mask_landsea ```

Possible solution

The main issues here are 2. and 3. I think both can be fixed if we don't raise an exception here and here if the shapes of the ancilliary files do not match the cube. If a preprocessor cannot handle the case that no fx files are found/the shapes differ, the errors should be raised in the corresponding preprocessing function.

It would be also nice to add some tests for these cases, similiar to the ones we had before.

schlunma commented 3 years ago

One additional thought: For some fx preprocessors we automatically add fx files if none are given (masking and land fraction weighting), for some we don't do this (area and volume statistics):

https://github.com/ESMValGroup/ESMValCore/blob/fee5d541dc871efe72ab972a6a52195c452d16eb/esmvalcore/_recipe.py#L471-L480

Would it make sense to unify this?

sloosvel commented 3 years ago

I think a good middle point solution would be to raise a warning instead of an exception during the broadcasting when adding the variables: if the shapes do not match, the variable will not be added into the cube. I think this should make both versions of the recipe work.

One additional thought: For some fx preprocessors we automatically add fx files if none are given (masking and land fraction weighting), for some we don't do this (area and volume statistics)

I don't know, this was the old behaviour as well. Not sure what was the reason for that.

schlunma commented 3 years ago

I think a good middle point solution would be to raise a warning instead of an exception during the broadcasting when adding the variables: if the shapes do not match, the variable will not be added into the cube. I think this should make both versions of the recipe work.

Yes, fully agree! But maybe only raise a warning if no fx files could be added, so that we don't get unnecessary warnings if sftof cannot be added for an atmospheric variables but sftlf is perfectly fine.

One additional thought: For some fx preprocessors we automatically add fx files if none are given (masking and land fraction weighting), for some we don't do this (area and volume statistics)

I don't know, this was the old behaviour as well. Not sure what was the reason for that.

Right. Any opinions on this @ESMValGroup/esmvaltool-coreteam? I think it would make sense to add them as default. This shouldn't affect any recipes since if there are issues with the fx variables (no dataset available, wrong shape) they will be ignored and the tool will fall back to the current calculation of the preprocessors.

schlunma commented 3 years ago

One more thing: I just realized that the one test is failing since #999: https://github.com/ESMValGroup/ESMValCore/runs/2509144545?check_suite_focus=true#step:7:1581 This shouldn't be hard to fix though.

jvegreg commented 3 years ago

One more thing: I just realized that the one test is failing since #999: https://github.com/ESMValGroup/ESMValCore/runs/2509144545?check_suite_focus=true#step:7:1581 This shouldn't be hard to fix though.

I don't know. I have this error locally but it is not detected by the CI (see #1099 as an example)

sloosvel commented 3 years ago

I think it would make sense to add them as default. This shouldn't affect any recipes since if there are issues with the fx variables (no dataset available, wrong shape) they will be ignored and the tool will fall back to the current calculation of the preprocessors.

I think that makes sense, especially for the volume_statistics step because there is only one option for the volume variable (volcello).

The only thing about this approach that is a bit inconvenient is that it would load and check fx_variables that are not needed, and if those variables do not pass the checks it could become a bit confusing ("Why is my recipe crashing because areacella does not pass the checks, if I only need areacello for my ocean variables"). But that can already happen for the mask_landsea step anyway.

valeriupredoi commented 3 years ago

maybe add an extra option a la use_other_available_fx_variables: True in recipe? We do want to use the NE masks, as it was before, and as @schlunma well points out, in case none of the fx files are available, but not for certain cases (like sftgif or volcello) :beer:

sloosvel commented 3 years ago

Maybe something like checking the mip of the variable? If it's related to an ocean realm it should start with O (and SI for the ice), and areacello and sftof could be set as the default options. And if that is not the case, set areacella and sftlf as the default ones.

schlunma commented 3 years ago

I wouldn't make this dependent on the variable. Sometimes it might be useful when an atmospheric variables has access to e.g. sftof, for example when they share the grid and no sftlf is available.

The only thing about this approach that is a bit inconvenient is that it would load and check fx_variables that are not needed, and if those variables do not pass the checks it could become a bit confusing ("Why is my recipe crashing because areacella does not pass the checks, if I only need areacello for my ocean variables"). But that can already happen for the mask_landsea step anyway.

You could add a try ... except CMORCheckError as exc ... raise CMORCheckError("...") from exc to the checks/fixes here and add a more detailed error description that contains the preprocessor that originally added the fx variable and a note that this error can be avoided by explicitly specifying an fx variable.

sloosvel commented 3 years ago

I think the easiest thing would be to check the shape after loading but before cmor checking. If the fx var cannot be broadcasted to the cube, discard the fx_variable and try the next one. I'll finish this next week though.

tomaslovato commented 3 years ago

@sloosvel @schlunma I would like to address also another inconsistency in preprocessors within _time.py using the aggregated_by iris function that do not preserve the fx field as cell_measures. A similar issue occurs also in _vertical_interpolate function that returns a new cube using _create_cube that is not handling the presence of cell_measures in the source cube.

As I understood from iris documentation cell_measures are not yet preserved in aggregated_by operation , so I thought about a workaround function, like the following

def aggregated_by(cube, coords, operator):
    """Compute iris aggregation preserving cell_measures.

    Parameters
    ----------
    cube: iris.cube.Cube
        input cube.

    coords: list of coord names
        Coordinate(s) over which group aggregation is to be performed.

    operator: str
        Select operator to apply.

    Returns
    -------
    iris.cube.Cube
        Cube aggregated upon operator

    """
    from ._ancillary_vars import add_fx_variables
    if cube.cell_measures():
        for measure in cube.cell_measures():
            fx_variables = cube.cell_measure(
                measure.standard_name).attributes['fx_variables']
            cube = cube.aggregated_by(coords, operator)
            cube = add_fx_variables(cube, fx_variables, CheckLevels.DEFAULT)
    else:
        cube = cube.aggregated_by(coords, operator)

    return cube

which requires the following change to _ancillary_vars.py to keep track of the original fx input file(s)

@@ -163,6 +163,8 @@ def add_fx_variables(cube, fx_variables, check_level):
         if isinstance(fx_info['filename'], str):
             fx_info['filename'] = [fx_info['filename']]
         fx_cube = _load_fx(fx_info, check_level)
+        # store fx_variables for reuse
+        fx_cube.attributes['fx_variables'] = fx_variables

         measure_name = {
             'areacella': 'area',

sloosvel commented 3 years ago

Thanks for poiting that out @tomaslovato. Since some fx vars can be time dependant, would it make sense to also perform the time statistics on them? Maybe those preprocessor steps could check the presence of cell measures and work on both the var cube and the measure cube, and then re-add the measures at the end. Instead of working with the original files again.

schlunma commented 3 years ago

@tomaslovato Unfortunately I think it's not that easy. The shape of the aggregated cube does (most likely) not match the shape of the fx files, so they cannot simply be added.

It does also not make sense to simply perform the same statistic that is used on the data on the fx files. For example, if you compute the sum of the data over time, it does not make sense to calculate the sum of the cell areas over time. I pretty sure that if it was that simple, iris would have implemented that.

A possible solution for time independent fx variables would be to specify only the relevant dimensions in add_cell_measure or add_ancilliary_variable that the fx variable actually cover (e.g. latitude and longitude) without broadcasting them before. If a statistic is calculated over a dimension that is not covered by the fx variables (e.g. time), aggregated_by and collapsed happily preserve the cell measures/ancialliary vars:

```py >>> time_coord = iris.coords.DimCoord([0, 1], var_name='time', units='day') >>> lat_coord = iris.coords.DimCoord([0.0], var_name='lat', units='rad') >>> year_coord = iris.coords.AuxCoord([1900, 1900], var_name='year') >>> x = iris.cube.Cube([[1.0], [2.0]], var_name='x', dim_coords_and_dims=[(time_coord, 0), (lat_coord, 1)]) >>> x.add_aux_coord(year_coord, 0) >>> area = iris.coords.CellMeasure([100.0], var_name='areacella') >>> x.add_cell_measure(area, 1) >>> print(x) x / (unknown) (time: 2; lat: 1) Dimension coordinates: time x - lat - x Auxiliary coordinates: year x - Scalar cell measures: areacella >>> new_x = x.aggregated_by('year', iris.analysis.MEAN) >>> print(new_x) x / (unknown) (time: 1; lat: 1) Dimension coordinates: time x - lat - x Auxiliary coordinates: year x - Scalar cell measures: areacella Cell methods: mean: year >>> new_x = x.collapsed('time', iris.analysis.MEAN) >>> print(new_x) x / (unknown) (lat: 1) Dimension coordinates: lat x Scalar coordinates: time: 0 day, bound=(0, 1) day year: 1900, bound=(1900, 1900) Scalar cell measures: areacella Cell methods: mean: time >>> ```

However, this would involve quite some changes in the code, since the broadcasting has to be done in other preprocesser functions.

For time-dependent fx variables the situation is more difficult. One approach could be to always use mean for the statistics of the fx variables, but I pretty sure this is not correct in general.

tomaslovato commented 3 years ago

@schlunma thanks for your feedback! I may guess that when I did the test with the time statistics my version of iris was not up to date ... as I tried again and cell_measures are preserved.

However the issue is still valid for _vertical_interpolate that creates a new cube and cell_measures are not accounted for. In some case by reordering the preprocessor we can work around this problem, but I'm not sure it will always be possible.

ESMValGroup / ESMValCore

Issues with preprocessors that use fx variables #1096

Issues

Possible solution