Zonal and meridional statistics preprocessor

ledm commented 5 years ago

This is an extension of Issue #1117 and PR #1123.

The following jobs need to be completed for the documentation paper (Table 1).

[x] Change zonal_means preprocesor function into two functions: zonal_statistics and meridional_statistics.
[x] Rename the mean_type to operator, like in the new area_statistics
[x] Create associated tests
[x] Add the new preprocessors to the ocean example recipe
- [x] Correct ocean example recipe with correct fx_files.
[x] Move the get_iris_analysis_operation function in a new file, _shared.py in the preprocesor folder.
- Is it sensible to put an artificial list to limit this function?
[x] Use this new function in the volume_statistics function, and add documentation to the header of that function.
[x] Add support for fx_files, and use cell volume to weight the means.
[x] Write tests which include fx_files for area_statistics, volume_statistics, zonal and meridional statistics.

ledm commented 5 years ago

I think that the absence of fx_files is a significant mathematical problem that we will need to address in the meridional_statistics function when using the mean operator.

In the zonal direction, the weighting is less of a problem.

ledm commented 5 years ago

One thing I'm working on here is adding fx_files to the recipe_ocean_example.yml recipe - basically making it accurate!

When ESMValTool is unable to locate fx_files, everything seems to continue as normal. @valeriupredoi, I would expect a fatal error to be given if an FX file was requested but not found. Is that not the case?

valeriupredoi commented 5 years ago

if the masking is done in the preprocessor, if fx files are not found, the code defaults to applying Natural Earth masks instead

ledm commented 5 years ago

It doesn't work like this in the area_statistics & volume_statistics functions. They too try to calculate the area or volume fields, but I feel that this is very inaccurate. We'd be better off if it fails when there's no fx_files. After all, area and volume descriptions are a CMIP5/6 requirement, right?

Should the error message be added in the get_input_fx_filelist function of _data_finder.py or in the area_statistics & volume_statistics functions?

ledm commented 5 years ago

So i guess the problem I'm seeing right now is: Why can't ESMValTool find the FX files anymore?

Another point is that none of the preprocessor tests include fx files. How can we write tests that include fx files?

ledm commented 5 years ago

Help @valeriupredoi! None of my recipes are able to find the fx_files anymore on my local desktop!

Can anyone else test a recipe that requires an fx_file to be loaded? @schlunma - have you had this problem?

valeriupredoi commented 5 years ago

@ledm I literally just ran a recipe with fx files grabbed from a local repo, all fine, are you using CMIP6? If so, there are missing input structures for fx files in config-developer

ledm commented 5 years ago

Which recipe?

Are you 100% sure that it did actually load and use the fx files? Several of the preprocessors have a fall-back option if they can't find the fx_files.

Also, in my case, I'm clearly seeing that ESMValTool can see the fx files, but the preprocessor can not find them!

valeriupredoi commented 5 years ago

well they are needed in the diagnostic so that'd fail if there was no fx file in the metadata file. I ran the recipe_autoassess_landurface_surfrad recipe, which one are you running?

ledm commented 5 years ago

I'm running this one:

https://github.com/ESMValGroup/ESMValTool/blob/version2_development_Zonal_Meridional_statistics_24/esmvaltool/recipes/recipe_ocean_example.yml

valeriupredoi commented 5 years ago

cheers, will test now with that

valeriupredoi commented 5 years ago

on the branch now, it finds the fx files and it says it's using them but I am running in this before anything else is done diagnostic-wise:

ValueError: Unknown preprocessor function 'zonal_statistics', choose from: download, fix_file,...

ledm commented 5 years ago

Yes, I'm working on that preprocessor in branch development_Zonal_Meridional_statistics_24. However, you should be able to test the recipe by commenting out that diagnostic (and the other zonal & Meridional diagnostics.) Thanks!

ledm commented 5 years ago

If you want to look at the same Core branch, it's in development_Zonal_Meridional_statistics_24.

ledm commented 5 years ago

So, my conclusion to this is that there's been a change in the way that recipes work. _Edit: Either that or fxfiles have never worked as I thought they did!

In order to get the recipe to work, I needed to move the fx_files: [areacello,] line from the diagnostics section to the preprocessor section! This encounters the problem that the fx_files grid are not passed to the other preprocessors. Ie, if the preprocessor includes some operator like a extract_region, the fx_file is not subjected to this function. This means that the fx data does not represent the preprocessed data anymore.

bouweandela commented 5 years ago

It doesn't work like this in the area_statistics & volume_statistics functions. They too try to calculate the area or volume fields, but I feel that this is very inaccurate. We'd be better off if it fails when there's no fx_files. After all, area and volume descriptions are a CMIP5/6 requirement, right?

What about observational datasets? The function should work for those too I think.

bouweandela commented 5 years ago

Another point is that none of the preprocessor tests include fx files. How can we write tests that include fx files?

Load a variable of interest + fx variable from file using iris, slice both cubes so they are small (i.e no more than a couple of points) and look at the data and coordinates. You can use this to create the cubes needed for the tests. Does that help?

bouweandela commented 5 years ago

Note that there are two ways in which the recipe uses fx_files: 1) Specified in the variable sections, this will just find fx files and pass their paths on to your diagnostic script. This is in the process of being changed #21, fx variables should be treated as any other variables. I see you are using this in your recipe. 2) Use by preprocessor functions: there is no need to mention fx files in the recipe at all, this is automatically filled in. The code that does this is here: https://github.com/ESMValGroup/ESMValCore/blob/development/esmvalcore/_recipe.py#L364-L422

ledm commented 5 years ago

It doesn't work like this in the area_statistics & volume_statistics functions. They too try to calculate the area or volume fields, but I feel that this is very inaccurate. We'd be better off if it fails when there's no fx_files. After all, area and volume descriptions are a CMIP5/6 requirement, right?

What about observational datasets? The function should work for those too I think.

How should we differentiate in the preprocessor whether a preprocessor requires an fx file or not? I think it should fail for model data when no FX file is available. Especially when the model uses irregular grids. There's an argument that regular grids are easier for Iris to calculate - however Bill Little mentioned that the iris.analysis.cartography.area_weights calculator might be removed in the near future anyway!

Another point is that none of the preprocessor tests include fx files. How can we write tests that include fx files?

Load a variable of interest + fx variable from file using iris, slice both cubes so they are small (i.e no more than a couple of points) and look at the data and coordinates. You can use this to create the cubes needed for the tests. Does that help?

Sort of. Will the new sliced data and fx file need to be added to the git repository in testing data or something?

Note that there are two ways in which the recipe uses fx_files:

Specified in the variable sections, this will just find fx files and pass their paths on to your diagnostic script. This is in the process of being changed #21, fx variables should be treated as any other variables. I see you are using this in your recipe.

Use by preprocessor functions: there is no need to mention fx files in the recipe at all, this is automatically filled in. The code that does this is here: https://github.com/ESMValGroup/ESMValCore/blob/development/esmvalcore/_recipe.py#L364-L422

I just spoke with V and I think it would be worth shelving this part of the work until #21 has completed.

bouweandela commented 5 years ago

How should we differentiate in the preprocessor whether a preprocessor requires an fx file or not? I think it should fail for model data when no FX file is available.

I think this is something we could implement in _recipe.py, because we know what dataset we are building a preprocessor for there. Can you make a separate issue for it?

Will the new sliced data and fx file need to be added to the git repository in testing data or something?

I would just create the iris cube in code and save it to a temporary file if you need to read it from file for your test.

ledm commented 5 years ago

Created issue #103.

valeriupredoi commented 5 years ago

@ledm your mysteriously vanishing fx files are behaving like this because you need to specify the fx_files key argument under each of those _statistics preprocesor calls in the recipe and not in the variable: the code in _recipe.py reads:

    for step in ('area_statistics', 'volume_statistics', 'zonal_statistics',
                 'meridional_statistics'):
        if settings.get(step, {}).get('fx_files'):
            settings[step]['fx_files'] = get_input_fx_filelist(
                variable=variable,
                rootpath=config_user['rootpath'],
                drs=config_user['drs'],
            )

the if settings.get(step, {}).get('fx_files') is null if fx files are not in the step's setting. Also note that whoever coded this up here (@bouweandela ?) forgot to add the needed

variable['fx_files'] = settings.get(step, {}).get('fx_files')

so that the block reads:

   for step in ('area_statistics', 'volume_statistics', 'zonal_statistics',
                 'meridional_statistics'):
        if settings.get(step, {}).get('fx_files'):
            variable['fx_files'] = settings.get(step, {}).get('fx_files')
            settings[step]['fx_files'] = get_input_fx_filelist(
                variable=variable,
                rootpath=config_user['rootpath'],
                drs=config_user['drs'],
            )

with that in all you fx ducks are in line (if the files actually exist on ESGF or your local DB).

There is however, some bad coding in the stats functions themselves because I get this (after some running):

ValueError: Fx area (216, 360) and dataset (48, 196, 111) shapes do not match.

which suggests to me that you are applying a 2d mask on 3d data :beer:

ledm commented 5 years ago

Thanks V!

Did we always have to specify the fx_file in the preprocessor - or is that a new thing? So far, I've always been specifying it in the diagnostics. Perhaps my recipes have been wrong all along (There was no documentation or examples at the time so I refuse to feel too guilty about that!)

I can't tell for sure whats happening with your shapes not matching - as I can't tell which preprocessor/diagnostic you're running. However, this may be a problem if the grid fx does not receive the same preprocessing as the dataset. For instance, if the extract_region preprocessor reduces the grid from (216, 360) to (196, 111), then we can no longer use the same areacello to match it!

Perhaps the solution is for extract_region to apply a mask instead of change the shape of the dataset?

valeriupredoi commented 5 years ago

I dunno, may be a mistake in my recipe, I added fx_files: [areacello, ] everywhere under each of the _statistic functions for testing, maybe I should have been more careful :grin:

No, there was no documentation and in fact it is the first time I see myself the addition of fx_files: in the preprocessor, quite a mystery on our hands :world_map:

valeriupredoi commented 5 years ago

well actually here's the bugger:

    custom_order: true
    extract_levels:
      levels: [0., 10., 100., 1000.]
      scheme: linear_horizontal_extrapolate_vertical
    extract_region:
      start_longitude: -80.
      end_longitude: 30.
      start_latitude: -80.
      end_latitude: 80.
    area_statistics:
      operator: mean
      fx_files: [areacello,

you are applying a (216, 360) mask on a (48, 10, 96, 21) cube and the catcher in the area_statistic is

    if grid_areas.shape != cube.shape[-2:]:
        raise ValueError('Fx area {} and dataset {} shapes do not match.'
                         ''.format(grid_areas.shape, cube.shape))

so yeah, you need to extract the area and levels the same way for the fx mask as for the data

ledm commented 5 years ago

Exactly, that's what I've been saying!

The preprocessor needs to be applied to the fx files as well as the actual dataset. This doesn't happen if the fx_files in the preprocessor section of the recipe. If they're given at the diagnostic section of the recipe, the area_statistics preprocessor doesn't receive them!

bouweandela commented 5 years ago

Also note that whoever coded this up here (@bouweandela ?) forgot to add the needed

Use git blame if you want to find out who wrote something, e.g.

$ git blame esmvalcore/_recipe.py | grep -A6 area_statistics
18f5d3882 esmvaltool/_recipe.py   (Manuel Schlund       2019-06-05 15:00:01 +0200  416)     for step in ('area_statistics', 'volume_statistics'):
baa4fb2f4 esmvaltool/_recipe.py   (Manuel Schlund       2019-02-08 12:23:00 +0100  417)         if settings.get(step, {}).get('fx_files'):
8267a24c3 esmvaltool/_recipe.py   (Lee de Mora          2019-01-29 11:07:32 +0000  418)             settings[step]['fx_files'] = get_input_fx_filelist(
52f0a8d74 esmvaltool/_recipe.py   (Manuel Schlund       2019-02-08 12:20:16 +0100  419)                 variable=variable,
61e642c36 esmvaltool/_recipe.py   (Lee de Mora          2019-01-23 14:31:21 +0000  420)                 rootpath=config_user['rootpath'],
8267a24c3 esmvaltool/_recipe.py   (Lee de Mora          2019-01-29 11:07:32 +0000  421)                 drs=config_user['drs'],
8267a24c3 esmvaltool/_recipe.py   (Lee de Mora          2019-01-29 11:07:32 +0000  422)             )

valeriupredoi commented 5 years ago

ah I shed a tear everytime I see git blame :grin:

ledm commented 5 years ago

I'm back to work on this issue, starting from @valeriupredoi's new PR #170.

I'm slowing coming to the conclusion that we can not sensibly produce zonal or meridional statistics for irregular grids.

This is because iris can't apply iris.analysis functions along one dimension of an irregular grid. It simply does not work at the moment. The iris.analysis functions can only be applied along the x-y axis of an array, not along an arbitrary latitude-like axis of an irregular grid. (I believe that this is a fairly permanent problem and this functionality won't be ready until Iris 3 is ready in 2023 or similar!)

This means that if we want to calculate zonal or meridional statistics for irregular grids, we need to regrid the data to a regular grid, then apply the zonal or meridional statistics to that regular grid. This requires ESMValTool to regrid the fx_files (areacello or volcello), or recalculate them from the new grid. Neither option is particularly precise. Furthermore, the regrid function does not currently apply to the fx_files at the moment (see this comment. )

So, I propose that we do not support zonal or meridional statistics for irregular grids. Any thoughts?

valeriupredoi commented 5 years ago

fix in #214 now :beer:

valeriupredoi commented 5 years ago

Nope! these changes are obsolete, all is done up in development already! Yay :tada:

ledm commented 4 years ago

I'm not sure that this issue was ever resolved. As @mattiarighi just pointed out to me in an email, the zonal_statisctics and meridional_statistics preprocessors are included in the ESMValTool v2 paper, so we need to get them in.

mattiarighi commented 4 years ago

Apparently it's just about renaming zonal_means to zonal_statistics. I'm doing this. Will submit in a second.

ledm commented 4 years ago

To summarise where this issue got to, we encountered 3 main problems with zonal and meridional statistics:

Other parts of the preprocessor chain are not applied to the fx files. This means that zonal and meridional statistics can only be applied on the global scale - not regional. Also discsussed in #405.
We were unable to convince iris to apply a mean along the zonal or meridional direction on irregular grids. It only worked with regular grids.
We were unable to satisfactorily determine what to do when an FX file was needed but not found. (I think @schlunma recent work might be able to address this #432.)

Looks like there's been progres with issues 1 and 3 here, but what about issue 2?

mattiarighi commented 4 years ago

zonal_means seems to be used successfully in two recipes: recipe_collins13ipcc.yml and recipe_flato13ipcc. So, the functionality is there, it is just a matter of naming.

For consistency with the other _statistics, we should rename it to zonal_statistics.

ESMValGroup / ESMValCore

Zonal and meridional statistics preprocessor #24