ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Several minor questions about ESMValTool recipes + preprocessor. #604

Closed ledm closed 6 years ago

ledm commented 6 years ago

Hi all,

I have several questions about ESMValTool. @valeriupredoi suggested that it might be good to share these here too. Also, I'd like to suggest that we copy the questions and answers (when they arrive) into a new FAQ page on the wiki. I'm happy to do this if you agree.

  1. Can you run multiple recipes in the same esmvaltool instance?

I'm asking because I see it as a modular way to look at several different fields in a row, i.e. say we have several recipes and we only want to switch a few on at a time. Alternatively, a really easy way to switch off several diagnostics in a recipe without having to comment out dozens of lines would be really helpful.

  2. Can you create sets of models/datasets in a recipe?

For instance, in the same recipe I want diagnostic A to look at group A of datasets and diagnostic B to look at group B of datasets.

  3. Is there a way to apply a CMIP5 fix to every model?

The oxygen fix for HadGEM2-ES and HadGEM2-CC needs to be applied to all models. I'd rather not have to copy and paste everything, as that can allow bugs to creep in and makes it difficult to patch in the future, if needed.

  4. Is there a way to reduce the memory used by Iris in the preprocessor?

I want to calculate the global volume-weighted average temperature for an entire model run (i.e. monthly 3D data for 150 years). At the moment, the volume_average preprocessor (which I wrote, admittedly - mea culpa!) wants to load the entire temperature field over the entire time series, and this crashes my computer. Is there an easy way to force Iris to chunk it into smaller, manageable pieces?

  5. Is it possible for preprocessors to change the units? The depth integration function converts a concentration per volume into a concentration per area. This means that the units go from mol m-3 to mol m-2, so the units need to be changed. I want to run this command:
    result.units = Unit('m') * result.units

    but this doesn't work.

Thanks for all the help. Please let me know if you're happy with me adding these questions (once they're answered) into a FAQ.

Lee

valeriupredoi commented 6 years ago

Hi @ledm, here is my set of answers:

Can you run multiple recipes in the same esmvaltool instance?

No, from the command line the executable accepts a single recipe argument. You can group the recipes you want into a single one and switch off what you don't want to run. The problem with multiple recipes is the task distribution, especially when running with multiple threads - that would produce a lot of tasks and may grind the show to a halt. Alternatively, you may be able to write a script that loops over a list of recipes and runs each only when the previous one has finished (see the sketch below)? Ideas from @bouweandela @mattiarighi and @jvegasbsc most welcome (for this and the other questions).
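For what it's worth, a minimal sketch of such a wrapper script, assuming recipes are run with the usual esmvaltool command; the recipe names are placeholders and any options you normally pass on the command line would need adding:

    import subprocess

    recipes = ['recipe_A.yml', 'recipe_B.yml']   # placeholder recipe names
    for recipe in recipes:
        # check=True stops the loop if a recipe fails; each run starts only
        # once the previous one has finished
        subprocess.run(['esmvaltool', recipe], check=True)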

Can you create sets of models/datasets in a recipe?

This is possible, but only after preprocessing. I am just about to push a generalized Autoassess diagnostics wrapper that does exactly this, but only after the whole set of datasets has passed through the preprocessor.

Is there a way to apply a CMIP5 fix to every model?

No clues - @jvegasbsc

Is there a way to reduce memory used by Iris in the preprocessor?

Yes, use slicing (there is a neat way slicing works for lat-lon slices; @jvegasbsc has implemented it in sliced regridding). Have a look at iris cube slices; I'd suggest you implement the further slicing yourself (temporal and vertical-level slicing). It uses much less memory (albeit possibly slower, but not by much; iris 1.13 is slow anyway, and iris 2.1 should be much faster, but only if dask is configured for it). Have a look at the multimodel module as well; t-plev-lat-lon slicing is well used there.
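As an illustration, a minimal sketch of temporal slicing (not the actual ESMValTool preprocessor code); it assumes the cube has a time coordinate and that the cell-volume weights are available as a masked array with the shape of a single time step:

    import numpy as np

    def volume_average_sliced(cube, volume_weights):
        """Volume-weighted global mean, computed one time step at a time."""
        means = []
        # slices_over('time') yields one (depth, lat, lon) sub-cube per time
        # step, so only a single time step is realised in memory at once
        for t_slice in cube.slices_over('time'):
            means.append(np.ma.average(t_slice.data, weights=volume_weights))
        return np.ma.array(means)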

Is it possible for preprocessors to change the units?

Units can be changed when applying the CMOR fixes; that's probably the best (and only) place to do it. If you change units in the preprocessor, certain iris operations like concatenate, add/subtract etc. will go tits up unless the units are also changed on the cubes being operated on, and in that case you may as well do it in the diagnostic.
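For reference, a tiny cf_units check of the unit algebra itself (illustrative only); whether changed units then survive the later CMOR checks is the separate question discussed above:

    from cf_units import Unit

    # multiplying by metres turns a per-volume unit into a per-area unit
    new_units = Unit('mol m-3') * Unit('m')
    assert new_units == Unit('mol m-2')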

bouweandela commented 6 years ago

Hi Lee,

In addition to V.'s, here are my answers:

1) No, you cannot run more than one recipe at a time at the moment. This could be implemented, but it would probably be a few days of work and I don't really see the advantage. To disable a particular diagnostic script in a recipe, you can replace the name of the script file with null. Disabling entire diagnostics can be done by commenting them out.

2) Yes, in various ways; can you be more specific? An example can be found in recipe_perfmetrics.yml.

3) Not at the moment. Of course we could by default fix all metadata using the CMOR table, but this is a bit risky: we somehow need to make sure that we load the right cube. At the moment this is quite strict (the right cube has to have the right standard name), but we could try to implement a more flexible approach and then fix the metadata after loading; this would avoid the expensive file copy needed by fix_file.

4) The code for that particular preprocessor function looks good; I think it should be able to evaluate lazily, i.e. without loading all data in memory. Have you tried with iris 2 and only that preprocessor function? There might be other preprocessor functions that try to load everything (e.g. by accessing the cube's data attribute or putting all data in a numpy array). Manual slicing should theoretically not be needed; the iris aggregator should take care of that by using dask (in iris 2).

5) This would require a minor modification to the code, so that the unit is read from the cube instead of from the CMOR table of the input data when extracting the metadata, but it should be possible. Please open an issue if you would like this functionality.

ledm commented 6 years ago

Hi, thanks for the answers.

  1. Thanks, I guess this was just an idea for a potential monitoring strategy. I'm happy to rule it out for now.

  2. As an example, I'd like to run the following recipe outline:

    
datasets:
  # working datasets
  dataset_group: A
  - {dataset: HadGEM2-CC, project: CMIP5, mip: Omon, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}
  - {dataset: HadGEM2-ES, project: CMIP5, mip: Omon, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}

  dataset_group: B

diagnostics:
  diag_timeseries_surface_Omon:
    variables:
      tos: # Monthly Temperature ocean 2D
        preprocessor: timeseries_surface_average
        field: TO2Ms
        dataset_group: A
    scripts: ...

  diag_timeseries_surface_Oyr:
    variables:
      chl: # annual surface chl
        preprocessor: timeseries_surface_average
        field: TO2Y
        dataset_group: B
    scripts: ....


I hope that this makes sense. At the moment, marine biogeochemistry needs to be in its own recipe, as there is almost no monthly BGC data and almost no annual ocean circulation data.

  3. It's a shame that this expensive copy is needed, especially as the problem is in the files themselves! Would it be possible to create a generic o2_fix_file.py file containing the fix_file method, then import that method into each model's fix_file method? That way there would only be one copy of the actual fix code (a rough sketch is after this list).

  4. Thanks! I'm currently using version 1.13. How do I try it out with iris 2 - is that a change to environment.yml? I've added a manual slicing method using an iris.cube.CubeList, but it's causing trouble elsewhere. The methods in regrid and multi_model are a lot more complex than my (naive) efforts so far.

  5. Thanks, I have created this as an issue: #605.
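For point 3, a rough sketch of the shared-fix idea; the module name, function name, and signature here are purely illustrative and do not match the actual ESMValTool fix interface:

    # o2_fix_file.py (hypothetical shared module)
    def fix_o2_file(filepath, output_dir):
        # the single, shared implementation of the o2 fix lives here and
        # returns the path of the fixed copy
        ...

    # HadGEM2_ES.py (hypothetical per-model fix module)
    from o2_fix_file import fix_o2_file

    def fix_file(filepath, output_dir):
        # delegate to the shared implementation instead of copy-pasting it
        return fix_o2_file(filepath, output_dir)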

bouweandela commented 6 years ago

2: This looks the same as recipe_perfmetrics.yml: just write it like this and it should work:


diagnostics:  
  diag_timeseries_surface_Omon:   
    variables:
      tos: # Monthly Temperature ocean 2D
        preprocessor: timeseries_surface_average
        field: TO2Ms
    additional_datasets:
      - {dataset: HadGEM2-CC, project: CMIP5, mip: Omon, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}
      - {dataset: HadGEM2-ES, project: CMIP5, mip: Omon, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}
    scripts: ...

  diag_timeseries_surface_Oyr:   
    variables:
      chl: # annual surface chl
        preprocessor: timeseries_surface_average
        field: TO2Y
    additional_datasets:
      - {dataset: HadGEM2-CC, project: CMIP5, mip: Oyr, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}
      - {dataset: HadGEM2-ES, project: CMIP5, mip: Oyr, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}
    scripts: ....

3) I've created an issue for further discussion here: #606. Note that our CMOR tables are just a copy of the tables here: https://github.com/PCMDI/cmip5-cmor-tables, so if you think there is a mistake, you should probably report it there.

4) Iris 2 uses dask for its arrays, and dask arrays can be larger than memory, as described here: http://dask.pydata.org/en/latest/array.html. So if you either manipulate the cubes without accessing the data attribute, or, when you do use the data attribute, only use it as a dask array, I think it should work without manually defining slices. For upgrading to iris 2 we have an open issue, #451, but for experimenting you can just locally upgrade the installed version of iris.
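On the dask point, a small stand-alone illustration of lazy evaluation (not ESMValTool code; the array shape and chunking are made up to roughly match 150 years of monthly 3D ocean data):

    import dask.array as da

    # ~45 GB of data, split into one-year chunks along the time axis
    data = da.random.random((1800, 40, 216, 360), chunks=(12, 40, 216, 360))
    global_mean = data.mean()   # lazy: builds a task graph, loads nothing
    print(float(global_mean))   # computed here, streaming chunk by chunk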

bouweandela commented 6 years ago

In fact, if it is only the mip key that is different, you could probably even write it like this:

datasets:
  - {dataset: HadGEM2-CC, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}
  - {dataset: HadGEM2-ES, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 1960, end_year: 2004}

diagnostics:  
  diag_timeseries_surface_Omon:   
    variables:
      tos: # Monthly Temperature ocean 2D
        preprocessor: timeseries_surface_average
        field: TO2Ms
        mip: Omon
    scripts: ...

  diag_timeseries_surface_Oyr:   
    variables:
      chl: # annual surface chl
        preprocessor: timeseries_surface_average
        field: TO2Y
        mip: Oyr
    scripts: ....

because variable and dataset entries are combined to determine which data to use. If a key is present in both the variable and dataset map, the dataset key takes precedence.
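For illustration only (this is not the actual recipe-parsing code), the combination rule described above behaves like a dictionary merge in which the dataset entry wins on conflicts:

    # hypothetical variable and dataset entries from a recipe
    variable = {'preprocessor': 'timeseries_surface_average', 'mip': 'Omon'}
    dataset = {'dataset': 'HadGEM2-CC', 'project': 'CMIP5', 'mip': 'Oyr'}

    combined = {**variable, **dataset}   # later (dataset) keys take precedence
    print(combined['mip'])               # prints: Oyr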

valeriupredoi commented 6 years ago

@ledm

Thanks! I'm currently using version 1.13. How do I try it out with iris 2 - is that a change to environment.yml? I've added a manual slicing method using a iris.cube.cubelist, but it's causing trouble elsewhere. The methods in regrid and multi_model are a lot more complex than my (naive) efforts so far. <<<

to add to what Bouwe says, iris 1.13 is currently the default for ESMValTool due to the very issue Bouwe mentioned and dask. For simple out-of-the-box cube slicing, have a looksee here: https://scitools.org.uk/iris/docs/v2.0/userguide/subsetting_a_cube.html, specifically section 5.2
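To give a flavour of the indexing described in that section, a short illustrative snippet assuming a 4D (time, depth, latitude, longitude) cube; the file name is hypothetical:

    import iris

    cube = iris.load_cube('thetao_Omon_HadGEM2-ES_historical_r1i1p1.nc')
    first_decade = cube[:120]   # first 120 monthly time steps
    surface_only = cube[:, 0]   # top model level only
    print(first_decade.summary(shorten=True))
    print(surface_only.summary(shorten=True))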

mattiarighi commented 6 years ago

It looks like some questions have been answered and an issue has been opened for the unanswered ones. @ledm can we close this?

ledm commented 6 years ago

All points have been addressed and I understand the solutions, but I haven't necessarily been able to apply them successfully yet. If we could keep this open a little longer so that the conversation can continue, that would be very helpful, thanks.