Integrate reformat scripts for the observations in the new framework

mattiarighi commented 6 years ago

To reformat observational data in the CMOR format used by the tool, a number of dataset-specific reformat scripts are available (reformat_scripts/obs/).

In v1.0, they were executed by running the namelist_reformat_obs namelist. This has to be ported to the new framework.

duncanwp commented 6 years ago

Hey, if you're interested in reading the observations into Iris cubes now, rather than reformatting on disk, you could use CIS. It was designed to do exactly this and has a simple plugin system for fixing all the things like metadata and coordinate systems in these datasets; see e.g. http://cis.readthedocs.io/en/stable/plugin_development.html. It can already read AERONET, MODIS and the aircraft datasets (HIPPO, SALTRACE, CONCERT etc.) in your list out of the box.

I'd be very happy to chat about how this could fit in to ESMValTool.

mattiarighi commented 6 years ago

I've started a branch for this issue version2_reformat_obs.

For now I've just created a directory /esmvaltool/cmor/cmorize_obs, where I moved one of the existing reformat scripts (cmorize_obs_AURA-TES.ncl) and some related functions. I've also created a dedicated recipe recipe_cmorize_obs.yml, which simply calls this script in the same way a diagnostic is called.

This approach has some problems, since for example interface information (such as the input path to the obs data to be cmorized specified in the config file) is not passed to the script.

I think the suggestion by @bouweandela to create a specific executable for this task would be much better, also in view of the future implementation of the CIS plugin for cmorizing observations (see #513).

One option could be:

cmorize_obs [dataset]

jvegreg commented 6 years ago

Another option is to use the same executable, but with subcommands (argparse supports this):

esmvaltool cmorize_obs ${DATASET}

This way we can extend esmvaltool functionality while maintaining a single entry point

esmvaltool run recipe.yml
esmvaltool available_recipes
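
A minimal sketch of how the subcommand idea above could look with argparse. The parser layout, argument names and help texts are assumptions made for illustration; they are not taken from the actual esmvaltool code.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per task (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="esmvaltool")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # fail early if no subcommand is given

    # esmvaltool run recipe.yml
    run = subparsers.add_parser("run", help="run a recipe")
    run.add_argument("recipe")

    # esmvaltool cmorize_obs DATASET [DATASET ...]
    cmorize = subparsers.add_parser("cmorize_obs", help="cmorize observations")
    cmorize.add_argument("datasets", nargs="+")

    # esmvaltool available_recipes
    subparsers.add_parser("available_recipes", help="list the known recipes")
    return parser


# Parse an explicit argument list instead of sys.argv to show the result.
args = build_parser().parse_args(["cmorize_obs", "AURA-TES"])
print(args.command, args.datasets)
```

Each subcommand gets its own arguments, so new functionality can be added without touching the existing ones.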

mattiarighi commented 6 years ago

That's also fine. I think we should avoid having an extra recipe for cmorizing obs as in v1, it's a bit overkill.

bouweandela commented 6 years ago

I've created a simple tool prepare_observations in branch version2_prepare_observations which uses the cis package to read a list of files and store the variables found in those files in an output directory, one file per variable. It can be installed by running the usual pip install -e . command. Make sure to first update your conda environment, because the cis package installed with pip is outdated and apparently broken. This tool allows us to use cis for reformatting observational data. The reformat scripts could then be implemented as cis plugins.

@mattiarighi Can you check how useful this is and if it works for you/matches your expectations? I did not have any observational data to test it with, so there are probably things about it that do not work (I tested with model data). Maybe you can test it with one of the observational datasets that is already supported by cis? And next try to port one of the esmvaltool 1 reformat scripts to a cis plugin, as described here and see if this is a good experience? The prepare_observations tool is very simple now, I expect we may need some extra features, but it would be nice if we could keep it to just a thin wrapper around cis, if we find that it suits our needs.

duncanwp commented 6 years ago

Thanks @bouweandela - this is looking really good.

I've now updated CIS on PyPi to the latest version (1.6.0). I tend not to update it as regularly, as I recommend people install using conda, but I should get into the habit of keeping it up to date!

bascrezee commented 6 years ago

Thanks @duncanwp for updating CIS on PyPi. I tried installing it, with 'pip install' into my ESMVal environment. I get the following error message:

Collecting iris>=1.8.0 (from cis==1.6.0)
  Could not find a version that satisfies the requirement iris>=1.8.0 (from cis==1.6.0) (from versions: 1.0.4)
No matching distribution found for iris>=1.8.0 (from cis==1.6.0)

I have already installed Iris into my existing environment:

conda list
# ...
# iris                      2.1.0                    py27_3    conda-forge
# ...

So I don't really get what this error message means. Thanks for any help !

duncanwp commented 6 years ago

Unfortunately the 'iris' package on PyPi is not the correct package (it should be scitools-iris). I'll need to update the CIS setup.py and re-release, but this may not happen until the next major release (in the next couple of months).

In the meantime, if you already have iris installed, you can install CIS with pip's 'no-dependencies' flag (--no-deps).

Apologies for the inconvenience!

mattiarighi commented 6 years ago

I keep getting requests from users on how to reformat observational data in version 2.

I think we should add the possibility to run the v1 reformat scripts in v2 (as suggested above).

The method based on CIS implemented by @bouweandela (see above) actually works, but we do not have the resources now to translate our set of reformat scripts into CIS plug-ins (as also discussed above).

Moving to CIS on the long term would be great and we should definitely do that, but at this stage we urgently need to give the users the possibility to generate CMORized observations with the existing scripts and to contribute new ones. The easiest and fastest solution would be to use the same framework of v1, which is very flexible and has multi-language support (as for the diagnostics).

I can take care of porting all existing reformat scripts to v2, also fixing metadata issues which Iris2 is raising. I just need someone to help with the framework: the only thing these scripts need is to have access to interface information, such as input/output paths and logging functions.

valeriupredoi commented 6 years ago

hey guys @mattiarighi @bouweandela @jvegasbsc - here's my take on this (started work on this branch, no PR yet, changes are too messy for a PR just yet: https://github.com/ESMValGroup/ESMValTool/tree/version2_reformat_obs_workflow) - the workflow for obs reformatting:

in _recipe.py, when the time comes to build the input_files dict, add an option from config-user called apply_reformat; if True, get the input files dictionary through a reformat.py script that will, in turn, run the reformatting if the dataset is in the reformat library, then return it in the same shape, only this time around the files will have changed location and are reformatted;
this way we can run the original v1 reformat_scripts (I guess we want only the obs ones, right?) and dump the output in a safe location that the code can check: if the reformatted files already exist, the reformatting+saving is skipped;
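
The two points above could be sketched roughly as follows. All names here (get_input_files, reformatters, cache_dir) are hypothetical, and the input files are kept in a plain list rather than the dictionary mentioned above, purely to keep the example short.

```python
import os


def get_input_files(input_files, dataset, reformatters, cache_dir,
                    apply_reformat=True):
    """Return the input files, reformatted into cache_dir when requested.

    `reformatters` is a hypothetical "reformat library": it maps a dataset
    name to a callable(src, dst) that writes the reformatted file to dst.
    """
    if not apply_reformat or dataset not in reformatters:
        # Nothing to do: hand back the original files unchanged.
        return list(input_files)

    out_dir = os.path.join(cache_dir, dataset)
    os.makedirs(out_dir, exist_ok=True)

    reformatted = []
    for src in input_files:
        dst = os.path.join(out_dir, os.path.basename(src))
        if not os.path.exists(dst):  # skip work done in an earlier run
            reformatters[dataset](src, dst)
        reformatted.append(dst)
    return reformatted
```

The caller gets back a list of the same length either way, so the rest of the workflow would not need to know whether reformatting took place.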

Now, does anybody remember how the run_executable() function was operating in v1 so we can run all those ncl, csh etc codes? I did port that crazy function last year in the first ever v2 but beats me if I remember.

How's this sound?

valeriupredoi commented 6 years ago

btw this is a massive hack and we will have to pythonize the reformat scripts sometime in the near future :grin:

mattiarighi commented 6 years ago

It sounds quite complicated; do we need to go through the esmvaltool workflow for that? As I said above, the only thing the reformat_obs scripts need is the config-user information about input and output paths and (maybe) the logging functions. They run as stand-alone scripts, mostly independent of the rest of the tool.

Pythonizing all scripts is planned in the long term in the CIS framework, what we need now is a quick solution to allow porting the almost 90 existing scripts from v1. I also would like to keep the multi-language support, since we can expect users writing their diags in NCL/R also wanting to reformat their obs using the same language.
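
One possible way to keep the multi-language support while still handing the input and output paths to a stand-alone script is sketched below. The interpreter table and the environment variable names (CMOR_IN_DIR, CMOR_OUT_DIR) are assumptions made for this example and do not describe how v1 or v2 actually pass interface information.

```python
import os

# Hypothetical mapping from script extension to the interpreter command.
INTERPRETERS = {".ncl": ["ncl"], ".R": ["Rscript"], ".py": ["python"]}


def build_call(script, in_dir, out_dir):
    """Return (command, environment) for running a reformat script."""
    ext = os.path.splitext(script)[1]
    if ext not in INTERPRETERS:
        raise ValueError("no interpreter known for " + script)
    cmd = INTERPRETERS[ext] + [script]
    # Pass the interface information through the environment, so that
    # scripts in any language can read it (variable names are made up).
    env = dict(os.environ, CMOR_IN_DIR=in_dir, CMOR_OUT_DIR=out_dir)
    return cmd, env
```

The returned pair could then be given to subprocess.check_call(cmd, env=env); an NCL script would read the paths with getenv, an R script with Sys.getenv.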

valeriupredoi commented 6 years ago

ok - new tentative workflow - talked to @mattiarighi - see https://github.com/ESMValGroup/ESMValTool/pull/666 damn, hope the PR number doesn't do any harm :laughing:

mattiarighi commented 5 years ago

Thanks to the Amazing @valeriupredoi :tm:, we now have a cmorizer for the observations in version 2 :applause:

It can be executed by:

cmorize_obs -c config-user.yml -o [DATASET1,DATASET2,...]

For each of the given datasets, it will look for the data to be cmorized in [RAWOBS]/Tier[1-2]/DATASET, where [RAWOBS] is a path given in config-user.yml, and apply the corresponding cmorizer script in esmvaltool/utils/cmorizers/obs/ (NCL or Python). The cmorized output will be saved in [output_dir]/cmorize_obs_YYYYMMDD_HHMMSS/Tier[1-2]/DATASET.
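
The directory lookup described above might be sketched like this. The function names are hypothetical, the tier numbers are a parameter defaulting to the range quoted above, and the timestamp suffix is assumed to be the date followed by the time of day.

```python
import os
from datetime import datetime


def find_raw_dir(rawobs, dataset, tiers=(1, 2)):
    """Return (tier, path) for the first [rawobs]/Tier[N]/dataset found."""
    for tier in tiers:
        path = os.path.join(rawobs, "Tier{}".format(tier), dataset)
        if os.path.isdir(path):
            return tier, path
    raise IOError("no raw data found for " + dataset)


def output_dir(base, tier, dataset, now=None):
    """Build [base]/cmorize_obs_<timestamp>/Tier[N]/dataset."""
    now = now or datetime.now()
    run_dir = "cmorize_obs_" + now.strftime("%Y%m%d_%H%M%S")
    return os.path.join(base, run_dir, "Tier{}".format(tier), dataset)
```

Keeping the tier in the output path mirrors the input layout, so the cmorized data can be moved into place without renaming anything.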

At present, two cmorizers are included, AURA-TES and ESACCI-LANDCOVER; more will follow (priority will be given to the datasets used by the recipes already ported to v2).

valeriupredoi commented 5 years ago

for Python cmorizer scripts one needs to build a cmorization(in_dir, out_dir) function that takes input and output dirs as args; we can extend that when we start working on Python cmorizer scripts; cheers @mattiarighi for making me a Trade Mark :grin:
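
A rough sketch of what such a Python cmorizer and its caller could look like. Only the cmorization(in_dir, out_dir) signature comes from the comment above; the file copy is a placeholder for the real reformatting, and run_python_cmorizer is a hypothetical helper, not the actual driver.

```python
import importlib.util
import os
import shutil


def cmorization(in_dir, out_dir):
    """Placeholder cmorizer: a real one would fix metadata before saving."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(in_dir)):
        if name.endswith(".nc"):
            # Stand-in for the actual reformatting step.
            shutil.copy(os.path.join(in_dir, name),
                        os.path.join(out_dir, name))
            written.append(name)
    return written


def run_python_cmorizer(script_path, in_dir, out_dir):
    """Load a cmorizer script by path and call its cmorization() function."""
    spec = importlib.util.spec_from_file_location("cmorizer", script_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.cmorization(in_dir, out_dir)
```

Loading the script by path means every dataset can ship its own file, as long as that file defines a function with this name and signature.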

mattiarighi commented 5 years ago

Proposed workflow for porting a cmorization script from v1:

create a branch
make sure that the raw data source reported in the script header is still active
download the raw data again (there are often updates or time-coverage extensions)
adapt the script to v2 and test it
make sure the cmorized data are read in by v2 without errors (use a simple recipe without preprocessor and without diag).
open PR and wait for approval :innocent:
once merged check the dataset(s) in the list below

These are the observations used in the recipes currently available in version2_development. Once the cmorizers for these datasets are available, this issue can be closed:

Tier 2

[x] ESACCI-AEROSOL
[x] ESACCI-CLOUD
[x] ESACCI-FIRE
[x] ESACCI-LANDCOVER
[x] ESACCI-OZONE
[x] ESACCI-SOILMOISTURE - v4.2 available (replacing previous v2.2)
[x] ESACCI-SST
[x] HadCRUT - renamed to HadCRUT3, some minor differences due to updated input data
[x] HadCRUT4 - PR #860, time range extended to 2018, some minor differences in Oct.-Dec. 2017
[x] HadISST - time range extended to 2017
[x] NCEP - time range extended to 2018
[x] PATMOS - PR #861, renamed to PATMOS-x; time range extended to 2016; resolution much higher (0.1°)
[x] WOA - PR #834

Tier 3

[x] AURA-TES - newer version available
[x] CERES-SYN1deg
[x] ERA-Interim - data updated and time-series extended to 2018 (very small differences w.r.t. v1 data)
[x] MODIS - data updated (some differences in all variables), fixed bug in lwpStderr
[x] NIWA - renamed to NIWA-BS and updated to v3.3Patched (significant differences in the NH polar regions w.r.t. v1 data)
[x] UWisc - PR #861

valeriupredoi commented 5 years ago

@mattiarighi that list looks so pretty :grin: It would be worth running your new cmorization checking recipe with iris 2 via #832 given you found that BNU-ESM inconsistency that is in fact a problem in iris

mattiarighi commented 5 years ago

Good point!

ESMValGroup / ESMValTool

Integrate reformat scripts for the observations in the new framework #232