Open bouweandela opened 8 months ago
TODO: finish typing this
:rofl:
Thank you very much Bouwe for writing this up, this looks really great! I am really looking forward to these changes ❤️
I have three comments:
/etc/esmvaltool/. This would allow us to add default configurations for commonly used HPC systems like Levante, and users would not have to mess around with configuration on these machines.

    preprocessors:  # NOT arbitrary
      regrid:  # arbitrary
        regrid:  # NOT arbitrary
          target_grid: 2x2  # NOT arbitrary
          scheme: linear  # NOT arbitrary
Same goes for the definition of diagnostic scripts, variables, etc. My suggestion would be to avoid patterns like
    projects:
      CMIP6:
        data:
          CMIP6-local:
            type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
            path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
            dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
            filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
          CMIP6-ESGF:
            path: ~/climate_data
            dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
            filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
and use something like
    projects:
      - name: CMIP6
        data:
          - type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
            path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
            dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
            filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
          - path: ~/climate_data
            dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
            filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
instead. So basically, keys can only have fixed (and preferably well-documented) values.
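One practical benefit of fixed keys is that a configuration can be checked against a known set of names. Below is a minimal sketch of such a check. It assumes that the four keys used in the example above (type, path, dirname, filename) are the complete allowed set, which the discussion does not actually state. The function name and the shortened paths and templates are made up for illustration.

```python
# Assumed set of allowed keys, taken from the example above (not an official schema).
ALLOWED_SOURCE_KEYS = {"type", "path", "dirname", "filename"}


def find_unknown_keys(projects: list[dict]) -> list[str]:
    """Return a message for every data source key that is not in the allowed set."""
    problems = []
    for project in projects:
        for index, source in enumerate(project.get("data", [])):
            for key in sorted(set(source) - ALLOWED_SOURCE_KEYS):
                problems.append(
                    f"{project.get('name', '?')}: data[{index}]: unknown key '{key}'"
                )
    return problems


# Illustrative input in the list-based layout; the second entry contains a typo.
projects = [
    {
        "name": "CMIP6",
        "data": [
            {"path": "~/climate_data", "filename": "{short_name}_*.nc"},
            {"path": "~/other_data", "filname": "{short_name}_*.nc"},
        ],
    }
]
problems = find_unknown_keys(projects)
```

With user-chosen keys such as CMIP6-local, a check like this cannot tell a typo from an intentional name at that level of the file, which is the core of the argument for fixed keys.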
The general structure is marvellous! As @schlunma points out, I think the devil's in the details. Take, for instance, the basic "Joe the Scientist" user config file: I would add two sections and tweak a couple of things:
    [general]
    output_dir: ~/esmvaltool_output
    auxiliary_data_dir: ~/auxiliary_data
    max_parallel_tasks: null
    log_level: info
    remove_preproc_dir: true
    config_developer_file: null
    [data]
    rootpath: ~/climate_data
    directory-structure-type:  # not DRS nor Data Reference Syntax - that is a terribly confusing, highly technical term
      CMIP3: ESGF
      CMIP5: ESGF
      CMIP6: ESGF
      CORDEX: ESGF
      obs4MIPs: ESGF
What I would also propose (and I can help with that!) is that we create checklists for a number of common issues. Our configuration is becoming very complex; in aviation, where a plane has a very complex configuration, they have checklists in case of faults, and we could do that too. E.g. data is not found: check this config file and that config file, these values for those keys, etc.
I like the idea of a check list V! Very useful to cut down the time it takes to spot what I've missed.
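As an illustration of what one "data is not found" checklist item could look like when automated, here is a rough sketch. It assumes a data source is a dictionary with path, dirname and filename entries as in the examples earlier in this thread, and that the templates are plain Python format strings. The function name `diagnose_source` is made up, and this is not how ESMValCore actually resolves files.

```python
from pathlib import Path


def diagnose_source(source: dict, facets: dict) -> list[str]:
    """Return human-readable findings for one data source and one set of facets."""
    try:
        # Fill the templates with the facets of the dataset being searched for.
        dirname = source["dirname"].format(**facets)
        filename = source["filename"].format(**facets)
    except KeyError as error:
        return [f"facet {error} is used in a template but was not provided"]
    directory = Path(source["path"]).expanduser() / dirname
    if not directory.is_dir():
        return [f"directory does not exist: {directory}"]
    if not any(directory.glob(filename)):
        return [f"no files match '{filename}' in {directory}"]
    return []
```

An empty list means the source looks fine for those facets. Otherwise the first failing step is reported, in the order a person working through a checklist would check them.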
Suggestion from @k-a-webb: Add a command line option to show the final merged configuration, e.g.
esmvaltool config show
This will make it easier to find potential problems arising from merging the different configuration files.
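A rough sketch of how such a subcommand could be wired up is shown below. It uses argparse from the standard library purely as a stand-in and does not reflect how the esmvaltool command line is actually implemented. The function names are hypothetical, and the merged configuration is passed in as a plain dictionary.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical `config show` subcommand."""
    parser = argparse.ArgumentParser(prog="esmvaltool")
    commands = parser.add_subparsers(dest="command", required=True)
    config = commands.add_parser("config")
    config_commands = config.add_subparsers(dest="config_command", required=True)
    config_commands.add_parser("show", help="print the final merged configuration")
    return parser


def render_config(merged: dict) -> str:
    """Render the merged configuration with sorted keys so the output is stable."""
    return json.dumps(merged, indent=2, sort_keys=True)


def main(argv: list[str], merged: dict) -> str:
    """Parse argv and return the text that `config show` would print."""
    args = build_parser().parse_args(argv)
    if (args.command, args.config_command) == ("config", "show"):
        return render_config(merged)
    raise SystemExit(2)
```

Sorting the keys keeps the output identical between runs, so two users can diff their merged configurations when comparing setups.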
Flagging @charles-turner-1
Here is the design for the new way of organizing our configuration files, that we decided on at the last workshop. We will use a simplified version of the Dask configuration system. That means that the configuration will consist of nested dictionaries that are organized per component of the tool. There is a default configuration shipped with the ESMValCore and ESMValTool, that will be updated with the configuration specified by the user through one or more YAML files in a user-specified directory.
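The "defaults updated by user files" behaviour can be sketched as a recursive dictionary merge. This is a minimal illustration in plain Python, not the ESMValCore or Dask implementation. The function name `merge_configs` and the example keys (in particular the `logging` section) are made up for the sketch.

```python
from copy import deepcopy


def merge_configs(*configs: dict) -> dict:
    """Merge nested dicts; later configs take precedence over earlier ones."""
    merged: dict = {}
    for config in configs:
        for key, value in config.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are mappings: merge them key by key.
                merged[key] = merge_configs(merged[key], value)
            else:
                # Scalars, lists, and new keys: the later value wins.
                merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults shipped with the tool.
defaults = {
    "output_dir": "~/esmvaltool_output",
    "logging": {"log_level": "info", "log_progress": False},
}
# Hypothetical user file that overrides a single nested key.
user = {"logging": {"log_level": "debug"}}

config = merge_configs(defaults, user)
```

Here the user only states what differs from the defaults: `log_level` becomes `debug`, while `output_dir` and `log_progress` keep their default values. Neither input dictionary is modified.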
The configuration will be stored in arbitrarily named files in a directory, e.g. ~/.config/esmvaltool. Users can decide whether they want to use multiple files or keep everything in one file. This directory will be configurable from the command line, or by using an environment variable, or possibly from the Python API as well. Having a configuration that is merged from multiple files into a single dictionary, like Dask has, will make it easy for us to provide a command like esmvaltool config that will create relevant example configuration files for the user, instead of a single large configuration file with commented-out details.
For a smooth transition, we will keep supporting the existing configuration key: value pairs in config-user.yml, but add new ones as well.
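To make the directory lookup concrete, here is a small sketch of how the configuration directory could be chosen and its files collected. Several details are assumptions rather than part of the design: the environment variable name `ESMVALTOOL_CONFIG_DIR`, the precedence order (command line, then environment, then default), and reading files in alphabetical order.

```python
import os
from pathlib import Path

# Hypothetical names: the design does not fix the variable name or the default.
ENV_VAR = "ESMVALTOOL_CONFIG_DIR"
DEFAULT_DIR = Path("~/.config/esmvaltool")


def resolve_config_dir(cli_value=None, environ=None):
    """Pick the config directory: command line first, then environment, then default."""
    environ = os.environ if environ is None else environ
    chosen = cli_value or environ.get(ENV_VAR) or DEFAULT_DIR
    return Path(chosen).expanduser()


def list_config_files(config_dir):
    """Return the YAML files in config_dir in a stable (alphabetical) order."""
    config_dir = Path(config_dir)
    if not config_dir.is_dir():
        return []
    return sorted(
        path for path in config_dir.iterdir()
        if path.suffix in {".yml", ".yaml"} and path.is_file()
    )
```

A fixed file order matters once several files set the same key, because the order then decides which value wins in the merge.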
Extensive example
Below is an example of what the future configuration file(s) could look like. Note that this is very extensive to show all possibilities; real users would very likely need something much smaller.
config.yml (from current config-user.yml)
dask.yml (from current dask.yml, see #2040 and #2369)
esgf-pyclient.yml (from current config-user.yml and esgf-pyclient.yml)
data-dkrz.yml
This would replace rootpath and drs in config-user.yml and the related input_dir and input_file in config-developer.yml. This has the advantage that all information is available in one place, making it easier to understand. See https://github.com/ESMValGroup/ESMValCore/pull/1894 for previous discussion. The format is also extensible, to add support for e.g. intake-esm or intake-esgf (see next example).
data-intake.yml (example of how intake-esm could be configured #31)
projects.yml (would replace the CMOR table related and output_file settings in config-developer.yml)
extra_facets.yml (see esmvalcore/config/extra_facets for defaults)
references.yml (see config-references.yml for current defaults, this would finally make it easier to avoid #28)
esmvaltool.yml (from current config-user.yml with potential future diagnostics package specification)
Simple example
This example shows what my current configuration on my laptop would look like in the new format
config.yml
data-esgf.yml
data-obs.yml
dask.yml
Compute cluster example
On a compute cluster, e.g. Levante, the simple example above would be extended with an extra data sources file:
data-levante.yml
About replacing the rootpath, drs, input_dir, and input_file settings
I realize that the way to specify rootpath/dirname/filename in the above examples looks more complicated than what we currently have. What I like about it is that it is explicit and simple: there is no longer a need to find out about the 'hidden' config-developer.yml file to understand what this is actually doing, and there is no longer the complicating factor that there is a lot of magic going on (is this setting a string or a list, what does default mean?), and I think that will benefit new users. See also https://github.com/ESMValGroup/ESMValCore/pull/1894#issuecomment-1428667217 for previous discussions on the topic.
Timeline for implementation
To set the expectations: this design is intended as a long-term strategy that can give guidance when making smaller improvements to the tool, not something that can immediately be implemented. Currently, no member of the @ESMValGroup/technical-lead-development-team has a funded proposal in which a large task like this could be taken on.
Ideas welcome
@ESMValGroup/esmvaltool-developmentteam If you have ideas on how to make this better, please share them in a comment below or at one of the community meetings.