New configuration file plan

bouweandela commented 8 months ago

Here is the design for the new way of organizing our configuration files, which we decided on at the last workshop. We will use a simplified version of the Dask configuration system. That means that the configuration will consist of nested dictionaries that are organized per component of the tool. A default configuration is shipped with ESMValCore and ESMValTool; it will be updated with the configuration specified by the user through one or more YAML files in a user-specified directory.
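As a rough illustration of "nested dictionaries updated with the user configuration", here is a minimal sketch of a recursive merge. The function name `merge_config` and the default values are made up for this example; this is not the ESMValCore or Dask implementation.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict in which `overrides` is merged recursively into `defaults`."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: descend instead of replacing wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists from the user replace the default value.
            merged[key] = deepcopy(value)
    return merged


# Illustrative defaults, not the real shipped defaults.
defaults = {
    "log_level": "info",
    "esgf": {"search_esgf": "never", "download_dir": "~/climate_data"},
}
user = {"esgf": {"search_esgf": "when_missing"}, "max_parallel_tasks": 1}

config = merge_config(defaults, user)
print(config["esgf"])
```

Only `esgf.search_esgf` is overridden here; `esgf.download_dir` and `log_level` keep their default values, and the `defaults` dictionary itself is left unmodified.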

The configuration will be stored in arbitrarily named files in a directory, e.g. ~/.config/esmvaltool. Users can decide whether they want to use multiple files or keep everything in one file. This directory will be configurable from the command line, or by using an environment variable, or possibly from the Python API as well. Having a configuration that is merged from multiple files into a single dictionary, like Dask has, will make it easy for us to provide a command like esmvaltool config that will create relevant example configuration files for the user, instead of a single large configuration file with commented-out details.
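A minimal sketch of how the configuration directory could be located. The environment variable name `ESMVALTOOL_CONFIG_DIR`, the precedence order (command line, then environment, then default), and both function names are assumptions made for this example; none of them is decided in this proposal.

```python
import os
from pathlib import Path

# Both names are assumptions for illustration, not settled design.
ENV_VAR = "ESMVALTOOL_CONFIG_DIR"
DEFAULT_DIR = "~/.config/esmvaltool"


def find_config_dir(cli_value=None, environ=None) -> Path:
    """Pick the config directory: command line first, then environment, then default."""
    environ = os.environ if environ is None else environ
    chosen = cli_value or environ.get(ENV_VAR) or DEFAULT_DIR
    return Path(chosen).expanduser()


def list_config_files(config_dir) -> list:
    """Return the YAML files in `config_dir` in a deterministic (sorted) order."""
    return sorted(p for p in Path(config_dir).glob("*.yml") if p.is_file())


print(find_config_dir("/tmp/my-config", {ENV_VAR: "/tmp/from-env"}))
print(find_config_dir(None, {ENV_VAR: "/tmp/from-env"}))
```

Sorting the file names keeps the merge order reproducible, which matters as soon as two files set the same key.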

For a smooth transition, we will keep supporting the existing configuration key: value pairs in config-user.yml, but add new ones as well.

Extensive example

Below is an example of what the future configuration file(s) could look like. Note that this is very extensive in order to show all possibilities; real users would very likely need something much smaller.

config.yml (from current config-user.yml)

output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: null
log_level: info
remove_preproc_dir: true

# This could be replaced by the section `data` under `projects` (see below) in the future
config_developer_file: null
rootpath:
  default: ~/climate_data
drs:
  CMIP3: ESGF
  CMIP5: ESGF
  CMIP6: ESGF
  CORDEX: ESGF
  obs4MIPs: ESGF

dask.yml (from current dask.yml, see #2040 and #2369)

dask:
  client:
  run: compute  # Start the `compute` cluster defined below
  clusters:
    local:
      type: distributed.LocalCluster
      n_workers: 2
      threads_per_worker: 2
      memory_limit: 4GiB
    compute:
      type: dask_jobqueue.SLURMCluster
      queue: compute
      account: bk1088
      cores: 64
      memory: 4GiB
      processes: 32
      interface: ib0
      local_directory: "/scratch/b/b381141/dask-tmp"
      n_workers: 32
    basic:
      type: default
      scheduler: threaded
      num_workers: 4
    debug:
      type: default
      scheduler: single-threaded
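To make the role of the `run` key concrete, here is a minimal sketch of how the selected cluster entry could be looked up and split into a class name and keyword arguments. The helper names `select_cluster` and `load_class` are hypothetical, and how `type: default` would be handled is deliberately left out.

```python
import importlib


def select_cluster(dask_config: dict):
    """Return (type_name, keyword_arguments) for the cluster named by `run`."""
    name = dask_config["run"]
    try:
        settings = dict(dask_config["clusters"][name])
    except KeyError:
        raise ValueError(
            f"`run: {name}` does not match any entry under `clusters`"
        ) from None
    cluster_type = settings.pop("type")
    return cluster_type, settings


def load_class(dotted_name: str):
    """Import `package.module.ClassName` and return the class object."""
    module_name, _, class_name = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


dask_config = {
    "run": "local",
    "clusters": {
        "local": {
            "type": "distributed.LocalCluster",
            "n_workers": 2,
            "threads_per_worker": 2,
            "memory_limit": "4GiB",
        },
        "debug": {"type": "default", "scheduler": "single-threaded"},
    },
}

cluster_type, kwargs = select_cluster(dask_config)
print(cluster_type, kwargs)
```

With `distributed` installed, `load_class(cluster_type)(**kwargs)` would then create the cluster; that call is omitted so the sketch runs without extra dependencies.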

esgf-pyclient.yml (from current config-user.yml and esgf-pyclient.yml)

esgf:
  search_esgf: when_missing
  download_dir: ~/climate_data
  search_connection:
    expire_after: 2592000  # the number of seconds in a month
    urls:
      - 'https://esg-dn1.nsc.liu.se/esg-search'
      - 'https://esgf.ceda.ac.uk/esg-search'
      - 'https://esgf-data.dkrz.de/esg-search'
      - 'https://esgf-node.llnl.gov/esg-search'
  logon:
    hostname: "esgf-data.dkrz.de"
    username: "cookiemonster"
    password: "Welcome01"

data-dkrz.yml

This would replace rootpath and drs in config-user.yml (and the related input_dir and input_file in config-developer.yml). This has the advantage that all information is available in one place, making it easier to understand. See https://github.com/ESMValGroup/ESMValCore/pull/1894 for previous discussion. The format is also extensible, to add support for e.g. intake-esm or intake-esgf (see next example).

projects:
  CMIP6:
    data:
      CMIP6-local:
        type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
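A minimal sketch of how `path`, `dirname` and `filename` could be combined into a glob pattern for one dataset. The function `build_glob` and the choice to turn unknown facets into `*` are assumptions for illustration, not the behaviour of `esmvalcore.local`; the facet values are just an example.

```python
from pathlib import PurePosixPath


class _Wildcard(dict):
    """Mapping that returns '*' for facets that are not specified."""

    def __missing__(self, key):
        return "*"


def build_glob(source: dict, facets: dict) -> str:
    """Fill the `dirname` and `filename` templates of one data source with facets."""
    values = _Wildcard(facets)
    dirname = source["dirname"].format_map(values)
    filename = source["filename"].format_map(values)
    return str(PurePosixPath(source["path"]) / dirname / filename)


source = {
    "path": "/work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ",
    "dirname": "{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}",
    "filename": "{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc",
}
facets = {
    "activity": "CMIP",
    "institute": "MPI-M",
    "dataset": "MPI-ESM1-2-LR",
    "exp": "historical",
    "ensemble": "r1i1p1f1",
    "mip": "Amon",
    "short_name": "tas",
    "grid": "gn",
}

pattern = build_glob(source, facets)
print(pattern)
```

Since `version` is not given, it becomes `*` in the resulting pattern. Modifiers such as `{project.lower}` would need custom handling, because Python's `str.format` treats `.lower` as an attribute lookup rather than calling the method.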

data-intake.yml (example of how intake-esm could be configured #31)

projects:
  CMIP6:
    data:
      CMIP6-intake-esm:
        type: esmvalcore.intake.IntakeDataSource
        file: '/pool/data/Catalogs/levante-cmip6.json'
        facets:
          # mapping from recipe facets to intake-esm catalog facets
          activity: activity_id
          dataset: source_id
          ensemble: member_id
          exp: experiment_id
          grid: grid_label
          institute: institution_id
          mip: table_id
          short_name: variable_id
          version: version
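The `facets` mapping above could be applied roughly as follows. This sketch only translates the keys; the function name `to_catalog_query` is hypothetical and no intake-esm call is made.

```python
def to_catalog_query(facet_map: dict, recipe_facets: dict) -> dict:
    """Rename recipe facets to the catalog's column names, dropping unmapped ones."""
    return {
        facet_map[name]: value
        for name, value in recipe_facets.items()
        if name in facet_map
    }


facet_map = {
    "dataset": "source_id",
    "exp": "experiment_id",
    "mip": "table_id",
    "short_name": "variable_id",
}
recipe_facets = {
    "dataset": "MPI-ESM1-2-LR",
    "exp": "historical",
    "mip": "Amon",
    "short_name": "tas",
    "timerange": "2000/2010",  # no catalog equivalent in this mapping
}

query = to_catalog_query(facet_map, recipe_facets)
print(query)
```

Facets without a catalog equivalent (here `timerange`) are dropped from the query and would have to be handled after the catalog search.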

projects.yml (would replace the CMOR table related and output_file settings in config-developer.yml)

projects:
  CMIP6:
    cmor_table:
      strict: true
      type: 'CMIP6'
    output_file: '{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}_{grid}'
  XYZ_project:
    # Example of a custom project
    cmor_table:
      strict: false
      type: CMIP6
      path: /path/to/CMOR_table/
      default_table_prefix: XYZ_
    output_file: '{project}_{dataset}_{short_name}'

extra_facets.yml (see esmvalcore/config/extra_facets for defaults)

projects:
  CMIP5:
    extra_facets:
      'ACCESS1-0':
        '*':
          '*':
            institute: ['CSIRO-BOM']
      'ACCESS1-3':
        '*':
          '*':
            institute: ['CSIRO-BOM']
      'bcc-csm1-1':
        '*':
          '*':
            institute: ['BCC']
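A minimal sketch of how the nested dataset / mip / short_name levels with `*` wildcards could be resolved for one variable. The function `get_extra_facets` written here is an illustration using `fnmatch`, not the ESMValCore implementation.

```python
from fnmatch import fnmatchcase


def get_extra_facets(extra_facets: dict, dataset: str, mip: str, short_name: str) -> dict:
    """Collect the facets of every dataset/mip/short_name pattern that matches."""
    result = {}
    for dataset_pattern, mips in extra_facets.items():
        if not fnmatchcase(dataset, dataset_pattern):
            continue
        for mip_pattern, variables in mips.items():
            if not fnmatchcase(mip, mip_pattern):
                continue
            for var_pattern, facets in variables.items():
                if fnmatchcase(short_name, var_pattern):
                    # Later matches overwrite earlier ones for the same facet.
                    result.update(facets)
    return result


cmip5_extra_facets = {
    "ACCESS1-0": {"*": {"*": {"institute": ["CSIRO-BOM"]}}},
    "ACCESS1-3": {"*": {"*": {"institute": ["CSIRO-BOM"]}}},
    "bcc-csm1-1": {"*": {"*": {"institute": ["BCC"]}}},
}

print(get_extra_facets(cmip5_extra_facets, "bcc-csm1-1", "Amon", "tas"))
```

A dataset that matches no pattern simply yields an empty dictionary.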

references.yml (see config-references.yml for current defaults, this would finally make it easier to avoid #28)

references:
  citation_dir: ~/ESMValTool/esmvaltool/references
  authors:
    andela_bouwe:
      name: Andela, Bouwe
      institute: NLeSC, Netherlands
      email: b.andela@esciencecenter.nl
      orcid: https://orcid.org/0000-0001-9005-8940
      github: bouweandela
    schlund_manuel:
      name: Schlund, Manuel
      institute: DLR, Germany
      email: manuel.schlund@dlr.de
      orcid: https://orcid.org/0000-0001-5251-0158
      github: schlunma

esmvaltool.yml (from current config-user.yml with potential future diagnostics package specification)

diagnostics:
  package: esmvaltool
  package_path: ~/ESMValTool
  output_file_type: png

Simple example

This example shows what the current configuration on my laptop would look like in the new format.

config.yml

output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: 1

data-esgf.yml

esgf:
  download_dir: ~/climate_data
  search_esgf: always
  search_connection:
    expire_after: 864000 # 10 days
    urls:
      - 'https://esg-dn1.nsc.liu.se/esg-search'
      - 'https://esgf-data.dkrz.de/esg-search'

projects:
  CMIP6:
    data:
      CMIP6-ESGF:
        path: ~/climate_data
        dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
  CMIP5:
    data:
      CMIP5-ESGF:
        path: ~/climate_data
        dirname: '{project.lower}/{product}/{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
  obs4MIPs:
    data:
      obs4MIPs-ESGF:
        path: ~/climate_data
        dirname: '{project}/{dataset}/{version}'
        filename: '{short_name}_*.nc'

data-obs.yml

projects:
  native6:
    data:
      native6-local:
        path: ~/climate_data
        dirname: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
        filename: '*.nc'

dask.yml

dask:
  run: local
  clusters:
    local:
      type: distributed.LocalCluster
      n_workers: 2
      threads_per_worker: 2
      memory_limit: 4GiB
    basic:
      type: default
      scheduler: threaded
      num_workers: 2
    debug:
      type: default
      scheduler: single-threaded

Compute cluster example

On a compute cluster, e.g. Levante, the simple example above would be extended with an extra data sources file:

data-levante.yml


projects:
  CMIP6:
    data:
      CMIP6-levante:
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
  CMIP5:
    data:
      CMIP5-levante:
        path: /work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ
        dirname: '{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}/{short_name}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
  native6:
    data:
      native6-levante:
        path: /work/bd0854/DATA/ESMValTool2/RAWOBS
        dirname: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
        filename: '*.nc'

About replacing the `rootpath`, `drs`, `input_dir`, and `input_file` settings

I realize that the way to specify rootpath/dirname/filename in the above examples looks more complicated than what we currently have. What I like about it is that it is explicit and simple: there is no longer a need to find out about the 'hidden' config-developer.yml file to understand what this is actually doing. There is also no longer the complicating factor that a lot of magic is going on (is this setting a string or a list, what does default mean?), and I think that will benefit new users. See also https://github.com/ESMValGroup/ESMValCore/pull/1894#issuecomment-1428667217 for previous discussions on the topic.

Timeline for implementation

To set the expectations: this design is intended as a long-term strategy that can give guidance when making smaller improvements to the tool, not something that can immediately be implemented. Currently, no member of the @ESMValGroup/technical-lead-development-team has a funded proposal in which a large task like this could be taken on.

Ideas welcome

@ESMValGroup/esmvaltool-developmentteam If you have ideas how to make this better, please share them in a comment below or at one of the community meetings.

valeriupredoi commented 8 months ago

TODO: finish typing this

:rofl:

schlunma commented 8 months ago

Thank you very much Bouwe for writing this up, this looks really great! I am really looking forward to these changes ❤️

I have three comments:

1. In addition to the default settings shipped with ESMValTool, I think it should also be possible to consider system-wide configurations, for example in /etc/esmvaltool/. This would allow us to add default configurations for commonly used HPC systems like Levante, and users would not have to mess around with configuration on these machines.
2. What would happen if a specific setting is used in multiple configuration files? It's clear that user-defined options should overwrite system-wide configuration and the default configuration, but what about the case when a user (accidentally) has two files with conflicting settings? Would we raise an error? Would we allow that and simply use the "last" file encountered (e.g., after sorting them alphabetically)?
3. I think we should avoid allowing arbitrary strings in YAML keys. In my experience, this is a major source of confusion for new(er) users of the tool. For example, in the definition of preprocessors in the recipe, it can be really confusing to find out which keys are arbitrary and which are not:

preprocessors:  # NOT arbitrary
  regrid:  # arbitrary
    regrid:  # NOT arbitrary
      target_grid: 2x2  # NOT arbitrary
      scheme: linear  # NOT arbitrary

Same goes for the definition of diagnostic scripts, variables, etc. My suggestion would be to avoid patterns like

projects:
  CMIP6:
    data:
      CMIP6-local:
        type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
      CMIP6-ESGF:
        path: ~/climate_data
        dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'

and use something like

projects:
  - name: CMIP6
    data:
      - type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
      - path: ~/climate_data
        dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'

instead. So basically, keys can only have fixed (and preferably well-documented) values.
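On the question of conflicting settings in multiple files: whichever policy is chosen (raise an error, or let the last file win), the conflicts first have to be found. A minimal sketch, assuming the files have already been loaded into dictionaries keyed by file name; the function names are hypothetical.

```python
def flatten(config: dict, prefix=()):
    """Yield (key_path, value) for every leaf in a nested dict."""
    for key, value in config.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def find_conflicts(files: dict) -> dict:
    """Map each setting that has different values in several files to its sources."""
    seen = {}
    for filename in sorted(files):
        for path, value in flatten(files[filename]):
            seen.setdefault(path, []).append((filename, value))
    return {
        ".".join(path): entries
        for path, entries in seen.items()
        if len({repr(value) for _, value in entries}) > 1
    }


files = {
    "config.yml": {"max_parallel_tasks": 1, "log_level": "info"},
    "hpc.yml": {"max_parallel_tasks": 8, "log_level": "info"},
}

conflicts = find_conflicts(files)
print(conflicts)
```

Here only `max_parallel_tasks` is reported, because `log_level` has the same value in both files. The result could be turned into an error or into a warning, depending on what is decided.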

valeriupredoi commented 7 months ago

The general structure is marvellous! Like @schlunma points out, I think the devil's in the details. Take for instance the basic, Joe-the-Scientist user config file: I would add two sections and tweak a couple of things:

[general]
output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: null
log_level: info
remove_preproc_dir: true
config_developer_file: null

[data]
rootpath: ~/climate_data
directory-structure-type:  # not DRS nor Data Reference Syntax - that is a terribly confusing, highly technical term
  CMIP3: ESGF
  CMIP5: ESGF
  CMIP6: ESGF
  CORDEX: ESGF
  obs4MIPs: ESGF

What I would also propose (and I can help with that!) is that we create checklists for a number of common issues. Our configuration is becoming very complex; in aviation, where a plane has a very complex configuration, they have checklists in case of faults. We could do that too, e.g. data is not found: check this config file and that config file, these values for those keys, etc.

tepmo42 commented 7 months ago

I like the idea of a checklist, V! Very useful to cut down the time it takes to spot what I've missed.

schlunma commented 1 week ago

Suggestion from @k-a-webb: Add a command line option to show the final merged configuration, e.g.

esmvaltool config show

This will make it easier to find potential problems arising from merging the different configuration files.
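A minimal sketch of what the output of such a command could be built from: the merged settings, each annotated with the file that set the final value. It assumes files are merged in sorted file name order with later files winning, which is only one of the options discussed in this thread; the function names are hypothetical.

```python
def _merge(target: dict, source: dict, origin: str, prefix: str = "") -> None:
    """Store every leaf of `source` in `target` under its dotted key path."""
    for key, value in source.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            _merge(target, value, origin, prefix=f"{path}.")
        else:
            target[path] = (value, origin)


def merge_with_origin(files: dict) -> dict:
    """Merge configs in sorted file name order, recording which file set each value."""
    merged = {}
    for filename in sorted(files):
        _merge(merged, files[filename], filename)
    return merged


def show(files: dict) -> str:
    """Render the merged configuration, one `key: value  # from file` line per setting."""
    merged = merge_with_origin(files)
    return "\n".join(
        f"{path}: {value!r}  # from {origin}"
        for path, (value, origin) in sorted(merged.items())
    )


files = {
    "a.yml": {"esgf": {"search_esgf": "never"}, "log_level": "info"},
    "b.yml": {"esgf": {"search_esgf": "always"}},
}
print(show(files))
```

In this example `esgf.search_esgf` is reported as coming from `b.yml`, which makes it visible that the value in `a.yml` was overridden.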

rbeucher commented 1 week ago

Flagging @charles-turner-1

ESMValGroup / ESMValCore