ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Add support for the CREATE-IP project #1652

Open bouweandela opened 2 years ago

bouweandela commented 2 years ago

The CREATE-IP project is the follow-up to the ana4MIPs project. I think it would be useful to add support for it in ESMValCore, as this would allow, for example, automatically downloading reanalysis datasets that require no further CMORization.

Example integration of CREATE-IP in ESMValCore

A config-developer.yml entry could look something like this:

'CREATE-IP':
  cmor_strict: false
  input_dir:
    default: '{project}/{product}/{dataset}/{realm}/{frequency}/{latestversion}'
    ESGF: '{project}/{product}/{dataset}/{realm}/{frequency}/{latestversion}'
  input_file:
    default: '{short_name}_{mip}_{product}*.nc'
    ESGF: '{short_name}_{mip}_{product}*.nc'
  output_file: '{project}_{product}_{dataset}_{frequency}_{short_name}'
  cmor_type: 'CMIP5'

It may also be necessary to add model, though.
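
As a rough illustration of how these templates would resolve, the sketch below fills in the proposed input_dir and input_file patterns with plain string formatting. The `expand` helper and the facet values are assumptions chosen for illustration; this is not how ESMValCore resolves paths internally, and `{latestversion}` is simply given a fixed value here.

```python
from fnmatch import fnmatch


def expand(template, facets):
    """Hypothetical helper (not part of ESMValCore): fill a template with facets."""
    return template.format(**facets)


# Assumed facet values for a single monthly snow depth (snd) dataset.
facets = {
    'project': 'CREATE-IP',
    'product': 'reanalysis',
    'dataset': 'CFSR',
    'realm': 'landIce',
    'frequency': 'mon',
    'latestversion': 'v20200607',  # assumed; normally determined from the directory listing
    'short_name': 'snd',
    'mip': 'LImon',
}

input_dir = expand(
    '{project}/{product}/{dataset}/{realm}/{frequency}/{latestversion}',
    facets,
)
input_file = expand('{short_name}_{mip}_{product}*.nc', facets)

print(input_dir)
print(input_file)

# The resulting glob pattern should match a file name of this shape.
print(fnmatch('snd_LImon_reanalysis_CFSR_197901-201912.nc', input_file))
```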

A difficulty is that there are apparently 3 different DRS entries used in this project:

{
    '%(root)s/%(project)s/%(product)s/%(institute)s/%(source_id)s/%(experiment)s/%(time_frequency)s/%(realm)s/%(variable)s': 7,
    '%(root)s/%(project)s/%(product)s/%(institute)s/%(model)s/%(source_id)s/%(time_frequency)s/%(realm)s/%(variable)s': 23,
    '%(root)s/%(project)s/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(time_frequency)s/%(realm)s/%(variable)s': 80,
}
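
To make the differences between these three templates easier to see, here is a minimal sketch that extracts the facet names from each template and lists the ones that are not shared by all three. The `facet_names` helper is hypothetical, and the `root` placeholder in the templates is only illustrative.

```python
import re

# DRS templates and their counts, copied from the listing above.
templates = {
    '%(root)s/%(project)s/%(product)s/%(institute)s/%(source_id)s/%(experiment)s/%(time_frequency)s/%(realm)s/%(variable)s': 7,
    '%(root)s/%(project)s/%(product)s/%(institute)s/%(model)s/%(source_id)s/%(time_frequency)s/%(realm)s/%(variable)s': 23,
    '%(root)s/%(project)s/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(time_frequency)s/%(realm)s/%(variable)s': 80,
}


def facet_names(template):
    """Return the facet names in a '%(name)s'-style template, in order."""
    return re.findall(r'%\((\w+)\)s', template)


# Facets that occur in every template.
common = set.intersection(*(set(facet_names(t)) for t in templates))

total = sum(templates.values())
differing = {}
for template, count in templates.items():
    # Keep only the facets that are not shared by all templates.
    differing[count] = [n for n in facet_names(template) if n not in common]
    print(f"{count:3d} of {total}: {differing[count]}")
```

Each template differs from the others only in which two of model, source_id, and experiment it contains.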

Data finding

from esmvalcore.esgf import find_files
from esmvalcore.esgf.facets import FACETS, DATASET_MAP

FACETS['CREATE-IP'] = {
    'dataset': 'source_id',
    'frequency': 'time_frequency',
    'model': 'model',
    'product': 'product',
    'realm': 'realm',
    'short_name': 'variable',
}
DATASET_MAP['CREATE-IP'] = {}

find_files(project='CREATE-IP', short_name='tas', dataset='CREATE-MRE', frequency='mon', model='JRA-55')
# Result:
# [ESGFFile:CREATE-IP/MREreanalysis/JMA/JRA-55/CREATE-MRE/atmos/mon/v20200609/tas_Amon_MREreanalysis_JRA-55_198001-201512.nc on hosts ['esgf.nccs.nasa.gov']]

find_files(project='CREATE-IP', short_name='tas', dataset='MERRA2', frequency='mon', product='MRE2reanalysis')
# Result:
# [ESGFFile:CREATE-IP/MRE2reanalysis/NASA-GMAO/GEOS-5/MERRA2/atmos/mon/v20200613/tas_Amon_MRE2reanalysis_MERRA2_198001-201712.nc on hosts ['esgf.nccs.nasa.gov']]

find_files(project='CREATE-IP', short_name='snd', dataset='CFSR', realm='landIce')
# Result:
# [ESGFFile:CREATE-IP/reanalysis/NOAA-NCEP/CFSR/landIce/mon/v20200607/snd_LImon_reanalysis_CFSR_197901-201912.nc on hosts ['esgf.nccs.nasa.gov']]
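
The FACETS entry above maps ESMValCore facet names to ESGF search facet names. A minimal sketch of that renaming step is given below; `to_esgf_facets` is a hypothetical helper written for illustration and not the actual ESMValCore implementation.

```python
# Mapping from ESMValCore facet names to ESGF facet names, as proposed above.
CREATE_IP_FACETS = {
    'dataset': 'source_id',
    'frequency': 'time_frequency',
    'model': 'model',
    'product': 'product',
    'realm': 'realm',
    'short_name': 'variable',
}


def to_esgf_facets(our_facets):
    """Hypothetical helper: rename ESMValCore facets to ESGF search facets."""
    esgf_facets = {'project': our_facets['project']}
    for name, value in our_facets.items():
        # Facets without an ESGF counterpart are ignored.
        if name in CREATE_IP_FACETS:
            esgf_facets[CREATE_IP_FACETS[name]] = value
    return esgf_facets


query = to_esgf_facets({
    'project': 'CREATE-IP',
    'short_name': 'tas',
    'dataset': 'CREATE-MRE',
    'frequency': 'mon',
    'model': 'JRA-55',
})
print(query)
```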

valeriupredoi commented 2 years ago

Yes, let's!

A difficulty is that there are apparently 3 different DRS entries used in this project:

What do the numbers (the values in the dict) denote? Also, I'd imagine source_id is equivalent to model? That would make it two DRSs, or actually one, because we can map source_id to model (somehow).

bouweandela commented 2 years ago

what are the numbers (values in the dict) denoting?

I think it is the number of records on ESGF with the same dataset_id that use that DRS. Here is the code to get this info with esgf-pyclient:

from pyesgf.search import SearchConnection
conn = SearchConnection('https://esgf-data.dkrz.de/esg-search',
                        distrib=True)

ctx = conn.new_context(project='CREATE-IP', facets='directory_format_template_')
dict(ctx.facet_counts)['directory_format_template_']

Also, I'd imagine source_id is equivalent to model?

It seems this can be different. For example, the ERA5 dataset is produced using the CY41R2 IFS model, if I understand it correctly.