ECMWFCode4Earth / challenges_2024

Discover the ECMWF Code for Earth 2024 challenges

Challenge 31 - Advance user capabilities to handle data constraints when using CDSAPI #9

Open RubenRT7 opened 4 months ago

RubenRT7 commented 4 months ago


Stream 3 - Software Development for Earth Sciences applications

Goal

Create a Python library that lets users embed additional intelligence in their scripts to handle CDS dataset constraints, improving the accuracy of requests submitted via cdsapi.

Mentors and skills


Challenge description

Problem: Currently, constraints only serve users of the interactive web download form. They manage which combinations are available while the user fills in the form, guiding users towards valid requests by activating or deactivating options in the widgets. These constraints are exposed via cdsapi but are hidden from users and not documented. As a result, the CDS processes many user requests that are wrong in scope and ultimately fail. This is good neither for the users nor for the system.

Data/System to be used: This challenge only requires a Python development environment, an account on the CDS (https://cds.climate.copernicus.eu/) and cdsapi (https://cds.climate.copernicus.eu/api-how-to).

Solution: A Python library that can access the constraints definition for a given dataset via CDSAPI and decode it on the client side, allowing users to perform different actions:

Ideas for implementation: These have been introduced in the previous paragraphs. Mentors will help participants configure their accounts and cdsapi, understand the constraints definition file (JSON) and the system, provide guidance on datasets and refine the functional scope of the requirements.

The resulting library will be put in the hands of cdsapi users to give them broader visibility of the real availability of data, allowing more accurate requests. On the one hand, this will make users more efficient when accessing the system; on the other, it will reduce unnecessary request traffic to the system. This feature will extend the capabilities of the new CDS Engine and API.

cataalbu commented 2 months ago

Hi! What do the constraints look like? Also, can two datasets from the same family (like ERA5, for example) have different constraints?

ecmwf-cobarzan commented 2 months ago

Hi Catalin,

What do the constraints look like?

Constraints are represented as JSON files. In principle, a constraints file (one per dataset) is a list of dictionaries having:

Here is an example (for use in the context of this challenge only):

[
{"source": ["anthropogenic"], "version": ["latest", "v4.2"], "variable": ["acetylene", "acids", "alcohols", "ammonia", "benzene", "black_carbon", "butanes", "carbon_dioxide", "carbon_dioxide_excl_short_cycle", "carbon_monoxide", "chlorinated_hydrocarbons", "esters", "ethane", "ethene", "ethers", "formaldehyde", "hexanes", "isoprene", "ketones", "methane", "monoterpenes", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]},
{"source": ["anthropogenic"], "version": ["v2.1"], "variable": ["acetylene", "acids", "alcohols", "ammonia", "benzene", "black_carbon", "butanes", "carbon_dioxide", "carbon_monoxide", "chlorinated_hydrocarbons", "esters", "ethane", "ethene", "ethers", "formaldehyde", "hexanes", "isoprene", "ketones", "methane", "monoterpenes", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["anthropogenic"], "version": ["v2.1"], "variable": ["acetylene", "acids", "ammonia", "benzene", "black_carbon", "butanes", "carbon_dioxide", "carbon_monoxide", "chlorinated_hydrocarbons", "esters", "ethane", "ethene", "ethers", "formaldehyde", "hexanes", "isoprene", "ketones", "methane", "monoterpenes", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2003", "2004", "2005", "2006", "2007", "2008", "2009"]},
{"source": ["aviation"], "version": ["latest", "v1.1"], "variable": ["acetylene", "alcohols", "ammonia", "benzene", "black_carbon", "carbon_dioxide", "carbon_monoxide", "ethane", "ethene", "formaldehyde", "hexanes", "ketones", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]},
{"source": ["biogenic"], "version": ["latest", "v3.0", "v3.1"], "variable": ["acetaldehyde", "acetic_acid", "acetone", "alpha_pinene", "beta_pinene", "butanes_and_higher_alkanes", "butenes_and_higher_alkenes", "carbon_monoxide", "ethane", "ethanol", "ethene", "formaldehyde", "formic_acid", "hydrogen_cyanide", "isoprene", "methane", "methanol", "methyl_bromide", "methyl_chloride", "methyl_iodide", "other_aldehydes", "other_ketones", "other_monoterpenes", "propane", "propene", "sesquiterpenes", "toluene"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["biogenic"], "version": ["v1.1"], "variable": ["acetaldehyde", "acetic_acid", "acetone", "butanes_and_higher_alkanes", "butenes_and_higher_alkenes", "carbon_monoxide", "ethane", "ethanol", "ethene", "formaldehyde", "formic_acid", "hydrogen_cyanide", "isoprene", "methane", "methanol", "other_aldehydes", "other_ketones", "other_monoterpenes", "pinene", "propane", "propene", "sesquiterpenes", "toluene"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"]},
{"source": ["biogenic"], "version": ["v1.2"], "variable": ["acetaldehyde", "acetic_acid", "acetone", "butanes_and_higher_alkanes", "butenes_and_higher_alkenes", "carbon_monoxide", "ethane", "ethanol", "ethene", "formaldehyde", "formic_acid", "hydrogen_cyanide", "isoprene", "methane", "methanol", "other_aldehydes", "other_ketones", "other_monoterpenes", "pinene", "propane", "propene", "sesquiterpenes", "toluene"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["oceanic"], "version": ["latest", "v3.1"], "variable": ["bromoform", "dibromomethane", "dimethyl_sulphide", "iodomethane"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["oceanic"], "version": ["v2.1"], "variable": ["bromoform", "dibromomethane", "dimethyl_sulphide", "iodomethane"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]},
{"source": ["shipping"], "version": ["latest", "v2.1"], "variable": ["ash", "carbon_dioxide", "carbon_monoxide", "elemental_carbon", "nitrogen_oxides", "organic_carbon", "sulphate", "sulphur_oxides", "vocs_all"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]},
{"source": ["soil"], "version": ["latest", "v2.2"], "variable": ["nitrogen_oxides"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]},
{"source": ["soil"], "version": ["v1.1"], "variable": ["nitrogen_oxides"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"]},
{"source": ["termites"], "version": ["latest", "v1.1"], "variable": ["methane"], "year": ["2000"]}
]

corresponding to this dataset. Beware that constraints can evolve over time, though.

Each such dictionary (i.e. a constraint) represents a complete data cube, i.e. all possible combinations of widget values in it correspond to existing data granules. The list of all constraints covers the universe of available data for a dataset.

Take for example the last constraint in the example above (i.e. last dictionary). Source, version, variable and year are the widget names/dimensions. The available data granules are (termites, latest, methane, 2000) and (termites, v1.1, methane, 2000).
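As a minimal sketch of the data-cube reading described above, the snippet below expands one constraint into its granules and checks whether a single granule is covered by a list of constraints. The helper names `granules` and `is_available` are hypothetical and not part of cdsapi; the constraint is the last dictionary from the example.

```python
from itertools import product

# The last constraint from the example above.
constraint = {
    "source": ["termites"],
    "version": ["latest", "v1.1"],
    "variable": ["methane"],
    "year": ["2000"],
}


def granules(constraint):
    """Yield every combination of widget values in one constraint (data cube)."""
    dims = list(constraint)  # dimension names, in insertion order
    for values in product(*(constraint[d] for d in dims)):
        yield dict(zip(dims, values))


def is_available(granule, constraints):
    """Return True if some constraint has the same dimensions and contains the granule."""
    return any(
        granule.keys() == c.keys() and all(granule[d] in c[d] for d in c)
        for c in constraints
    )


for g in granules(constraint):
    print(tuple(g.values()))
# ('termites', 'latest', 'methane', '2000')
# ('termites', 'v1.1', 'methane', '2000')
```

The number of granules in a constraint is the product of the lengths of its value lists, so expanding large constraints this way is only practical for inspection or testing.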

Can two datasets from the same family (like ERA5, for example) have different constraints?

Yes. And that is typically the case, i.e. one constraint file per dataset. They can vary a lot:

If you have any other questions, please let us know. Thank you for your interest!

Have a nice day!

Petrut COBARZAN & the team

cataalbu commented 2 months ago

For example, for this dataset: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=form the API request generated from the interface is this one:

import cdsapi

c = cdsapi.Client()

c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'format': 'grib',
        'time': '00:00',
        'day': [
            '29', '30',
        ],
        'month': [
            '01', '02',
        ],
        'year': '2024',
        'pressure_level': '50',
        'variable': [
            'divergence', 'geopotential',
        ],
    },
    'download.grib')

By "Automatise the definition of a valid set of requests before submission via api." do you mean that we should split this request in two on the client side, one for {..., 'day': ['29'], 'month': ['01', '02'], ...} and one for {..., 'day': ['30'], 'month': ['01'], ...}, as the 30th of February does not exist?

ecmwf-cobarzan commented 2 months ago

Yes. That is a cleverly constructed example (that can be inferred without knowing the specific constraints for this dataset). Very good!

The initial request would be broken into (at least) these two sub-requests, which might be themselves broken into more fine-grained sub-requests (if necessary, and not necessarily in this order). The ultimate objective is to determine (and then submit for execution) a set of sub-requests for which the entire corresponding data cube is available.

Ideally, the union of this set of sub-requests would be equal to the intersection between the client's initial request/selection and the set of constraints/available data cubes. Also, the set would be pairwise disjoint (so that no data granule is covered more than once). Ultimately, the set would be as small as possible (so that we perform the minimum possible number of requests). However, the size of each individual request should be (generally) small enough so that the CDS engine does not get clogged with large requests.
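One simple way to approximate these properties is to intersect the request with each constraint in turn. The sketch below does that for the day/month example from this thread. The two constraints are hypothetical stand-ins, not the real ones for the dataset, and `intersect`, `split_request` and `granule_set` are illustrative names. The sub-requests are pairwise disjoint only if the constraints themselves are, which is assumed here. The sketch does not try to minimise the number of sub-requests or cap their size.

```python
from itertools import product


def intersect(request, constraint):
    """Return the part of `request` that lies inside `constraint`, or None if empty.

    Assumes the request and the constraint use the same dimensions.
    """
    sub = {}
    for dim, allowed in constraint.items():
        chosen = [v for v in request.get(dim, []) if v in allowed]
        if not chosen:
            return None  # nothing selected in this dimension is available
        sub[dim] = chosen
    return sub


def split_request(request, constraints):
    """One sub-request per constraint that overlaps the request."""
    subs = (intersect(request, c) for c in constraints)
    return [s for s in subs if s is not None]


def granule_set(selection):
    """All granules of a selection, as a set of hashable tuples."""
    dims = sorted(selection)
    return {tuple(zip(dims, vs)) for vs in product(*(selection[d] for d in dims))}


# Hypothetical constraints for illustration only (2024 is a leap year).
constraints = [
    {"month": ["01"], "day": ["29", "30", "31"]},
    {"month": ["02"], "day": ["29"]},
]
request = {"month": ["01", "02"], "day": ["29", "30"]}

subs = split_request(request, constraints)
print(subs)
# [{'month': ['01'], 'day': ['29', '30']}, {'month': ['02'], 'day': ['29']}]

# Check the disjointness property: no granule is covered twice.
sets = [granule_set(s) for s in subs]
assert len(set().union(*sets)) == sum(len(s) for s in sets)
```

This yields a different but equally valid split from the one proposed in the question: both cover the same three available granules with two sub-requests.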

cataalbu commented 2 months ago

Will the implementation of the solution be integrated into cdsapi?

ecmwf-cobarzan commented 2 months ago

Yes (subject to the quality of the resulting solution, of course). Development could be carried out in a fork of the repository or as a totally independent solution.

cataalbu commented 2 months ago

In this dataset, I think the "Pressure level" section is optional. How are these optional keys defined in a constraint?

ecmwf-cobarzan commented 2 months ago

In situations where the constraints/data cubes vary in terms of dimensionality, some widgets/dimensions are not required in certain selection combinations. In the example above, the pressure level is only relevant for multi-level variables. In such cases, the constraints concerning single-level variables would not contain the pressure level dimension, while the ones concerning the multi-level variables might.
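Under one possible reading of this, a dimension is relevant for a selection only if it appears in at least one constraint compatible with that selection. The sketch below illustrates that reading. The constraints, the variable names `surface_var` and `level_var`, and the helper `applicable_dimensions` are all hypothetical, and the actual CDS logic may differ.

```python
def applicable_dimensions(selection, constraints):
    """Dimensions appearing in at least one constraint compatible with `selection`.

    A constraint is treated as compatible when, for every selected dimension it
    also defines, at least one selected value is allowed. Selected dimensions
    that the constraint lacks are ignored (an assumption of this sketch).
    """
    dims = set()
    for c in constraints:
        compatible = all(
            set(selection[d]) & set(c[d]) for d in selection if d in c
        )
        if compatible:
            dims |= set(c)
    return dims


# Hypothetical constraints with different dimensionality.
constraints = [
    {"variable": ["surface_var"], "year": ["2024"]},
    {"variable": ["level_var"], "year": ["2024"], "pressure_level": ["50", "100"]},
]

print(sorted(applicable_dimensions({"variable": ["surface_var"]}, constraints)))
# ['variable', 'year']
print(sorted(applicable_dimensions({"variable": ["level_var"]}, constraints)))
# ['pressure_level', 'variable', 'year']
```

With these constraints, `pressure_level` would be left out of a sub-request for the single-level variable and included for the multi-level one.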