ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

How to select the latest version of a dataset on ESGF? #1523

Open bouweandela opened 2 years ago

bouweandela commented 2 years ago

It looks like version selection in the esmvalcore.esgf module can mix files from two different versions of a dataset. This happens when different versions contain differently named files, as in the rcp85 results below, where v20130315 and v20171115 cover the same period with different files.

>>> from esmvalcore.esgf import find_files
>>> files = find_files(project="CMIP5", mip="Amon", short_name="tas", dataset='EC-EARTH', exp=['historical', 'rcp85'], ensemble='r6i1p1')
>>> for file in files:
...     print(file)
... 
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_historical_r6i1p1_190001-194912.nc on hosts ['aims3.llnl.gov', 'esgf-data1.ceda.ac.uk', 'esgf.nci.org.au', 'esgf2.dkrz.de']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_historical_r6i1p1_195001-200512.nc on hosts ['aims3.llnl.gov', 'esgf-data1.ceda.ac.uk', 'esgf.nci.org.au', 'esgf2.dkrz.de']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_200601-200912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20171115/tas_Amon_EC-EARTH_rcp85_r6i1p1_200601-205012.nc on hosts ['esgf.ichec.ie', 'esgf.nci.org.au']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_201001-201912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_202001-202912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_203001-203912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_204001-204912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_205001-205912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20171115/tas_Amon_EC-EARTH_rcp85_r6i1p1_205101-210012.nc on hosts ['esgf.ichec.ie', 'esgf.nci.org.au']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_206001-206912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_207001-207912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_208001-208912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_209001-209912.nc on hosts ['aims3.llnl.gov']
ESGFFile:cmip5/output1/ICHEC/EC-EARTH/rcp85/mon/atmos/Amon/r6i1p1/v20130315/tas_Amon_EC-EARTH_rcp85_r6i1p1_210001-210012.nc on hosts ['aims3.llnl.gov']
>>> 

Specifically, this code https://github.com/ESMValGroup/ESMValCore/blob/c4696b4db16e61ff3d3a2c825e817e00e841cd0a/esmvalcore/esgf/_search.py#L44-L63 first creates a list of all available files and then selects the latest version of each file. It works per file rather than per dataset because it was reported in https://github.com/ESMValGroup/ESMValCore/issues/286 that not every file is present in every version. Maybe we should make the code that selects the most recent version of the data more advanced and add a second pass over the available data that looks at the temporal coverage?
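
To make the idea concrete, here is a minimal sketch of per-file version selection followed by a second pass on temporal coverage. It is not the ESMValCore implementation: the `(dataset, version, filename)` tuples and the functions `latest_per_file`, `months`, and `drop_superseded` are hypothetical names used only for illustration. It also assumes that versions have the form `vYYYYMMDD` (so they sort correctly as strings) and that file names end in `_YYYYMM-YYYYMM.nc`.

```python
import re

# Hypothetical records, abbreviated from the rcp85 listing above.
FILES = [
    ("rcp85", "v20130315", "tas_Amon_EC-EARTH_rcp85_r6i1p1_200601-200912.nc"),
    ("rcp85", "v20130315", "tas_Amon_EC-EARTH_rcp85_r6i1p1_201001-201912.nc"),
    ("rcp85", "v20130315", "tas_Amon_EC-EARTH_rcp85_r6i1p1_205001-205912.nc"),
    ("rcp85", "v20171115", "tas_Amon_EC-EARTH_rcp85_r6i1p1_200601-205012.nc"),
    ("rcp85", "v20171115", "tas_Amon_EC-EARTH_rcp85_r6i1p1_205101-210012.nc"),
]


def latest_per_file(files):
    """First pass: keep the newest version of each file name."""
    newest = {}
    for dataset, version, name in files:
        key = (dataset, name)
        # Assumes vYYYYMMDD versions, which compare correctly as strings.
        if key not in newest or version > newest[key][1]:
            newest[key] = (dataset, version, name)
    return sorted(newest.values(), key=lambda record: record[2])


def months(name):
    """Return the set of month indices covered by a file, parsed from its name."""
    match = re.search(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$", name)
    start_year, start_month, end_year, end_month = map(int, match.groups())
    return set(range(start_year * 12 + start_month - 1,
                     end_year * 12 + end_month))


def drop_superseded(files):
    """Second pass: drop files whose months are fully covered by newer versions."""
    kept = []
    for dataset, version, name in files:
        covered_by_newer = set()
        for other_dataset, other_version, other_name in files:
            if other_dataset == dataset and other_version > version:
                covered_by_newer |= months(other_name)
        if not months(name) <= covered_by_newer:
            kept.append((dataset, version, name))
    return kept


mixed = latest_per_file(FILES)      # all names differ, so both versions remain
selected = drop_superseded(mixed)   # only the v20171115 files remain
for record in selected:
    print(record)
```

Because the two versions share no file names, the first pass alone keeps all five files. The second pass compares covered months rather than whole file ranges, so an old file such as `205001-205912`, which straddles two newer files, is still recognized as superseded. An old file that covers months missing from the newer version would be kept, which preserves the behaviour requested in issue 286 but can still leave a mix of versions.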

Originally posted by @bouweandela in https://github.com/ESMValGroup/ESMValTool/pull/2563#issuecomment-1059027527

zklaus commented 2 years ago

Have you tried adding the latest=True facet? I have used that in the past for other API searches following the instructions here.

bouweandela commented 2 years ago

Yes: https://github.com/ESMValGroup/ESMValCore/blob/a162a612fc1bd5d7ec88ee99f737860fd29407da/esmvalcore/esgf/_search.py#L84

The problem is that this doesn't work, because some datasets are incorrectly labeled as latest on ESGF.