NOAA-GFDL / CatalogBuilder

Toolset to build, validate, modify, and use intake-esm based data catalogs

ERA5 catalog progress + Python issue #36

Open meteorologist15 opened 2 months ago

meteorologist15 commented 2 months ago

The manual catalog for ERA5 data, coupled with the JSON generated by the CatalogBuilder, can be ingested by intake-esm, but only partially. The unmodified catalog contains the following data:

activity_id,institution_id,source_id,experiment_id,frequency,modeling_realm,table_id,member_id,variable_id,temporal_subset,chunk_freq,grid_label,platform,dimensions,cell_methods,path
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Hourly_Data_On_Pressure_Levels,hourly,atmos,,1,specific_humidity,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/1000hPa/1hr-timestep/annual_file-range/specific_humidity/ERA5_1hr_specific_humidity_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,atmos,,1,10m_u_component_of_wind,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/reanalysis/global/annual_file-range/Wind/u_10m/ERA5_monthly_averaged_10m_u_component_of_wind_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,ocean,,1,peak_wave_period,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/reanalysis/global/annual_file-range/Ocean_waves/peak_wave_period/ERA5_monthly_averaged_peak_wave_period_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,atmos,,1,mean_surface_downward_short_wave_radiation_flux,1940-2023,annual,,,longitude|latitude|number|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/ensemble_members/global/annual_file-range/Mean_rates/mean_surface_downward_short-wave_rad_flux/ERA5_monthly_averaged_mean_surface_downward_short_wave_radiation_flux_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,atmos,,1,10m_wind_speed,1940-2023,annual,,,longitude|latitude|number|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/ensemble_members/global/annual_file-range/Wind/10m_wind_speed/ERA5_monthly_averaged_10m_wind_speed_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Hourly_Data_On_Single_Levels,hourly,atmos,,1,mean_surface_downward_short_wave_radiation_flux,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Hourly_Data_On_Single_Levels/ensemble_mean/global/3hr-timestep/annual_file-range/Mean_rates/mean_surface_downward_short-wave_rad_flux/ERA5_3hr_mean_surface_downward_short_wave_radiation_flux_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Pressure_Levels,monthly,atmos,,1,fraction_of_cloud_cover,1940-2023,annual,,,longitude|latitude|level|time,,/uda/ERA5/Monthly_Averaged_Data_On_Pressure_Levels/reanalysis/global/all_levels/annual_file-range/cloud_cover_fraction/ERA5_monthly_averaged_fraction_of_cloud_cover_2022.nc
ECMWF_Reanalysis_Phase_5_Land,ECMWF,ECMWF_Reanalysis,ERA5-Land_Monthly_Averaged_Data,monthly,land,,1,lake_mix_layer_temperature,1950-2023,annual,,,longitude|latitude|time,,/uda/ERA5/ERA5-Land_Monthly_Averaged_Data/reanalysis/global/annual_file-range/Lakes/lake_mix-layer_temp/ERA5-Land_monthly_averaged_lake_mix_layer_temperature_2023.nc
ECMWF_Reanalysis_Phase_5_Extra,ECMWF,ECMWF_Reanalysis,ERA5_Extra,hourly,atmos,,1,updraught,1979-2023,monthly,,1,initial_time0_hours|forecast_time0|lv_HYBL0|lat_0|lon_0|ncl_strlen_0,,/uda/ERA5/ERA5_Extra/reanalysis/global/monthly_file-range/updraught/ERA5MARS_updraught_202012.nc4

The catalog is then opened with intake-esm:

>>> data_catalog_3 = intake.open_esm_datastore("ERA5_initCatalog_slimmed.json")
/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/cat.py:269: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
  self._df.sample(20, replace=True)
>>> data_catalog_3.df
                       activity_id institution_id         source_id  ...                                         dimensions cell_methods                                               path
0         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Single_Levels/reanaly...
1         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Pressure_Levels/reana...
2         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
3         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
4         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                     longitude|latitude|number|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
5         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                     longitude|latitude|number|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
6         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Single_Levels/ensembl...
7         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Single_Levels/reanaly...
8         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                      longitude|latitude|level|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Pressure_Le...
9    ECMWF_Reanalysis_Phase_5_Land          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/ERA5-Land_Monthly_Averaged_Data/rean...
10  ECMWF_Reanalysis_Phase_5_Extra          ECMWF  ECMWF_Reanalysis  ...  initial_time0_hours|forecast_time0|lv_HYBL0|la...          NaN  /uda/ERA5/ERA5_Extra/reanalysis/global/monthly...

[11 rows x 16 columns]
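Before calling `to_dataset_dict()`, the loaded frame can be subset with intake-esm's `search()` or directly on the underlying pandas DataFrame. A minimal sketch with toy rows mirroring the catalog above (paths are placeholders, not the real `/uda/ERA5` paths):

```python
import pandas as pd

# Toy rows mirroring two catalog entries above (paths are placeholders)
df = pd.DataFrame({
    "experiment_id": ["Hourly_Data_On_Pressure_Levels",
                      "Monthly_Averaged_Data_On_Single_Levels"],
    "variable_id": ["specific_humidity", "10m_u_component_of_wind"],
    "path": ["specific_humidity_2023.nc", "u10_2023.nc"],
})

# Equivalent of data_catalog_3.search(variable_id="specific_humidity").df,
# but applied to the plain DataFrame
subset = df[df["variable_id"] == "specific_humidity"]
print(len(subset))  # 1
```

Subsetting first keeps `to_dataset_dict()` from touching assets that are known to be problematic.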

Calling `to_dataset_dict()` then produces the following warnings and error:

>>> dsets_3 = data_catalog_3.to_dataset_dict()

--> The keys in the returned dictionary of datasets are constructed as follows:
        'source_id.experiment_id.frequency.modeling_realm.member_id.chunk_freq'
/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/indexing.py:1452: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  value = value[(slice(None),) * axis + (subkey,)]
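The PerformanceWarning itself is benign; per the warning text, it can be silenced globally before loading. A sketch (`data_catalog_3` is the datastore opened above):

```python
import dask

# Accept the large chunks produced by slicing so the warning is not raised;
# set True instead if the large chunks themselves are the concern
dask.config.set({"array.slicing.split_large_chunks": False})

# dsets_3 = data_catalog_3.to_dataset_dict()  # proceeds without the warning
```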
Traceback (most recent call last):
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 259, in _open_dataset
    self._ds = xr.combine_by_coords(datasets, **self.xarray_combine_by_coords_kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 958, in combine_by_coords
    concatenated_grouped_by_data_vars = tuple(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 959, in <genexpr>
    _combine_single_variable_hypercube(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 630, in _combine_single_variable_hypercube
    concatenated = _combine_nd(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 232, in _combine_nd
    combined_ids = _combine_all_along_first_dim(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 267, in _combine_all_along_first_dim
    new_combined_ids[new_id] = _combine_1d(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 290, in _combine_1d
    combined = concat(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/concat.py", line 252, in concat
    return _dataset_concat(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/concat.py", line 597, in _dataset_concat
    raise ValueError(
ValueError: coordinate 't2m' not present in all datasets.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/pydantic/deprecated/decorator.py", line 55, in wrapper_function
    return vd.call(*args, **kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/pydantic/deprecated/decorator.py", line 150, in call
    return self.execute(m)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/pydantic/deprecated/decorator.py", line 222, in execute
    return self.raw_function(**d, **var_kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/core.py", line 686, in to_dataset_dict
    raise exc
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/core.py", line 682, in to_dataset_dict
    key, ds = task.result()
  File "/net2/ker/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/net2/ker/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/net2/ker/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/core.py", line 824, in _load_source
    return key, source.to_dask()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 272, in to_dask
    self._load_metadata()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake/source/base.py", line 283, in _load_metadata
    self._schema = self._get_schema()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 208, in _get_schema
    self._open_dataset()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 264, in _open_dataset
    raise ESMDataSourceError(
intake_esm.source.ESMDataSourceError: Failed to load dataset with key='ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual'
                 You can use `cat['ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual'].df` to inspect the assets/files for this key.

After removing the offending datasets (in this case, the files containing t2m (2-meter temperature) and blh (boundary layer height)), I am able to successfully generate output from the `to_dataset_dict()` method. Example below:

{'ECMWF_Reanalysis.Monthly_Averaged_Data_On_Single_Levels.monthly.ocean.1.annual': <xarray.Dataset>
Dimensions:    (longitude: 720, latitude: 361, time: 12)
Coordinates:
  * longitude  (longitude) float32 0.0 0.5 1.0 1.5 ... 358.0 358.5 359.0 359.5
  * latitude   (latitude) float32 90.0 89.5 89.0 88.5 ... -89.0 -89.5 -90.0
  * time       (time) datetime64[ns] 2023-01-01 2023-02-01 ... 2023-12-01
Data variables:
    pp1d       (time, latitude, longitude) float32 dask.array<chunksize=(12, 361, 720), meta=np.ndarray>
Attributes: (12/17)
    Conventions:                       CF-1.6
    history:                           2024-05-01 21:23:52 GMT by grib_to_net...
    intake_esm_vars:                   ['peak_wave_period']
    intake_esm_attrs:activity_id:      ECMWF_Reanalysis_Phase_5
    intake_esm_attrs:institution_id:   ECMWF
    intake_esm_attrs:source_id:        ECMWF_Reanalysis
    ...                                ...
    intake_esm_attrs:temporal_subset:  1940-2023
    intake_esm_attrs:chunk_freq:       annual
    intake_esm_attrs:dimensions:       longitude|latitude|time
    intake_esm_attrs:path:             /uda/ERA5/Monthly_Averaged_Data_On_Sin...
    intake_esm_attrs:_data_format_:    netcdf
    intake_esm_dataset_key:            ECMWF_Reanalysis.Monthly_Averaged_Data..., 'ECMWF_Reanalysis.Hourly_Data_On_Pressure_Levels.hourly.atmos.1.annual': <xarray.Dataset>
Dimensions:    (longitude: 1440, latitude: 721, time: 8760)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * time       (time) datetime64[ns] 2023-01-01 ... 2023-12-31T23:00:00
Data variables:
    q          (time, latitude, longitude) float32 dask.array<chunksize=(8760, 721, 1440), meta=np.ndarray>
Attributes: (12/17)
    Conventions:                       CF-1.6
    history:                           2024-04-13 15:24:29 GMT by grib_to_net...
    intake_esm_vars:                   ['specific_humidity']
    intake_esm_attrs:activity_id:      ECMWF_Reanalysis_Phase_5
    intake_esm_attrs:institution_id:   ECMWF
    intake_esm_attrs:source_id:        ECMWF_Reanalysis
    ...                                ...
    intake_esm_attrs:temporal_subset:  1940-2023
    intake_esm_attrs:chunk_freq:       annual
    intake_esm_attrs:dimensions:       longitude|latitude|time
    intake_esm_attrs:path:             /uda/ERA5/Hourly_Data_On_Pressure_Leve...
    intake_esm_attrs:_data_format_:    netcdf
    intake_esm_dataset_key:            ECMWF_Reanalysis.Hourly_Data_On_Pressu..., 'ECMWF_Reanalysis.Monthly_Averaged_Data_On_Single_Levels.monthly.atmos.1.annual': <xarray.Dataset>
Dimensions:    (longitude: 1440, latitude: 721, time: 12)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * time       (time) datetime64[ns] 2023-01-01 2023-02-01 ... 2023-12-01
Data variables:
    u10        (time, latitude, longitude) float32 dask.array<chunksize=(12, 721, 1440), meta=np.ndarray>
Attributes: (12/17)
    Conventions:                       CF-1.6
    history:                           2024-05-03 23:14:39 GMT by grib_to_net...
    intake_esm_vars:                   ['10m_u_component_of_wind']
    intake_esm_attrs:activity_id:      ECMWF_Reanalysis_Phase_5
    intake_esm_attrs:institution_id:   ECMWF
    intake_esm_attrs:source_id:        ECMWF_Reanalysis
    ...                                ...
    intake_esm_attrs:temporal_subset:  1940-2023
    intake_esm_attrs:chunk_freq:       annual
    intake_esm_attrs:dimensions:       longitude|latitude|time
    intake_esm_attrs:path:             /uda/ERA5/Monthly_Averaged_Data_On_Sin...
    intake_esm_attrs:_data_format_:    netcdf
    intake_esm_dataset_key:            ECMWF_Reanalysis.Monthly_Averaged_Data..., 'ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual': <xarray.Dataset>

Path to unmodified catalog (CSV): /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed.csv
Path to unmodified catalog's associated JSON: /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed.json

Path to modified catalog (CSV): /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed_modified.csv
Path to modified catalog's associated JSON: /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed_modified.json
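The modified CSV was produced by dropping the offending rows. A sketch of that filtering step with pandas (toy rows; the real file carries the full 16-column header shown earlier, and the long variable names here follow the ERA5 naming convention but are illustrative):

```python
import pandas as pd

# Toy catalog: the t2m / blh source files are the ones that broke combine
cat = pd.DataFrame({
    "variable_id": ["specific_humidity", "2m_temperature", "boundary_layer_height"],
    "path": ["q.nc", "t2m.nc", "blh.nc"],
})

# Drop the assets whose files carry the offending variables, then re-save
bad_vars = {"2m_temperature", "boundary_layer_height"}
slimmed = cat[~cat["variable_id"].isin(bad_vars)].reset_index(drop=True)
slimmed.to_csv("ERA5_initCatalog_slimmed_modified.csv", index=False)
```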

aradhakrishnanGFDL commented 2 months ago

Thanks @meteorologist15! This helps to see how we can use the catalog builder to generate the modified csv, as discussed.

aradhakrishnanGFDL commented 1 week ago

TODO: open new issues for the dev and testing with catalog builder

meteorologist15 commented 1 week ago

Catalog example generated with the Catalog Builder for the ERA5 dataset (pressure levels, geopotential variable, 300 hPa):

activity_id,institution_id,source_id,experiment_id,frequency,realm,table_id,member_id,grid_label,variable_id,time_range,chunk_freq,grid_label,platform,dimensions,cell_methods,path
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1940.nc
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1941.nc
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1942.nc
...etc

The categories preserved are experiment_id, variable_id, and path.

The configuration used:

headerlist: ["activity_id", "institution_id", "source_id", "experiment_id",
                  "frequency", "realm", "table_id",
                  "member_id", "grid_label", "variable_id",
                  "time_range", "chunk_freq","grid_label","platform","dimensions","cell_methods","path"]

output_path_template: ['NA', 'NA', 'experiment_id', 'NA', 'NA', 'NA', 'NA', 'NA', 'variable_id']

output_file_template: ['NA', 'NA', 'variable_id', 'NA']

input_path: "/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/"

output_path: "/nbhome/Kristopher.Rand/uda/catalogs/test_catalogbuilder"
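Since the builder currently populates only experiment_id, variable_id, and path, one possible post-processing step is to backfill the remaining columns before handing the CSV to intake-esm. A hypothetical sketch; the fill values mirror the hand-written ERA5 rows earlier in this issue:

```python
import pandas as pd

# Toy builder output: most columns empty, as in the generated CSV above
df = pd.DataFrame({
    "activity_id": ["", ""],
    "experiment_id": ["Hourly_Data_On_Pressure_Levels"] * 2,
    "variable_id": ["geopotential"] * 2,
})

# Constant metadata for this ERA5 collection (values from the manual catalog)
fills = {
    "activity_id": "ECMWF_Reanalysis_Phase_5",
    "institution_id": "ECMWF",
    "source_id": "ECMWF_Reanalysis",
    "frequency": "hourly",
    "realm": "atmos",
}
for col, val in fills.items():
    df[col] = val
```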

aradhakrishnanGFDL commented 1 week ago

@meteorologist15 I’m trying to run this. Are you using the main branch from this repository?

meteorologist15 commented 5 days ago

Committed a small change locally to gfdlcrawler to account for filenames without a "." in the name. Will commit it to a branch on GitHub soon.

meteorologist15 commented 5 days ago

Three separate issues exist: 1) Filenames with multi-word variable names separated by underscores, if the "_" character in filenames is to be checked. 2) If using "_" as a separator, properly capturing/resolving "monthly_averaged" in the filenames of monthly averaged datasets. 3) Variable names in the path that differ from the filename. Some more fundamental changes to the crawler script may be necessary.
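For issue 1, a minimal filename-parsing sketch showing one way to recover multi-word variable names from the underscore-delimited ERA5 filenames (the patterns below are illustrative, not the crawler's actual logic):

```python
import re

def parse_variable(filename: str) -> str:
    """Strip the dataset prefix, frequency token, and trailing date,
    leaving the (possibly multi-word) variable name."""
    name = filename.rsplit(".", 1)[0]                       # drop extension
    name = re.sub(r"^ERA5(-Land)?_", "", name)              # dataset prefix
    name = re.sub(r"^(monthly_averaged|\d+hr)_", "", name)  # frequency token
    name = re.sub(r"_\d{4,6}$", "", name)                   # trailing yyyy/yyyymm
    return name

print(parse_variable("ERA5_monthly_averaged_10m_wind_speed_2023.nc"))
# 10m_wind_speed
```

Stripping known prefixes and suffixes, rather than splitting on "_", sidesteps the multi-word variable-name problem entirely, though it would not handle issue 3 (variable names in the path that differ from the filename).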

aradhakrishnanGFDL commented 5 days ago

> Locally committed small change to gfdlcrawler to account for filenames in without a "." in its name. Awaiting to further commit to branch on github.

Great, thanks. You may use this as a reference, but the fastest approach, not the perfect one, is fine for now: https://docs.google.com/document/d/17nlIgSQPwL1MFqwHlRV8R5vCpug08r71tM75poGpQtc/edit#heading=h.60aeh5dnv42m