Open meteorologist15 opened 2 months ago
Thanks @meteorologist15! This helps to see how we can use the catalog builder to generate the modified csv, as discussed.
TODO: open new issues for the dev and testing with catalog builder
Catalog example generated with the Catalog Builder for the ERA5 dataset (pressure levels, geopotential variable, 300 hPa):
activity_id,institution_id,source_id,experiment_id,frequency,realm,table_id,member_id,grid_label,variable_id,time_range,chunk_freq,grid_label,platform,dimensions,cell_methods,path
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1940.nc
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1941.nc
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1942.nc
...etc
The categories preserved are experiment_id, variable_id, and path
The configuration used:
headerlist: ["activity_id", "institution_id", "source_id", "experiment_id",
"frequency", "realm", "table_id",
"member_id", "grid_label", "variable_id",
"time_range", "chunk_freq","grid_label","platform","dimensions","cell_methods","path"]
output_path_template: ['NA', 'NA', 'experiment_id', 'NA', 'NA', 'NA', 'NA', 'NA', 'variable_id']
output_file_template: ['NA', 'NA', 'variable_id', 'NA']
input_path: "/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/"
output_path: "/nbhome/Kristopher.Rand/uda/catalogs/test_catalogbuilder"
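For anyone reading along, here is a minimal sketch of how the two templates appear to line up with the example path. It is not the CatalogBuilder's actual code: `row_from_path` is a hypothetical helper, and splitting the filename stem on "_" is an assumption made for illustration. Each template entry is matched positionally against one path component (or one filename token), and 'NA' entries are skipped.

```python
from pathlib import PurePosixPath

# Templates copied from the configuration above.
OUTPUT_PATH_TEMPLATE = ['NA', 'NA', 'experiment_id', 'NA', 'NA', 'NA', 'NA', 'NA', 'variable_id']
OUTPUT_FILE_TEMPLATE = ['NA', 'NA', 'variable_id', 'NA']


def row_from_path(path, path_template, file_template, sep="_"):
    """Hypothetical helper: map path components and filename tokens to columns."""
    p = PurePosixPath(path)
    row = {}
    # Directory components without the leading "/" root entry.
    dir_parts = p.parent.parts[1:]
    for name, value in zip(path_template, dir_parts):
        if name != "NA":
            row[name] = value
    # Filename without the extension, split on the assumed separator.
    for name, value in zip(file_template, p.stem.split(sep)):
        if name != "NA":
            row[name] = value
    row["path"] = str(p)
    return row


example = (
    "/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/"
    "6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1940.nc"
)
row = row_from_path(example, OUTPUT_PATH_TEMPLATE, OUTPUT_FILE_TEMPLATE)
print(row["experiment_id"], row["variable_id"])
```

With the example path, only experiment_id, variable_id, and path end up populated, which matches the rows in the generated CSV.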
@meteorologist15 I’m trying to run this. Are you using the main branch from this repository?
Locally committed a small change to gfdlcrawler to account for filenames without a "." in their name. The commit still needs to be pushed to the branch on GitHub.
Three separate issues exist:
1) Filenames with multi-word variable names separated by underscores, which matters if the "_" character in filenames is to be checked.
2) If using "_" as a separator, properly capturing/resolving "monthly_averaged" in the filenames of monthly averaged datasets. Some more fundamental changes to the crawler script may be necessary.
3) Variable names in the path that differ from the filename.
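One possible way around the first two problems, sketched under assumptions: instead of a plain split on "_", anchor on the leading tokens and the trailing year so that everything in between is treated as the variable name. The filename layout (source, frequency-like token, variable, four-digit year), the group names, and the multi-word and monthly filenames below are hypothetical; only `ERA5_6hr_geopotential_1940` comes from the catalog above. This is not what gfdlcrawler currently does.

```python
import re

# Assumed layout: <source>_<frequency>_<variable>_<year>
# "monthly_averaged" is tried first so it is kept as a single token.
PATTERN = re.compile(
    r"(?P<source>[^_]+)_"
    r"(?P<frequency>monthly_averaged|[^_]+)_"
    r"(?P<variable>.+)_"
    r"(?P<year>\d{4})"
)


def parse_stem(stem):
    """Return the named parts of a filename stem, or None if it does not fit."""
    match = PATTERN.fullmatch(stem)
    return match.groupdict() if match else None


# Real stem from the catalog, then two hypothetical stems.
for stem in (
    "ERA5_6hr_geopotential_1940",
    "ERA5_6hr_boundary_layer_height_1940",
    "ERA5_monthly_averaged_geopotential_1940",
):
    print(stem, "->", parse_stem(stem))
```

This does nothing for the third problem (path and filename disagreeing on the variable name); that would still need a rule for which source wins.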
Great, thanks. You may use this as a reference. For now, the fastest approach is fine; it does not need to be the perfect one. https://docs.google.com/document/d/17nlIgSQPwL1MFqwHlRV8R5vCpug08r71tM75poGpQtc/edit#heading=h.60aeh5dnv42m
The manual catalog for ERA5 data, coupled with the JSON generated by the CatalogBuilder, can be ingested by intake-esm, but only partially. The unmodified catalog contains the following data:
The following is also run:
The following execution/error results:
After removing the offending datasets (in this case, the files containing t2m (2-meter temperature) and blh (boundary layer height)), I am able to successfully generate output from the "to_dataset_dict()" method. Example below:
Path to unmodified catalog (CSV): /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed.csv
Path to unmodified catalog's associated JSON: /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed.json
Path to modified catalog (CSV): /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed_modified.csv
Path to modified catalog's associated JSON: /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed_modified.json
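For reference, the removal step described above (dropping the t2m and blh entries to get the modified catalog) could be reproduced along these lines. This is only a sketch: it assumes the offending files can be identified by their variable_id value, `drop_variables` is a hypothetical helper, and the sample rows and /example/ paths are made up rather than taken from the real catalog.

```python
import csv
import io


def drop_variables(csv_text, variables, column="variable_id"):
    """Hypothetical helper: return the catalog CSV without rows for the given variables."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] not in variables:
            writer.writerow(row)
    return out.getvalue()


# Made-up catalog rows with a reduced set of columns.
sample = (
    "experiment_id,variable_id,path\n"
    "exp,geopotential,/example/geopotential.nc\n"
    "exp,t2m,/example/t2m.nc\n"
    "exp,blh,/example/blh.nc\n"
)
modified = drop_variables(sample, {"t2m", "blh"})
print(modified)
```

The same filter would be applied to the full CSV text before pointing the associated JSON at the modified file.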