aradhakrishnanGFDL / CatalogBuilder

CatalogBuilder for data discovery and analysis

add vocabulary schema constraints for frequency and chunk_freq #127

Closed ceblanton closed 1 month ago

ceblanton commented 1 month ago

Most vocabulary is unconstrained, meaning any value is acceptable.

Some vocabulary, however, makes more sense constrained. frequency and chunk_freq, currently included as "aggregate columns" in the GFDL schema (cats/gfdl_template.json), should be constrained to certain known values.

For frequency, we plan to use the community (CMIP6) vocabulary, extending it as needed:

https://raw.githubusercontent.com/NOAA-GFDL/CMIP6_CVs/master/CMIP6_frequency.json
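
As a rough illustration of what constraining frequency could look like, here is a minimal Python sketch. It assumes the CV file has a top-level "frequency" object whose keys are the allowed values; the sample below is a hand-written subset with illustrative descriptions, not the real file, and `allowed_values` and `check_frequency` are hypothetical helpers, not part of CatalogBuilder.

```python
import json

# Hand-written subset in the assumed shape of CMIP6_frequency.json.
# The real file has more entries; descriptions here are illustrative only.
SAMPLE_CV = json.loads("""
{
  "frequency": {
    "day": "daily",
    "mon": "monthly",
    "yr": "annual",
    "fx": "fixed (time invariant)"
  }
}
""")


def allowed_values(cv, key):
    """Return the set of allowed values stored under `key` in a CV document.

    Accepts either a mapping (value -> description) or a plain list.
    """
    entries = cv[key]
    return set(entries.keys()) if isinstance(entries, dict) else set(entries)


def check_frequency(value, cv=SAMPLE_CV):
    """Return `value` unchanged, or raise ValueError if it is not in the CV."""
    allowed = allowed_values(cv, "frequency")
    if value not in allowed:
        raise ValueError(
            f"frequency {value!r} not in vocabulary: {sorted(allowed)}"
        )
    return value


if __name__ == "__main__":
    print(check_frequency("mon"))
    try:
        check_frequency("monthly")
    except ValueError as err:
        print(err)
```

With this subset, "mon" passes and "monthly" is rejected, which is the mismatch discussed in this thread.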

chunk_freq is a little less clear, but it currently describes the amount of time in each file, e.g. 1yr, 2yr, 5yr, and so on.

A chunk_freq vocabulary file doesn't exist yet, but we could add it to our space at any time:

https://raw.githubusercontent.com/NOAA-GFDL/CMIP6_CVs/master/GFDL_chunk_freq.json
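
Since that file does not exist yet, the following is only a sketch of how its contents might be generated. The layout (a single "chunk_freq" list), the integer-plus-"yr" pattern, and the helper name `build_chunk_freq_cv` are all assumptions; only the values 1yr, 2yr, and 5yr come from this issue.

```python
import json
import re

# Values named in the issue text; the eventual list is undecided.
EXAMPLE_CHUNKS = ["1yr", "2yr", "5yr"]

# Assumed pattern: a positive integer followed by "yr".
CHUNK_PATTERN = re.compile(r"[1-9][0-9]*yr")


def build_chunk_freq_cv(values):
    """Return a CV document with a single "chunk_freq" list (assumed layout)."""
    bad = [v for v in values if not CHUNK_PATTERN.fullmatch(v)]
    if bad:
        raise ValueError(f"unexpected chunk_freq values: {bad}")
    # Sort numerically by the leading integer so that 10yr follows 5yr.
    return {"chunk_freq": sorted(values, key=lambda v: int(v[:-2]))}


if __name__ == "__main__":
    print(json.dumps(build_chunk_freq_cv(EXAMPLE_CHUNKS), indent=2))
```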

aradhakrishnanGFDL commented 1 month ago

Looks good. To clarify, as of now, the frequency values do not align with the CMIP6 frequency tables for the GFDL bronx-PP (canopy-symlink) output, but eventually they will?

ceblanton commented 1 month ago

It is a little mismatched currently. How about one of these?

  1. Change the canopy pp directory to be mon.
  2. Change the canopy pp directory to be monthly, and make the catalog builder "smarter" to detect this and set the frequency to mon. This would be OK; the filepath would be monthly but the vocabulary would be mon.
  3. Change the canopy pp directory to be monthly, and change the CMIP6_frequency.json to be monthly. (This is bad, I think, as we'd be moving away from the community standard.)
  4. Change the canopy pp directory to be monthly, and add monthly to the vocabulary as well as mon. (Having multiple valid values for monthly data could be bad.)
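
For reference, the "smarter" detection in option 2 could be sketched roughly as below. This is not CatalogBuilder code: `DIR_TO_FREQUENCY` and `frequency_from_path` are hypothetical names, the example paths are made up, and only the monthly to mon pair comes from this discussion.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from directory labels to CMIP6-style frequency values.
# Only the monthly -> mon pair comes from the discussion.
DIR_TO_FREQUENCY = {"monthly": "mon"}


def frequency_from_path(path, allowed):
    """Return the first path component that is, or maps to, an allowed frequency."""
    for part in PurePosixPath(path).parts:
        # Translate known directory labels; leave everything else unchanged.
        candidate = DIR_TO_FREQUENCY.get(part, part)
        if candidate in allowed:
            return candidate
    raise ValueError(f"no known frequency in path: {path}")


if __name__ == "__main__":
    allowed = {"mon", "day", "yr"}
    # Made-up paths: the first uses a "monthly" directory, the second "mon".
    print(frequency_from_path("/pp/atmos/ts/monthly/5yr/file.nc", allowed))
    print(frequency_from_path("/pp/atmos/ts/mon/5yr/file.nc", allowed))
```

Under option 1 the mapping table would be unnecessary, because the directory name would already be a valid vocabulary value.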

aradhakrishnanGFDL commented 1 month ago

[1] seems the cleanest!

ceblanton commented 1 month ago

Agreed. Option 1 is the cleanest to explain and code, since the directory name and the vocabulary value would both be mon, so let's go that way.