There are multiple ways to organize catalogs, collections, items and assets. I think we need to agree on something to move forward.

Requirements:

Handle files containing one data variable and files with multiple data variables
Handle datasets where the same variable is split across multiple files (e.g. 10-year slices)
Handle netCDF files and OPeNDAP links
Handle Zarr objects (future-proof)
Simplify typical ensemble creation operations (all simulations that have x,y,z variables for a,b,c experiments)
Ensure search queries return meaningful results, so that users are not drowned in them.

Options:

CMIP6 catalog / File item / Asset

All properties are at the File Item level, meaning Assets are just the various access endpoints. One simulation for a given model and experiment would be composed of multiple items (variables, periods). A typical search would return a large number of results.
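To make this option concrete, here is a rough sketch of what one File Item could look like as a plain Python dict. The property names (`cmip6:variable_id` and friends), the item ID and the example.org URLs are made up for illustration; nothing here is an agreed schema.

```python
def make_file_item(item_id, properties, http_url, opendap_url):
    """Build a STAC-like Item dict where all metadata sits on the Item
    and the Assets are only access endpoints for the same file."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": None,
        "properties": properties,
        "assets": {
            # Direct file download (hypothetical URL).
            "netcdf": {"href": http_url, "type": "application/x-netcdf", "roles": ["data"]},
            # OPeNDAP endpoint for the same file (hypothetical URL).
            "opendap": {"href": opendap_url, "roles": ["data"]},
        },
        "links": [],
    }


item = make_file_item(
    "tas_canesm5_ssp370_r1i1p2_2015-2024",
    {
        "datetime": None,
        "start_datetime": "2015-01-01T00:00:00Z",
        "end_datetime": "2024-12-31T23:59:59Z",
        # Hypothetical property names, for illustration only.
        "cmip6:variable_id": "tas",
        "cmip6:experiment_id": "ssp370",
        "cmip6:source_id": "CanESM5",
        "cmip6:member_id": "r1i1p2",
    },
    "https://example.org/files/tas_2015-2024.nc",
    "https://example.org/dodsC/tas_2015-2024.nc",
)
```

With this layout, one variable over 100 years in 10-year slices already gives 10 items per simulation, which is where the large number of search results comes from.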

CMIP6 catalog / Experiment collection / Model collection / Member collection / Variable collection / File item / Asset

Here we subdivide the catalog into multiple hierarchical collections. If we limit search results to collections, we'd be able to go down the hierarchy without being flooded by results (I assume). Aggregating the Items split by time periods would generate continuous time series.
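As a sketch of the aggregation step, the helper below orders the time slices found in a Variable collection and checks that they form a gap-free series. It assumes each Item can be reduced to a `(start_year, end_year)` pair with an inclusive end year; the function name and that representation are hypothetical.

```python
def sort_and_check_continuity(periods):
    """Order (start_year, end_year) slices and report whether each slice
    starts the year after the previous one ends (end year inclusive)."""
    ordered = sorted(periods)
    continuous = all(
        nxt[0] == prev[1] + 1 for prev, nxt in zip(ordered, ordered[1:])
    )
    return ordered, continuous


# Slices as they might come back from a search, in arbitrary order.
ordered, continuous = sort_and_check_continuity([(2025, 2034), (2015, 2024), (2035, 2044)])
```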

Note that Collection IDs should be globally unique, meaning that the variable collection cannot simply be named tas, but would have to look something like cmip6_ssp370_canesm5_r1i1p2_tas. It is not clear how to deal with files that store multiple variables in this scheme, but the collection ID could be cmip6_ssp370_canesm5_r1i1p2_multi in those cases.
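The naming scheme could be generated with something like the following. This is only a sketch: the `collection_id` helper and its arguments are hypothetical, and the `multi` suffix is just the fallback suggested above for multi-variable files.

```python
def collection_id(project, experiment, model, member, variables):
    """Build a globally unique collection ID from the hierarchy levels.
    Files holding several variables fall back to the 'multi' suffix."""
    variable_part = variables[0] if len(variables) == 1 else "multi"
    parts = [project, experiment, model, member, variable_part]
    return "_".join(part.lower() for part in parts)


single = collection_id("CMIP6", "ssp370", "CanESM5", "r1i1p2", ["tas"])
multi = collection_id("CMIP6", "ssp370", "CanESM5", "r1i1p2", ["tas", "pr"])
```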

It is unclear how search would work, since collection search is still at the proposal stage.

CMIP6 catalog / Experiment collection / Model collection / Member collection / Variable item / Asset

The only difference from the previous option is that, for multiple time periods, we'd have only one item with multiple assets. This would mean that the start and end dates would be asset properties, possibly messing with search functionality:

As detailed above, Items contain properties, which are the main source of metadata for searching across Items. Many content extensions can add further property fields as well. Any property that can be specified for an Item can also be specified for a specific asset. This can be used to override a property defined in the Item, or to specify fields for which there is no single value for all assets.

It is important to note that the STAC API does not facilitate searching across Asset properties in this way, and this should be used sparingly. It is primarily used to define properties at the Asset level that may be used during use of the data instead of for searching.
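One possible way around this, offered as an untested idea rather than part of the option as described, would be to keep the per-file periods on the assets but also copy the overall envelope to the Item, so that a temporal search still matches at the Item level. The asset keys and the `item_time_envelope` helper below are hypothetical.

```python
from datetime import datetime


def item_time_envelope(assets):
    """Derive Item-level start/end datetimes as the envelope of the
    start_datetime/end_datetime values stored on each asset."""
    starts = [datetime.fromisoformat(a["start_datetime"]) for a in assets.values()]
    ends = [datetime.fromisoformat(a["end_datetime"]) for a in assets.values()]
    return {
        "datetime": None,
        "start_datetime": min(starts).isoformat(),
        "end_datetime": max(ends).isoformat(),
    }


# Two time slices of the same variable, stored as assets of one Variable item.
assets = {
    "2015-2024": {
        "href": "https://example.org/files/tas_2015-2024.nc",
        "start_datetime": "2015-01-01T00:00:00+00:00",
        "end_datetime": "2024-12-31T23:59:59+00:00",
    },
    "2025-2034": {
        "href": "https://example.org/files/tas_2025-2034.nc",
        "start_datetime": "2025-01-01T00:00:00+00:00",
        "end_datetime": "2034-12-31T23:59:59+00:00",
    },
}
envelope = item_time_envelope(assets)
```

A search on the Item would then find the whole series, but picking the right file for a sub-period would still have to happen client-side from the asset properties.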

CMIP6 catalog / Simulation collection / File item / Asset

Here there is only one collection level that would indicate which files can be aggregated. The criteria would be for files to share the same spatial grid and calendar, and originate from the same climate model. A Simulation collection would include all experiments and realizations. Variables on the same grid would also be part of the same collection, but that would mean the same model would typically need at least two collections (atmos and ocean, which are on different grids).
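The grouping criteria could be sketched like this, assuming each file has already been reduced to a small metadata record. The keys `model`, `grid`, `calendar` and `path`, and the grid labels, are placeholders rather than actual CMIP6 attribute names.

```python
from collections import defaultdict


def group_into_simulation_collections(files):
    """Group files that share model, spatial grid and calendar, i.e. the
    files that could be aggregated into one Simulation collection."""
    groups = defaultdict(list)
    for f in files:
        key = (f["model"], f["grid"], f["calendar"])
        groups[key].append(f["path"])
    return dict(groups)


files = [
    {"path": "tas_ssp370_r1i1p2.nc", "model": "CanESM5", "grid": "atmos", "calendar": "noleap"},
    {"path": "pr_ssp585_r2i1p2.nc", "model": "CanESM5", "grid": "atmos", "calendar": "noleap"},
    {"path": "tos_ssp370_r1i1p2.nc", "model": "CanESM5", "grid": "ocean", "calendar": "noleap"},
]
collections = group_into_simulation_collections(files)
```

In this toy example the same model ends up with two collections, one per grid, with different experiments and realizations mixed inside the atmos one.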

I'm sure I missed a lot of potential issues, and I haven't yet done a review of other implementations to understand their organization. Most of the reading I've done on this had to do with Zarr datasets, and how to describe them within STAC.

crim-ca / stac-populator

STAC Catalog architecture #21
