NCAR / esm-collection-spec

Earth System Model Collection specification
Apache License 2.0

Roadmap for merging with STAC #21

Open rabernat opened 4 years ago

rabernat commented 4 years ago

This is a follow up to the discussion in https://github.com/radiantearth/stac-spec/issues/713#issuecomment-613992972.

On 2020-04-20, I had a call with @jhamman, @cholmes, @m-mohr, and @matthewhanson. The aim was to make progress on something everyone wants: merging the ESM collection spec with STAC. That was our intention from the beginning, but we chose to fork temporarily to get something working fast.

The goal for now is to make as few changes as possible to get this working. My recollection of the meeting is that there are two steps to the proposed plan:

In https://github.com/radiantearth/stac-spec/issues/713#issuecomment-613992972, @m-mohr worked up a really nice example of how it might look. During the meeting, we agreed that we won't try to also use the datacube extension. That is an eventual goal as well, but we noted several challenges in terms of reconciling datacube with Zarr and CF metadata.

So here I repeat @m-mohr's example, minus the datacube part:

{
  // STAC collection fields
  "stac_version": "0.9.0",
  "stac_extensions": [
    "asset",
    "esm" // A new extension based on the ESM collection spec
  ],
  "id": "pangeo-cmip6",
  "title": "Google CMIP6",
  "description": "This is an ESM collection for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "extent": {
    "spatial": {
      "bbox": [[-180, -90, 180, 90]]
    },
    "temporal": {
      "interval": [["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]]
    }
  },
  "providers": [
    {
      "name": "World Climate Research Programme",
      "roles": ["producer","licensor"],
      "url": "https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6"
    },
    {
      "name": "The Pangeo Project",
      "roles": ["processor"],
      "url": "https://console.cloud.google.com/pangeo.io"
    },
    {
      "name": "Google",
      "roles": ["host"],
      "url": "https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
    }
  ],
  "license": "CC BY-SA 4.0",
  "links": [
    {
      "href": "https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html",
      "type": "text/html",
      "rel": "license",
      "title": "CMIP6: Terms of Use"
    }
  ],
  "summaries": {
    // Could hold additional metadata as defined for STAC Items, not sure what could be relevant.
  },
  // Asset extension, extended by ESM extension to support asset-level metadata (adds the `href` property), ESM also defines "column_name" and specific roles ("catalog", "attribute").
  "assets": {
    "catalog": {
      // Optional, otherwise specify esm:catalog below
      "roles": ["esm-catalog"],
      "type": "text/csv", // Previously assets.format 
      "column_name": "path",
      "title": "Catalog",
      "description": "Path to the CSV file with the catalog contents.",
      "href": "sample-pangeo-cmip6-zarr-stores.csv"
    },
    // All attributes / vocabulary files, we may also move these out of the assets, depending on whether there's usually a "href" set or not. If not, it could simply be moved to a field "esm:attributes" with the same structure as in the ESM spec.
    "activity_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "activity_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    "source_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "source_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    "institution_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "institution_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    "experiment_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "experiment_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    "member_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "member_id"
    },
    "table_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "table_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    "variable_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "variable_id"
    },
    "grid_label": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "grid_label",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    }
  },
  // ESM extension fields
  "esm:catalog": {}, // Optional, previously the "catalog dict" if no "catalog" asset is available 
  "esm:aggregation_control": {
    // As defined by the ESM spec
  }
}

One thing I changed was to define the role for the asset as esm-catalog rather than catalog. This can hopefully let a processor (like intake-esm) know that this asset has a special role within the esm extension.
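
As a rough illustration of how a processor might pick out that asset, here is a minimal Python sketch. It assumes the collection has already been loaded into a plain dict; `find_assets_by_role` is a hypothetical helper, not part of intake-esm or any STAC library, and the collection below is a trimmed copy of the example above.

```python
ESM_CATALOG_ROLE = "esm-catalog"  # role name taken from the example above


def find_assets_by_role(collection: dict, role: str) -> dict:
    """Return the assets whose "roles" list contains `role` (hypothetical helper)."""
    assets = collection.get("assets", {})
    return {
        key: asset
        for key, asset in assets.items()
        if role in asset.get("roles", [])
    }


# Trimmed copy of the example collection, with the comments removed.
collection = {
    "id": "pangeo-cmip6",
    "assets": {
        "catalog": {
            "roles": ["esm-catalog"],
            "type": "text/csv",
            "column_name": "path",
            "href": "sample-pangeo-cmip6-zarr-stores.csv",
        },
        "activity_id": {
            "roles": ["attribute"],
            "type": "application/json",
            "column_name": "activity_id",
            "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json",
        },
    },
}

catalog_assets = find_assets_by_role(collection, ESM_CATALOG_ROLE)
print(sorted(catalog_assets))  # ['catalog']
```

Matching on the role rather than on the asset key means the key itself ("catalog" here) stays free-form.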


I'd love some feedback on whether I remembered the meeting accurately (it was a few days ago and our notes were sparse) and whether this sounds like a good plan. The STAC folks proposed organizing a 2-hour sprint to bang this out, and I think that's a great idea. I would not be free until the first week of May. If others agree (particularly need help from @andersy005 and @charlesbluca), I'll send out a Doodle.

m-mohr commented 4 years ago
  • Define an esm extension as a new valid STAC extension. That extension will probably need to live in a new repo (I propose NCAR/stac-esm), or alternatively this repo could morph into that project.

I guess you could just start with a branch here?

"type": "application/json", // Previously assets.format

That's meant to be text/csv (or whatever media type applies for CSV files), of course.

One thing I changed was to define the role for the asset as esm-catalog rather than catalog. This can hopefully let a processor (like intake-esm) know that this asset has a special role within the esm extension.

I would not be free until the first week of May. If others agree (particularly need help from @andersy005 and @charlesbluca), I'll send out a Doodle.

First week of May sounds good to me. Doodle is good, too.

andersy005 commented 4 years ago

Thank you @rabernat @jhamman & @m-mohr for putting this together! Looking forward to seeing this brought to completion.

If others agree (particularly need help from @andersy005 and @charlesbluca), I'll send out a Doodle.

The first week of May works for me as well.

rabernat commented 4 years ago

That's meant to be text/csv (or whatever media type applies for CSV files), of course.

Fixed

I have created a Doodle here: https://doodle.com/poll/756diubsfb3x5nb2. The goal of this meeting is to have a 2-hour block where we all work on this simultaneously. I am hoping that, at minimum, @m-mohr, @andersy005, @charlesbluca, and I can attend. It would also be great to have @naomi-henderson, @jhamman, @cholmes, and @matthewhanson. If you can't make the full two hours, that's okay; just click "if need be" in Doodle.

The goals of the hack session are:

If time permits, we can start updating processing tools (e.g. pangeo catalog, intake-esm) to adapt to the new conventions. However, this is not the main goal.

Anything I missed?

rabernat commented 4 years ago

The winning time is

May 7 THU 1:00 PM - 3:00 PM EDT

We can use https://whereby.com/pangeo to chat / coordinate.

m-mohr commented 4 years ago

A few updates before the telco: based on the last telco, I tried to come up with a new example that I think better aligns both specs. The biggest change, and probably the biggest point of discussion, is splitting the vocabulary links into assets and a separate array of attribute names.

{
  "stac_version": "0.9.0",
  "stac_extensions": [
    "collection-assets",
    "https://github.com/NCAR/esm-collection-spec/tree/master/schema.json"
  ],
  "id": "pangeo-cmip6",
  "title": "Google CMIP6",
  "description": "This is an ESM collection for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "extent": {
    "spatial": {
      "bbox": [[-180, -90, 180, 90]]
    },
    "temporal": {
      "interval": [["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]]
    }
  },
  "providers": [
    {
      "name": "World Climate Research Programme",
      "roles": ["producer","licensor"],
      "url": "https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6"
    },
    {
      "name": "The Pangeo Project",
      "roles": ["processor"],
      "url": "https://console.cloud.google.com/pangeo.io"
    },
    {
      "name": "Google",
      "roles": ["host"],
      "url": "https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
    }
  ],
  "license": "proprietary",
  "links": [
    {
      "href": "https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html",
      "type": "text/html",
      "rel": "license",
      "title": "CMIP6: Terms of Use"
    }
  ],
  "assets": {
    "thumbnail": {
      "href": "logo.png",
      "title": "A preview image for visualization.",
      "type": "image/png",
      "roles": ["thumbnail"]
    },
    "catalog": {
      "href": "sample-pangeo-cmip6-zarr-stores.csv",
      "title": "Catalog",
      "description": "Path to the CSV file with the catalog contents.",
      "type": "text/csv",
      "roles": ["esm-catalog"],
      "esm:column_name": "path"
    },
    "activity_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "activity_id"
    },
    "source_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "source_id"
    },
    "institution_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "institution_id"
    },
    "experiment_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "experiment_id"
    },
    "table_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "table_id"
    },
    "grid_label": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "grid_label"
    }
  },
  "esm:catalog": {},
  "esm:attributes": ["activity_id", "source_id", "institution_id", "experiment_id", "member_id", "table_id", "variable_id", "grid_label"],
  "esm:aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id",
      "source_id",
      "experiment_id",
      "table_id",
      "grid_label"
    ],
    "aggregations": [
      {
        "type": "join_new",
        "attribute_name": "member_id",
        "options": { "coords": "minimal", "compat": "override" }
      },
      {
        "type": "join_existing",
        "attribute_name": "time_range",
        "options": { "dim": "time" }
      },
      {
        "type": "union",
        "attribute_name": "variable_id"
      }
    ]
  }
}
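
To make the split between vocabulary assets and "esm:attributes" concrete, here is a minimal Python sketch of how a reader of this example could join the two. It is only an illustration under the layout proposed above: `vocabulary_by_attribute` is a hypothetical helper, and the collection is a trimmed copy of the example.

```python
def vocabulary_by_attribute(collection: dict) -> dict:
    """Map each name in "esm:attributes" to its vocabulary href, or None.

    Hypothetical helper: it assumes vocabulary assets carry the
    "esm-vocabulary" role and an "esm:column_name", as in the example above.
    """
    vocab = {}
    for asset in collection.get("assets", {}).values():
        if "esm-vocabulary" in asset.get("roles", []):
            vocab[asset["esm:column_name"]] = asset["href"]
    return {name: vocab.get(name) for name in collection.get("esm:attributes", [])}


# Trimmed copy of the example collection.
collection = {
    "assets": {
        "catalog": {
            "href": "sample-pangeo-cmip6-zarr-stores.csv",
            "type": "text/csv",
            "roles": ["esm-catalog"],
            "esm:column_name": "path",
        },
        "activity_id": {
            "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json",
            "type": "application/json",
            "roles": ["esm-vocabulary"],
            "esm:column_name": "activity_id",
        },
    },
    "esm:attributes": ["activity_id", "member_id", "variable_id"],
}

mapping = vocabulary_by_attribute(collection)
# Attributes that are listed but have no vocabulary file behind them.
missing = [name for name, href in mapping.items() if href is None]
print(missing)  # ['member_id', 'variable_id']
```

With this layout, attributes such as member_id and variable_id can be declared without inventing an asset that has no href.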

There were recently also some discussions in STAC on how to best integrate things like zarr. Based on https://github.com/radiantearth/stac-spec/issues/779 I'm working on collection-level assets (PR is coming in the next hours), which we'll probably use for the ESM collection extension. There have also been discussions on how we could allow Items to represent "parts" of a zarr archive, and we came up with nullable timestamps (see https://github.com/radiantearth/stac-spec/pull/798).
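
For readers unfamiliar with the nullable-timestamp idea, here is a minimal Python sketch. It assumes the rule as it later appeared in STAC 1.0 (an Item's `datetime` may be null only if `start_datetime` and `end_datetime` are both set); the exact wording in the linked PR may differ. The item id and the helper `has_valid_time_fields` are hypothetical, and the time range is borrowed from the collection extent above.

```python
def has_valid_time_fields(item: dict) -> bool:
    """Check the assumed rule: datetime is set, or both range bounds are set."""
    props = item.get("properties", {})
    if props.get("datetime") is not None:
        return True
    return (
        props.get("start_datetime") is not None
        and props.get("end_datetime") is not None
    )


# Hypothetical Item describing one part of a zarr archive by its time range.
part_item = {
    "type": "Feature",
    "id": "example-zarr-part",  # made-up identifier
    "properties": {
        "datetime": None,
        "start_datetime": "1850-01-15T12:00:00Z",
        "end_datetime": "2014-12-15T12:00:00Z",
    },
}

print(has_valid_time_fields(part_item))  # True
```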

cholmes commented 4 years ago

I've had some family stuff come up, so I may miss Thursday's meeting completely, and at the very least will likely be in and out. But I don't think I'm core to it - psyched to see what the group comes up with!

rabernat commented 4 years ago

Hi All! I'm looking forward to our little sprint today at 1pm EDT. I suggest we convene briefly at https://whereby.com/pangeo at 1pm to discuss our work plan.

andersy005 commented 4 years ago

Sounds good 👌! I will be there at 1pm.

m-mohr commented 4 years ago

Great work today. I went through the example PRs with the new JSON schema in #27 and left comments on how they could validate.

rabernat commented 4 years ago

Hi Folks--sorry for letting this hang for so long. I'd like to get the PRs merged asap. It seems like the only PR missing is @jhamman's narrative description of the new spec. Am I remembering things correctly?

I have assigned reviewers to all the PRs. Let's get them reviewed, approved, and merged.

m-mohr commented 4 years ago

It seems there are some points left for discussion, especially self-contained catalogs (i.e. esm:catalog).

jhamman commented 4 years ago

Just wanted to drop a quick note here to highlight the upcoming STAC sprint (https://medium.com/radiant-earth-insights/join-us-for-stac-sprint-6-our-first-fully-remote-event-28e118a5279c). Might be a good opportunity to push things forward on the esm spec front.

cholmes commented 4 years ago

Would definitely be great if people could join. I'd really love to get at least a small sample zarr+stac catalog up. May even be able to structure some sort of 'prize' to make that happen, as there are sponsors interested in seeing this happen, and I think it'd be a great test to ensure STAC is ready for 1.0

m-mohr commented 4 years ago

I'll be available the first and last day of the data sprint until around 11pm CEST, if you need me for anything.

andersy005 commented 4 years ago

@m-mohr, I plan to be at the sprint (excluding times I have meetings at work). Happy to help with getting what we started in #27 done at the sprint.

rabernat commented 4 years ago

I'm curious how this issue has progressed. Are we any closer to being able to catalog our cloud-based data in STAC? Is there a way I can help?