Describing data accessed from static endpoints (e.g. S3 object stores)?

cboettig commented 1 year ago

The Schema.org Dataset model, and the documentation here, describe accessing data either through Service Endpoints or as Data Downloads.

It is not entirely clear to me how to document data that is accessed through an object store and is not meant to be downloaded as individual assets. For example, consider the GBIF parquet snapshots on AWS S3. Yes, technically we can define a contentUrl for each of the 2053 parquet shards in an occurrence.parquet sub-directory, but really such data is intended to operate in a model somewhat closer to a service endpoint, where a tool like Apache Arrow is used to open an over-the-wire connection to the database root. However, it doesn't seem that a service endpoint is the right choice either, as this approach is not intended as a set of curl-based REST calls.
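To make the access pattern concrete, here is a minimal sketch of opening the dataset root, rather than individual shards, with Apache Arrow's Python bindings. The `s3://` root is an assumption inferred from the public bucket listing URL quoted later in this thread, and `split_bucket_uri` and `open_dataset` are hypothetical helper names. Opening the dataset requires `pyarrow` and network access.

```python
from urllib.parse import urlparse

# Assumption: bucket and prefix inferred from the public GBIF listing URL
# quoted in this thread; adjust to the snapshot you actually need.
DATASET_ROOT = "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/"


def split_bucket_uri(uri):
    """Split a bucket-style URI into (bucket, key prefix without slashes at the ends)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.strip("/")


def open_dataset(uri=DATASET_ROOT, region="us-east-1"):
    """Open all shards under the root as one logical Arrow dataset.

    Imports are local so the URI helper above works without pyarrow installed.
    """
    import pyarrow.dataset as ds
    from pyarrow import fs

    bucket, prefix = split_bucket_uri(uri)
    # Anonymous access is assumed to be sufficient for a public bucket.
    s3 = fs.S3FileSystem(anonymous=True, region=region)
    return ds.dataset(f"{bucket}/{prefix}", filesystem=s3, format="parquet")
```

The caller never enumerates the parquet shards. The only address that needs to be documented is the root, which is why a per-shard contentUrl feels like a poor fit.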

Can a distribution element include a list-valued argument to contentUrl (i.e. for a multi-part file)?
Does science-on-schema have advice about URI construction in this context? i.e. should bucket protocol notation, like s3:// or abfs://, be used?

FWIW, I find the examples of STAC metadata documentation instructive and very practical here, e.g. the GBIF STAC JSON on Azure. Notably, this identifies only the parquet 'root' as the href, and uses the bucket URI notation. That approach seems to work well with existing tooling and workflows.
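As a rough illustration of that pattern, the sketch below builds a STAC-style asset entry whose href is the parquet root in bucket notation. It is not copied from the GBIF STAC record. The `s3://` root is inferred from the bucket listing URL quoted in this thread, the media type `application/x-parquet` is an assumption, and `parquet_root_asset` is a hypothetical helper.

```python
import json

# Assumption: root URI inferred from the public GBIF bucket listing URL.
PARQUET_ROOT = "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/"


def parquet_root_asset(href, title):
    """Return a STAC-style asset that points at the dataset root, not at shards."""
    return {
        "href": href,
        "title": title,
        "type": "application/x-parquet",  # assumed media type for parquet
        "roles": ["data"],
    }


asset = parquet_root_asset(PARQUET_ROOT, "GBIF occurrence snapshot (parquet)")
print(json.dumps({"assets": {"data": asset}}, indent=2))
```

A single href like this can be handed directly to tools that understand bucket URIs, with no need to list the individual shards in the metadata.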

mbjones commented 1 year ago

Great questions, @cboettig, and thanks for raising them. I think updating our guidance to address the issues you raise would be really helpful to many groups. @ashepherd maybe we can add this to the list of next priorities we generated at the last meeting, and discuss at the next meeting? I'll miss the next meeting while I am on vacation, but I'll put a vote in here for addressing this issue, as we are grappling with similar concerns wrt STAC metadata for collections.

fils commented 1 year ago

@cboettig interesting question....

I think I would start with schema:url for https://gbif-open-data-us-east-1.s3.us-east-1.amazonaws.com/index.html#occurrence/2023-02-01/occurrence.parquet/, as it is a more human-friendly URL.

schema:distribution would be more for a single download of the data though, which is not the case for a sharded parquet file (a directory of files).

Just a first thought would be to use something like potentialAction to point to an Action type. Once there is an Action, you can define a target, and it is easy to layer in the URL in the s3:// format.

This is from some unrelated approaches, but might be an interesting starting point.

 "potentialAction": {
    "@type": "Action",
    "name": "Use My API",
    "description": "Use the API to retrieve data from my organization.",
    "@id": "https://us-central1-top-operand-112611.cloudfunctions.net/function-1",
    "result": {
      "@type": "DataDownload",
      "encodingFormat": "text/plain",
      "description": "a simple text result for the RGB counts"
    },
    "target": {
      "@type": "EntryPoint",
      "urlTemplate": "https://us-central1-top-operand-112611.cloudfunctions.net/function-1",
      "httpMethod": "POST",
      "contentType": [
        "image/jpeg",
        "image/png"
      ]
    },
    "object": {
      "@type": "ImageObject",
      "description": "A JPEG or PNG to analyze the RGB counts"
    }
  },
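
Adapting the fragment above to the GBIF case might look roughly like the following sketch, which generates the JSON-LD from Python. This is one possible shape under stated assumptions, not established science-on-schema guidance. The `s3://` root is inferred from the bucket listing URL, the dataset name, action name and media type are placeholders, and `object_store_action` is a hypothetical helper.

```python
import json

# Human-readable listing URL quoted earlier in this thread.
LISTING_URL = (
    "https://gbif-open-data-us-east-1.s3.us-east-1.amazonaws.com/"
    "index.html#occurrence/2023-02-01/occurrence.parquet/"
)
# Assumption: bucket-notation root inferred from the listing URL above.
PARQUET_ROOT = "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/"


def object_store_action(root_uri):
    """Build a schema.org Action whose EntryPoint target is the bucket-notation root."""
    return {
        "@type": "Action",
        "name": "Open parquet dataset",  # placeholder label
        "description": "Open the dataset root with a parquet-aware client.",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": root_uri,
            "contentType": "application/x-parquet",  # assumed media type
        },
    }


dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "GBIF occurrence snapshot",  # placeholder name
    "url": LISTING_URL,
    "potentialAction": object_store_action(PARQUET_ROOT),
}
print(json.dumps(dataset, indent=2))
```

Here schema:url carries the human-friendly listing page, while the machine-oriented s3:// root lives on the Action target, so no distribution entry is needed for the individual shards.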

ESIPFed / science-on-schema.org

Describing data accessed from static endpoints (e.g. S3 object stores)? #240