DACCS-Climate / roadmap

Publishing mechanism #60

Open mishaschwartz opened 2 months ago

mishaschwartz commented 2 months ago

We would like to provide users with a mechanism for publishing a workflow or data product (hereafter, a "product") that they have created on a Marble node.

When a user wants to publish a product, it should be accompanied by metadata that helps other users search for and use the product in their own research. Another user should also be able to recreate the product by following the same steps as the original author.

Publishing requirements (note that this is not an exhaustive list; feel free to add to it):

The entire publishing mechanism should include:

Suggested steps to take for this project:

  1. Research the following if you are not already familiar:

  2. Research/compile examples of the sorts of data and workflows that users may want to publish on Marble

    • you may want to do a literature review of climate research papers that have accompanying data products
  3. Compile a list of metadata that should accompany a published product

    • describe which metadata is: always required, required depending on the product type, or optional
    • describe acceptable values for each metadata type
  4. Translate the metadata described above into one or more STAC extensions so that published products can be stored as STAC entries (see the first sketch after this list)

  5. Design the UI for users to request that their product be published

  6. Design the UI for node administrators to accept, request changes to, or reject a publication request

    • requests for changes need to be communicated to the original requestor, and there needs to be a UI for them to amend their request and re-submit it for review
  7. Implement the UI for steps 5 and 6 above

  8. Write software that, once a data product (and accompanying metadata) has been accepted for publishing:

    • makes it available through the THREDDS server (or similar, as appropriate)
      • data can/should be shared through THREDDS or GeoServer
      • workflows will probably need a different hosting mechanism; consider that workflows can be defined as Jupyter notebooks, Weaver jobs, CWL files, etc.
    • adds the metadata as an entry on the STAC API so that the new product is searchable (see the second sketch after this list)
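For step 4, here is a minimal sketch of what a published product could look like as a STAC item built with pystac. The extension schema URI and the `marble:*` field names are hypothetical placeholders for whatever comes out of step 3:

```python
from datetime import datetime, timezone

import pystac

# Hypothetical schema URI for a "Marble publishing" STAC extension (step 4).
MARBLE_EXT_URI = "https://example.org/marble/v1.0.0/schema.json"

item = pystac.Item(
    id="my-published-product",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-80, 43], [-79, 43], [-79, 44], [-80, 44], [-80, 43]]],
    },
    bbox=[-80.0, 43.0, -79.0, 44.0],
    datetime=datetime(2024, 1, 1, tzinfo=timezone.utc),
    properties={
        # Hypothetical metadata fields from step 3 (names are placeholders):
        "marble:product_type": "dataset",   # e.g. always required
        "marble:author": "jane-doe",        # e.g. always required
        "marble:license": "CC-BY-4.0",      # e.g. optional
    },
)

# Declaring the extension is what makes the namespaced fields above
# discoverable and validatable by STAC tooling.
item.stac_extensions.append(MARBLE_EXT_URI)

# item.validate() would check the item against the core and declared extension
# schemas, but only once the extension schema exists at a resolvable URI.
```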
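For step 8, a sketch of the publishing step itself: attach a THREDDS access URL as an asset and register the item on the node's STAC API. This assumes the STAC API implements the Transaction extension (POST /collections/{collection_id}/items); the URLs and helper name are placeholders:

```python
import requests

STAC_API = "https://stac.example-marble-node.org"      # placeholder
THREDDS = "https://thredds.example-marble-node.org"    # placeholder


def publish_item(item_dict: dict, collection_id: str, data_path: str) -> None:
    """Attach a data access URL and register the item on the STAC API."""
    item_dict.setdefault("assets", {})["data"] = {
        # OPeNDAP endpoint served by THREDDS for the published dataset.
        "href": f"{THREDDS}/dodsC/{data_path}",
        "type": "application/x-netcdf",
        "roles": ["data"],
    }
    resp = requests.post(
        f"{STAC_API}/collections/{collection_id}/items",
        json=item_dict,
        timeout=30,
    )
    resp.raise_for_status()
```

For workflows, the same pattern could point the asset at wherever the notebook or CWL file ends up hosted (e.g. `"type": "application/x-ipynb+json"` for a notebook) rather than at THREDDS.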

Deliverables:

Participants/Roles:

fmigneault commented 1 month ago

We are currently working on similar requirements for GeoDataCubes (GDC) in OGC Testbed-20, regarding how to perform workflow processing on multi-dimensional spatio-temporal data and how the resulting data products/collections can track and report their provenance along the processing pipeline (see also the Integrity, Provenance, and Trust (IPT) track), i.e. FAIR principles.
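As one concrete option for the provenance side, the existing STAC processing extension already defines fields such as `processing:lineage` (free-text description of the processing chain) and `processing:software` (name-to-version map). A sketch, with illustrative values only:

```python
import pystac

# Schema URI of the STAC "processing" extension (v1.1.0 at time of writing).
PROCESSING_EXT = "https://stac-extensions.github.io/processing/v1.1.0/schema.json"


def add_provenance(item: pystac.Item, lineage: str, software: dict) -> None:
    """Record how an item was produced via the processing extension fields."""
    if PROCESSING_EXT not in item.stac_extensions:
        item.stac_extensions.append(PROCESSING_EXT)
    item.properties["processing:lineage"] = lineage
    item.properties["processing:software"] = software


# Illustrative usage:
# add_provenance(item, "Bias-adjusted CMIP6 output via an xclim sdba pipeline",
#                {"xclim": "0.48.2"})
```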

A few relevant documents/issues from ongoing work items on metadata:

And below, previous searches I did for STAC extensions relevant to machine-learning metadata or notable filtering of data (see the sketch at the end of this comment):

→ Accuracy: combination of STAC extensions

→ Filtering / pre-processing:

→ Data Sample Elements (samples from DataLoader?):

Note that there are many more extensions for different use cases and data types, and the list keeps expanding: https://stac-extensions.github.io/
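To make the combinations above concrete: an item simply lists the schema URIs of every extension it uses in its `stac_extensions` array, and clients can then filter on the namespaced fields. A sketch with pystac-client, assuming the node's STAC API supports the query extension (the endpoint, collection, and field name are hypothetical):

```python
from pystac_client import Client

# Placeholder STAC API endpoint for a Marble node.
catalog = Client.open("https://stac.example-marble-node.org")

# Filter published products on a namespaced extension field
# (field name is illustrative; requires the API's "query" extension).
search = catalog.search(
    collections=["published-products"],
    query={"marble:product_type": {"eq": "workflow"}},
)
for item in search.items():
    print(item.id, item.properties.get("marble:product_type"))
```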