ESA-EarthCODE / open-science-catalog-metadata

https://esa-earthcode.github.io/open-science-catalog-metadata/

Review latest workflow schemas for EOEPCA+ #233

Open GarinSmith opened 1 month ago

GarinSmith commented 1 month ago

EOEPCA+ will look to define a schema for 1) Reproducible job details (Workflow Metadata) 2) Replicable workflow (Experiment Metadata)

See - System level https://github.com/orgs/EOEPCA/projects/4/views/13?sliceBy%5Bvalue%5D=Resource+Discovery&pane=issue&itemId=60227850

See - BB level https://github.com/orgs/EOEPCA/projects/7/views/1?filterQuery=workflow&pane=issue&itemId=69113579

Garin to discuss with @rconway and Angelos, whose GitHub ID I cannot find yet. Garin also to confirm that the EOEPCA+ Catalogue can ingest and discover OGC API Records. We believe that we will need to write a new front end in the portal to find and access OGC API Records in the same way we do for STAC.

pycsw supports OGC API - Records - Part 1: Core, version 1.0 by default. See https://docs.pycsw.org/en/latest/oarec-support.html for details. Angelos has confirmed that OSC currently supports ingesting and discovering OGC API Records.

Angelos noted that "Open Science Catalog is 2 versions behind, several fixes and new features have been implemented in the last 6 months or so".

Angelos is off from 9 Aug to 9 Sep, but I will meet him tomorrow for his advice on how EOEPCA+ should define a schema for 1) Reproducible job details (Workflow Metadata) and 2) Replicable workflow (Experiment Metadata).

Please also refer to https://github.com/orgs/ESA-EarthCODE/projects/5/views/8?pane=issue&itemId=72092886

GarinSmith commented 1 month ago

After an initial review with Angelos, we agreed that we should use OGC API Records to

This is important because it means (as Richard suggested)

E.g. openEO

https://github.com/ESA-APEx/apex_algorithms/blob/main/algorithm_catalog/worldcereal_inference.json

Link
"rel": "openeo-process"
"rel": "git"
"rel": "service"
"rel": "license"

Additional Metadata
"properties": {
  "created": "2024-05-17T00:00:00Z",
  "updated": "2024-05-17T00:00:00Z",
  "type": "apex_algorithm",
  "title": "ESA worldcereal global maize detector",
  "description": "A maize detection algorithm.",
  "cost_estimate": 0.1,
  "cost_unit": "platform credits per km\u00b2",
  etc

Open Science Catalog

catalog.osc.earthcode.eox.at/collections/metadata:main/items/HCA_L2E_CS_LTA__SIR1SAR_FR_20150331T150158_20150331T150200_D001?f=json

"links"

Additional Metadata
"properties": {
  "title": "HCA_L2E_CS_LTA__SIR1SAR_FR_20150331T150158_20150331T150200_D001",
  "description": "HYDROCOASTAL Final Product: ........
  "datetime": "2023-02-10T08:45:21.061533Z",
  "start_datetime": "2015-03-31T15:02:32.858513+00:00",
  "end_datetime": "2015-03-31T15:02:34.864426+00:00",
  "created": "2023-02-10T08:45:21.061533+00:00"
}

Note that the above example provided by Angelos seems to refer to a WMS service and not a CWL file. This can be clarified when Angelos returns.
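To illustrate how a client might use the link relations listed above, here is a minimal Python sketch that picks a link out of a record by its "rel" value. The record `id`, the `find_link` helper and all `href` values are placeholders made up for illustration; only the "rel" names and the two property values are taken from the APEx example.

```python
from typing import Optional


def find_link(record: dict, rel: str) -> Optional[dict]:
    """Return the first link whose "rel" matches, or None if there is none."""
    for link in record.get("links", []):
        if link.get("rel") == rel:
            return link
    return None


# Illustrative record only: the id and hrefs are placeholders,
# not values from the APEx algorithm catalogue.
record = {
    "id": "example-algorithm",
    "properties": {
        "type": "apex_algorithm",
        "title": "ESA worldcereal global maize detector",
    },
    "links": [
        {"rel": "openeo-process", "href": "https://example.org/process.json"},
        {"rel": "git", "href": "https://example.org/repo"},
        {"rel": "license", "href": "https://example.org/license"},
    ],
}

workflow_link = find_link(record, "openeo-process")
if workflow_link is not None:
    print(workflow_link["href"])
```

The same lookup would work for a CWL-based workflow, provided the platforms agree on which "rel" value identifies the workflow definition.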

Next Steps
1) Review different examples.
2) Angelos to make a clearer suggestion for a standard schema on his return.
3) Get feedback from the platforms once we know which they are.

GarinSmith commented 3 weeks ago

It is important that this approach also supports the types of workflows identified by @edobrowolska in https://github.com/ESA-EarthCODE/portal/issues/17

Ewelina identified a number of scripts that can be considered workflows. E.g. https://github.com/diarmuidcorr/Lake-Channel-Identifier/blob/v1.0/Landsat-8%20SGL%20and%20Channel%20Classifier (Python script) https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh (shell script)

These scripts might be regarded as unstructured workflows, perhaps like a Jupyter Notebook. At some point they may be converted to a more formal workflow, for instance a CWL file (OGC API Processes). However, there is no reason why these unstructured scripts cannot be used and supported by EarthCODE using the above approach.

E.g.

{
  "rel": "git",
  "type": "application/json",
  "title": "Git source repository",
  "href": "https://github.com/diarmuidcorr/Lake-Channel-Identifier/blob/v1.0/Landsat-8%20SGL%20and%20Channel%20Classifier"
},

or

{
  "rel": "git",
  "type": "application/json",
  "title": "Git source repository",
  "href": "https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh"
},

The above syntax may not be 100% correct, but hopefully, it demonstrates what is possible with OGC API Records.
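As a rough sketch of how one of these links could sit inside a complete record, the Python below wraps a script URL in a minimal GeoJSON-style record. This assumes the record needs `id`, `type`, `geometry`, `properties` and `links` members, which should be checked against recordGeoJSON.yaml. The `script_workflow_record` helper, the record id and the `"workflow"` value for the properties `type` are hypothetical, not an agreed EarthCODE schema. The link object is copied as written above, including its media type.

```python
import json


def script_workflow_record(record_id: str, title: str, script_url: str) -> dict:
    """Build a minimal record that points at a script via a "git" link."""
    return {
        "id": record_id,
        "type": "Feature",
        "geometry": None,  # assumption: no spatial footprint is known
        "properties": {
            "type": "workflow",  # hypothetical resource type
            "title": title,
        },
        "links": [
            {
                "rel": "git",
                "type": "application/json",
                "title": "Git source repository",
                "href": script_url,
            }
        ],
    }


record = script_workflow_record(
    "sice-s3-wrapper",  # hypothetical identifier
    "SICE S3 wrapper script",  # hypothetical title
    "https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh",
)
print(json.dumps(record, indent=2))
```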

It may be that i) Angelos can recommend an elaborated standard approach, ii) one or all of the chosen EarthCODE contractors can suggest a suitable schema that already works with the existing platforms, or iii) a combination of the above works.

GarinSmith commented 2 weeks ago

We need to confirm the schema that will be used for validation of OGC API Records. See https://github.com/opengeospatial/ogcapi-records/tree/master/core/openapi/schemas E.g. https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordJSON.yaml or https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml

Hopefully we can test some of the above examples using the correct schema.

GarinSmith commented 1 week ago

I just reviewed this approach with EOEPCA+ and will link them to this user story to help clarify our requirements. See https://github.com/EOEPCA/resource-discovery/issues/56

I have asked EOEPCA+ for guidance on how to validate schema compliance. This seems quite complicated. E.g. see
https://json-schema.org/implementations#validators-web-(online) or https://json-schema.org/implementations#command-line

The online validators do not seem to cope with $ref references, and the OGC API Records schemas contain a lot of them.

I have looked at command-line solutions like Polyglottal JSON Schema Validator (pajv), and these seem to struggle too. E.g. pajv validate -s recordGeoJSON.yaml -d record.json -r recordCommonProperties.yaml -r time.yaml -r linkBase.yaml -r linkTemplate.yaml (this does not seem to work yet)

For above I used (schemas) https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordJSON.yaml https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml and (records) https://github.com/opengeospatial/ogcapi-records/blob/master/core/examples/json/record.json
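One possible workaround for the $ref problem is to bundle the schemas into a single document before handing it to a validator. The Python sketch below is only an illustration of that idea, not something EOEPCA+ has proposed: `inline_refs` is a hypothetical helper, and the schema names in the demo are stand-ins rather than the real OGC files. It assumes each external $ref is a plain file name, that keys sitting next to a $ref can be dropped, and that the YAML files have already been parsed into dicts. Refs with fragments (file.yaml#/...) and nested directories are not handled.

```python
import json
from typing import Any, Callable


def inline_refs(node: Any, load: Callable[[str], dict], _stack: tuple = ()) -> Any:
    """Return a copy of `node` with file-level $ref entries replaced by their target."""
    if isinstance(node, list):
        return [inline_refs(item, load, _stack) for item in node]
    if not isinstance(node, dict):
        return node
    ref = node.get("$ref")
    # Local refs such as "#/definitions/x" are left untouched.
    if isinstance(ref, str) and not ref.startswith("#"):
        if ref in _stack:
            raise ValueError(f"circular $ref: {ref}")
        # Replace the whole object by the referenced schema (siblings are dropped).
        return inline_refs(load(ref), load, _stack + (ref,))
    return {key: inline_refs(value, load, _stack) for key, value in node.items()}


# Stand-in schemas, already parsed; in practice `load` would read and parse a file.
schemas = {
    "record.yaml": {
        "type": "object",
        "properties": {
            "time": {"$ref": "time.yaml"},
            "links": {"type": "array", "items": {"$ref": "link.yaml"}},
        },
    },
    "time.yaml": {"type": "object"},
    "link.yaml": {"type": "object", "required": ["href"]},
}

bundled = inline_refs(schemas["record.yaml"], schemas.__getitem__)
print(json.dumps(bundled, indent=2))
```

The bundled output contains no external $ref entries, so it could be pasted into an online validator. Whether the result validates the OGC example record still needs to be tested.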

GarinSmith commented 1 week ago

I also got some useful strategic possibilities from EOEPCA.

E.g.
from Jonas Sølvsteen (potential platform integration with UK EO Data Hub):
https://github.com/os-climate/hazard/blob/main/hazard_workflow.cwl (this is meant to run on, among others, the UK EO Data Hub https://eodatahub.org.uk/)
https://d1fzab3z0mlfhy.cloudfront.net/
https://radiantearth.github.io/stac-browser/#/external/pgstac.demo.cloudferro.com/

from Gérald FENOY:
https://ospd-02.geolabs.fr/examples/cwls/algae-usecase-workflow-copernicus.cwl
https://ospd-02.geolabs.fr/examples/app-package.cwl