EOEPCA / resource-discovery

Resource Discovery
https://eoepca.readthedocs.io/projects/resource-discovery/en/latest/
Apache License 2.0

Support workflow resource type in Resource Catalogue #56

Closed kalxas closed 5 days ago

jonas-eberle commented 1 month ago

@kalxas What is meant by workflow? Do we want to specify this further (e.g., CWL, OpenEO process graph)?

j08lue commented 1 month ago

Related: I have trawled the pycsw and pygeoapi repos for sample resources of various type - no workflows there, though, afaics:

j08lue commented 1 month ago

@GarinSmith to link to related EarthCODE story, pls.

GarinSmith commented 1 month ago

Hi @j08lue, the reference you need is here: https://github.com/orgs/ESA-EarthCODE/projects/5/views/1?pane=issue&itemId=72091040 I may need to give you access to this, but there is a summary below (which will save you time reading the link above).

In summary:

After an initial review with Angelos. We agreed that we should use OGC API Records to

This is important because it means (as Richard suggested)

We would like a formal way of validating a schema. Can you please suggest something?

E.g. we would like EOEPCA+ guidance on how to validate schema compliance? This seems quite complicated. E.g. see
https://json-schema.org/implementations#validators-web-(online) or https://json-schema.org/implementations#command-line

The online validators do not seem to cope with $ref instances, and there seem to be lots of $refs in OGC API Records.

I have looked at command-line solutions like Polyglottal JSON Schema Validator (pajv), and these seem to struggle too. E.g.
pajv validate -s recordGeoJSON.yaml -d record.json -r recordCommonProperties.yaml -r time.yaml -r linkBase.yaml -r linkTemplate.yaml
(this does not seem to work yet)

Could you provide a working example/solution to validate a valid OGC API Record? We could then use this approach in EarthCODE using the above strategy.

For the above I used (schemas)
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordJSON.yaml
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml
and (records)
https://github.com/opengeospatial/ogcapi-records/blob/master/core/examples/json/record.json
I am not sure about the compatibility of the above, but I just wanted to get a schema validation test.

j08lue commented 1 month ago

Sure thing. @kalxas, let us discuss how much Records validation should happen at the API vs UI level.

GarinSmith commented 4 weeks ago

Thanks. We would like to know first a reliable way to perform this validation, so that: i) We can validate OGC API Records on the EarthCODE Catalog, before we publish them from a Platform. ii) We can validate OGC API Records on a Platform, before we try to publish them to the EarthCODE Catalog. This will help avoid operational issues by performing validation in suitable places along the operational pipeline.

kalxas commented 1 week ago

This would mean validating against, directly: https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml .

The problem here is that the schema is in YAML, and tools like Python check-jsonschema do not do well with JSON Schema via YAML, especially when there are $ref’s involved.

OGC typically pushes out the YAML schemas onto http://schemas.opengis.net/. We need JSON schemas.

kalxas commented 1 week ago

@jonas-eberle @j08lue any type of workflow could be represented with a metadata record. The goal of this task is to define a record schema with extra properties to describe metadata about a workflow

kalxas commented 1 week ago

@GarinSmith I got feedback from @tomkralidis: WMO have defined their own schemas based on OGC API Records. See https://github.com/wmo-im/wcmp2/tree/main/schemas and https://schemas.wmo.int/wcmp/2.0.0/schemas/

GarinSmith commented 1 week ago

@kalxas Thanks. That helps. I don't care about the format (JSON or YAML) as long as it validates against a specific OGC API Records implementation that we can use for a workflow or experiment. I had the same problem above using YAML and $ref.

It is very helpful to see the spec referenced here too https://schemas.wmo.int/wcmp/2.0.0/standard/wcmp-2.0.0.pdf

I got this to work using
check-jsonschema --schemafile wcmp2-bundled.json de-dwd.surface-weather-observations-realtime.json
Note that the example provided in the above document uses
check-jsonschema --schemafile schemas/wcmp2-bundled.json examples/msc-swob-realtime.json
although I cannot find msc-swob-realtime.json in examples, but never mind.

I tried it with https://github.com/opengeospatial/ogcapi-records/blob/master/core/examples/json/record.json and got
check-jsonschema --schemafile wcmp2-bundled.json record.json
  record.json::$.properties.contacts[0]: 'organization' is a required property
However, this was easy to fix and seems reasonable.

I also tried it with an example openEO implementation https://github.com/ESA-APEx/apex_algorithms/blob/main/algorithm_catalog/worldcereal_inference.json and got
check-jsonschema --schemafile wcmp2-bundled.json worldcereal_inference.json
Schema validation errors were encountered.
  worldcereal_inference.json::$: 'time' is a required property
  worldcereal_inference.json::$: 'geometry' is a required property
  worldcereal_inference.json::$.properties.contacts[1]: 'organization' is a required property

I am wondering why a workflow would require i) A time ii) A geometry
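
The validator output quoted above follows a "file::path: message" shape, which makes it easy to triage errors in a script. The sketch below is only an illustration: the line format is taken from the output pasted in this thread and is not a documented, stable interface of check-jsonschema, and the names ValidationIssue and parse_issue are hypothetical.

```python
from typing import NamedTuple


class ValidationIssue(NamedTuple):
    """One error line, split into its parts (hypothetical helper type)."""
    filename: str
    json_path: str
    message: str


def parse_issue(line: str) -> ValidationIssue:
    """Split one 'file::path: message' line as seen in this thread."""
    filename, _, rest = line.strip().partition("::")
    json_path, _, message = rest.partition(": ")
    return ValidationIssue(filename, json_path, message)


# Error lines copied from the comment above.
report = [
    "worldcereal_inference.json::$: 'time' is a required property",
    "worldcereal_inference.json::$: 'geometry' is a required property",
    "worldcereal_inference.json::$.properties.contacts[1]: 'organization' is a required property",
]

issues = [parse_issue(line) for line in report]

# Properties reported as missing at the top level of the record ("$").
missing_top_level = [i.message.split("'")[1] for i in issues if i.json_path == "$"]
print(missing_top_level)
```

Grouping errors by path in this way separates the record-level problems (time, geometry) from the nested ones (contacts).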

tomkralidis commented 1 week ago

Note that OGC API - Records allows for time and geometry to be encoded as null. This could be used as part of describing any resource without spatial or temporal properties, while keeping broad interoperability given use of OGC API - Records and GeoJSON.

GarinSmith commented 1 week ago

Thanks.
I added "time": null, "geometry": null and it worked as you say. I think this is a good starting point for EarthCODE.
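
A short sketch of why adding the nulls helps: the JSON Schema "required" keyword only checks that a key is present, so a key whose value is null still counts. The record content below is a hypothetical stand-in (not the real worldcereal_inference.json), and REQUIRED_TOP_LEVEL is assumed from the two top-level errors reported earlier in this thread, not read from the schema itself.

```python
import json

# Assumption: the two top-level properties the validator complained about.
REQUIRED_TOP_LEVEL = ("time", "geometry")


def missing_required(record: dict) -> list:
    """Return required keys that are absent; a null value still counts as present."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in record]


# Hypothetical minimal records, for illustration only.
without_nulls = json.loads(
    '{"type": "Feature", "properties": {"title": "Example workflow"}}'
)
with_nulls = json.loads(
    '{"type": "Feature", "time": null, "geometry": null,'
    ' "properties": {"title": "Example workflow"}}'
)

print(missing_required(without_nulls))  # both keys are absent
print(missing_required(with_nulls))     # nothing is missing
```

Whether null is an accepted value for each property is decided by that property's own schema; the sketch only covers the presence check.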

@kalxas , hopefully EOEPCA+ Catalog (or pycsw) will ingest in this format? I think I tried this before with STAC and I could not ingest. Hopefully this will not be an issue for OGC API Records with the latest version of pycsw. I believe "pycsw supports OGC API - Records - Part 1: Core, version 1.0 by default."

kalxas commented 1 week ago

@GarinSmith pycsw can ingest both OGC API Record and STAC, it has been demonstrated in various EOEPCA demos.

We need to define/extend the record to describe the workflows

GarinSmith commented 1 week ago

@kalxas , great thanks. That is very good to know.

Can we start by "defining" and using the current spec, so we can flexibly reference the various different workflow types that can be described externally? This seems like a separation of concerns we need. We also need to try and start off by using what we already have if possible.
E.g. using something like the following?

OpenEO

"links": [ { "rel": "openeo-process", "type": "application/json", "title": "openEO Process Title", "href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/max_ndvi_composite/openeo_udp/examples/max_ndvi_composite/max_ndvi_composite.json" } ]

OGC API Processes

"links": [ { "rel": "ogcapi-process", "type": "application/json", "title": "OGC API Process Title", "href": "https://owncloud.spaceapplications.com/owncloud/index.php/s/iCk60Kmry77o2l6/download" } ]

Python Processes

"links": [ { "rel": "python-process", "type": "application/json", "title": "Python Process Title", "href": "https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh" } ]

Jupyter Notebook Processes

"links": [ { "rel": "jupyter-notebook", "type": "application/json", "title": "Python Process Title", "href": "https://github.com/...../..../file.ipynb" } ]
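
To illustrate how a platform might consume links like the ones above, here is a minimal sketch that picks out the workflow links of a record by their rel value. The rel values are the ones proposed in this comment and are not registered link relations; WORKFLOW_RELS, workflow_links and the example.org "self" link are hypothetical.

```python
# Assumption: rel values as proposed in this comment, mapped to a display label.
WORKFLOW_RELS = {
    "openeo-process": "openEO",
    "ogcapi-process": "OGC API Processes",
    "python-process": "Python",
    "jupyter-notebook": "Jupyter Notebook",
}


def workflow_links(record: dict) -> list:
    """Return (workflow kind, href) pairs for links with a known workflow rel."""
    return [
        (WORKFLOW_RELS[link["rel"]], link["href"])
        for link in record.get("links", [])
        if link.get("rel") in WORKFLOW_RELS
    ]


# Hypothetical record carrying one ordinary link and one workflow link.
record = {
    "links": [
        {"rel": "self", "href": "https://example.org/records/1"},
        {
            "rel": "openeo-process",
            "type": "application/json",
            "title": "openEO Process Title",
            "href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/max_ndvi_composite/openeo_udp/examples/max_ndvi_composite/max_ndvi_composite.json",
        },
    ]
}

for kind, href in workflow_links(record):
    print(kind, href)
```

A lookup like this lets a platform skip records whose workflow kind it cannot execute, which is the use case described here.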

kalxas commented 1 week ago

IANA defines the link relations: https://www.iana.org/assignments/link-relations/link-relations.xhtml

Also see https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/rel which mentions:

The current registries for the possible values of the rel attribute are the IANA link relation registry, the HTML Living Standard, and the freely-editable existing-rel-values page in the microformats wiki, as suggested by the Living Standard. If a rel attribute not present in one of the three sources above is used some HTML validators (such as the W3C Markup Validation Service) will generate a warning.

kalxas commented 1 week ago

My plan is to draft some initial proposal for the next demo.

GarinSmith commented 1 week ago

Thanks @kalxas,

I saw a reference to IANA before, but could not find the links above. My current thoughts are that OGC API Records seems to provide most of what we currently seem to need.

However, 1) It would be useful to know what a workflow type is somehow. E.g. openeo, ogc api processes, JNB, free format Python and so on. 2) It would be useful to know if a process is a Workflow (generic), Experiment (specific) or Dashboard (GUI to Workflow).
I am not sure about the best way to achieve this and would welcome your guidance.

I note the IANA links above do not seem to be interested in things like Workflows or Processes or Process Types. However, this is important to us, because we need to know the type of link we are looking at, so that we know better what platform can handle that type of link.

I note that the examples above do successfully validate when I use the check-jsonschema tool. They also correspond with the approach some platforms already use, so they are a useful starting point to move forwards from.
At this stage I will suggest that by default all EarthCODE platforms use the check-jsonschema tool for schema validation of OGC API Records when appropriate. Again this seems like a very good starting point.

kalxas commented 5 days ago

I have created a new repository that will host the metadata schema for EOEPCA profile(s): https://github.com/EOEPCA/metadata-profile/

The resource schema was initialized with the OGC API Records schema: https://github.com/EOEPCA/metadata-profile/blob/master/schemas/resource.yaml#L7

An enumeration is provided for the resource type which can be further expanded to support various types (as required above). In my opinion the workflow type has to be defined at the resource/record level, not at the link level: https://github.com/EOEPCA/metadata-profile/blob/master/schemas/resource.yaml#L9-L15
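
A minimal sketch of what a record-level type check amounts to, assuming the enumeration values that the validator reports later in this thread ('dataset', 'service', 'process', 'workflow') and the path properties.type. The function name check_resource_type is hypothetical, and the authoritative list is whatever resource.yaml defines at the time.

```python
# Assumption: enumeration values as reported by the validator in this thread.
RESOURCE_TYPES = ("dataset", "service", "process", "workflow")


def check_resource_type(record: dict):
    """Return None if properties.type is allowed, otherwise an error message."""
    value = record.get("properties", {}).get("type")
    if value in RESOURCE_TYPES:
        return None
    return f"{value!r} is not one of {list(RESOURCE_TYPES)}"


print(check_resource_type({"properties": {"type": "workflow"}}))
print(check_resource_type({"properties": {"type": "apex_algorithm"}}))
```

Keeping the type on the record means a client can filter a catalogue by resource type without inspecting every link.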

From that initial resource definition, I have created a JSON Schema bundle as described in WMO by @tomkralidis https://github.com/EOEPCA/metadata-profile/tree/master/schemas
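
For readers unfamiliar with the idea of a bundle: the point is to replace each $ref to another file with the referenced content, so that a validator sees one self-contained schema. The sketch below shows the idea on tiny in-memory stand-ins. It is not the script used by WMO or EOEPCA; the document names and contents are hypothetical, and it ignores JSON Pointer fragments, remote references and recursive schemas.

```python
import json


def bundle(schema, documents):
    """Inline every {"$ref": "<name>"} whose target is in `documents`.

    Simplification: the whole referenced document replaces the $ref object,
    and circular references are not handled.
    """
    if isinstance(schema, dict):
        ref = schema.get("$ref")
        if isinstance(ref, str) and ref in documents:
            return bundle(documents[ref], documents)
        return {key: bundle(value, documents) for key, value in schema.items()}
    if isinstance(schema, list):
        return [bundle(item, documents) for item in schema]
    return schema


# Hypothetical, heavily simplified stand-ins for a multi-file schema.
documents = {
    "time.json": {"type": ["object", "null"]},
    "link.json": {"type": "object", "required": ["href"]},
}
root = {
    "type": "object",
    "required": ["time", "links"],
    "properties": {
        "time": {"$ref": "time.json"},
        "links": {"type": "array", "items": {"$ref": "link.json"}},
    },
}

bundled = bundle(root, documents)
print(json.dumps(bundled, indent=2))
```

The result contains no $ref entries, which is what makes a bundled file usable with tools that struggle to resolve references across YAML files.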

Validation process described here: https://github.com/EOEPCA/metadata-profile/tree/master/schemas#validating-an-emp-record

GarinSmith commented 2 days ago

Thanks @kalxas,

It might help to test this against a typical scenario that EarthCODE might want to use. I can validate the current openeo-process example attached (worldcereal_inference2.json). E.g.
check-jsonschema --schemafile wcmp2-bundled.json worldcereal_inference2.json
ok -- validation done

Could you please update it to include a typical "EOEPCA resource type", for instance a "workflow"? This could be useful when applying FAIR principles (Find, Access etc).

However, how will we know what type of workflow we are dealing with (openeo-process in this case)? E.g. openeo-process, ogcapi-process, JNB etc.

Does the type of "workflow" have to go at the link level if there is more than one link? I think it helps to refer to a real world EarthCODE example that we might one day use.

Many thanks

Garin

GarinSmith commented 2 days ago

worldcereal_inference2.json

GarinSmith commented 2 days ago

Hi @kalxas and @rconway, I totally agree with the point Angelos just made in the update. We just need a starting point that we can use to ingest and then evolve further (in fact I already had this using the previous schema provided by Tom). I need to get this in time for EarthCODE when we start work very shortly. Having a first version from Angelos that also validates with an EarthCODE potential example would be great.

Angelos can you please tweak worldcereal_inference2.json above, so that it validates against your latest schema? That will be a great starting point. I can only get this to work partially and I had to guess some values to fix one validation issue.

It would help if there was clear meaning to the following EOEPCA resource types that map to the EarthCODE utilisation domain. I think they are OK, but here are my assumptions.

Note that EarthCODE has the concept of Workflow, Experiment, Application and Product. We need to map to these somehow, hence my comment above.

kalxas commented 1 day ago

Thank you @GarinSmith I will look at the provided record and try to make it validate. The schema provided is just a first draft, we will need to expand, so I would not provide it yet to EarthCODE for production purposes.

kalxas commented 1 day ago

https://github.com/ESA-EarthCODE/open-science-catalog-metadata/issues/233#issuecomment-2387717224

kalxas commented 1 day ago

There is a bug in the schema provided, will work to fix it

kalxas commented 21 hours ago

Schema updated: https://github.com/EOEPCA/metadata-profile/commit/24d2755d1eed34e1e5441cd1b22576adc72a37ea

kalxas commented 21 hours ago

check-jsonschema --schemafile resource.json worldcereal_inference2.json
Schema validation errors were encountered.
  worldcereal_inference2.json::$: 'geometry' is a required property
  worldcereal_inference2.json::$.properties.formats[0]: 'GeoTiff' is not of type 'object'
  worldcereal_inference2.json::$.properties.type: 'apex_algorithm' is not one of ['dataset', 'service', 'process', 'workflow']

kalxas commented 20 hours ago

@GarinSmith this is the example that validates:

worldcereal_inference2.json