Extract software metadata from the web (service endpoints and/or webpages)

proycon commented 2 years ago

The harvesting pipeline currently being implemented (#33) is set up in such a way that the source code is always the most authoritative place for holding software metadata descriptions.

However, there is a distinction between the software source code and service instances of that software, and the latter may add some metadata that is not applicable to the source as such. Instances are hosted on a particular URL and may have particular access limitations. We want to make that distinction explicit.

In the tool source registry for the harvester, we therefore provide the link to the source code alongside the web endpoints. The harvester first queries the source code repositories and converts the metadata found there to a schema.org/codemeta description of type SoftwareSourceCode. It then queries the web endpoints and enriches the metadata in the way proposed in codemeta/codemeta#271.

How can websites and webservices provide metadata? I want to support the following for the harvester pipeline:

[x] Support inline schema.org metadata in a <script type="application/ld+json"> block, with @type any subclass of schema:SoftwareApplication or any of the other ones proposed in codemeta/codemeta#271, including schema:WebAPI and schema:WebPage.
- This is the most explicit form to provide metadata and the only one that ensures that all metadata ends up in the harvested end-product.
- See also https://developers.google.com/search/docs/advanced/structured-data/sd-policies
- Support for microdata will be deferred to a later stage (https://schema.org/docs/gs.html)
[x] Support for webservices providing an OpenAPI specification (in JSON or YAML): parse and convert at least the "Info" block to codemeta.
[x] Support for the fallback option: parse certain meta tags in the HTML head
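
As a rough illustration of the first and last options, here is a minimal sketch using only the Python standard library. It is not the harvester's actual code: the names `MetadataParser`, `extract_metadata` and `ACCEPTED_TYPES`, the short list of accepted types, and the choice of `description` and `keywords` as the fallback meta tags are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser

# Assumption: a small illustrative set; the real harvester may accept more types.
ACCEPTED_TYPES = {"SoftwareApplication", "WebApplication", "WebAPI", "WebPage"}


class MetadataParser(HTMLParser):
    """Collects JSON-LD script blocks and <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.jsonld = []   # successfully parsed JSON-LD objects
        self.meta = {}     # meta name -> content
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"]] = attrs["content"]

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.jsonld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed blocks rather than failing the harvest


def extract_metadata(html):
    """Prefer inline JSON-LD; fall back to selected meta tags; else None."""
    parser = MetadataParser()
    parser.feed(html)
    for obj in parser.jsonld:
        if isinstance(obj, dict):
            # Tolerate a "schema:" prefix on the type name.
            type_name = str(obj.get("@type", "")).split(":")[-1]
            if type_name in ACCEPTED_TYPES:
                return obj
    # Fallback: the choice of these two meta tags is an assumption.
    fallback = {k: parser.meta[k] for k in ("description", "keywords") if k in parser.meta}
    if fallback:
        return {"@type": "WebPage", **fallback}
    return None
```

The fallback deliberately yields much less than the JSON-LD path, which matches the point above that only the inline block ensures all metadata ends up in the harvested result.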

proycon commented 2 years ago

It may be worth identifying whether there are already CLARIAH services and websites that make their tool metadata available in other ways that may be harvestable (i.e. published by the web endpoint itself, not some other higher-order registry). An important example currently is CLAM, which is widely used for WP3 webservices and outputs metadata in its own XML format; I will make it output an OpenAPI Info block as well (proycon/clam#32).
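
To make the OpenAPI route a bit more concrete, here is a minimal sketch of converting the Info block of an already parsed OpenAPI document. The function name `openapi_info_to_codemeta` and the target property names are assumptions chosen for illustration, not an agreed mapping. The input is a plain dict, so a YAML specification would first need to be loaded with a YAML library.

```python
import json


def openapi_info_to_codemeta(spec, endpoint_url):
    """Map the OpenAPI "info" object onto a small schema.org-style description.

    The mapping below (title -> name, and so on) is illustrative only.
    """
    info = spec.get("info", {})
    result = {
        "@context": "https://schema.org",
        "@type": "WebAPI",
        "url": endpoint_url,
    }
    mapping = (
        ("title", "name"),
        ("description", "description"),
        ("version", "version"),
        ("termsOfService", "termsOfService"),
    )
    for source_key, target_key in mapping:
        if source_key in info:
            result[target_key] = info[source_key]
    return result


# Hypothetical example document and endpoint URL.
example_spec = json.loads(
    '{"openapi": "3.0.3", "info": {"title": "Demo API", "version": "1.2.0"}}'
)
example = openapi_info_to_codemeta(example_spec, "https://example.org/api")
```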

Please comment if you know which metadata descriptions CLARIAH partners are currently using.

ddeboer commented 2 years ago

Should the type of service instance be documented with the software and/or be derived from the service definition as it is retrieved over HTTP by the harvester? Example: the fact that software x has an OpenAPI endpoint available at URL y and a SPARQL endpoint at URL z.

proycon commented 2 years ago

I am indeed hoping that the type of the service can be automatically extracted, and once extracted I want to represent these webservices using the pending WebAPI proposal (schemaorg/schemaorg#2635, schemaorg/schemaorg#1423). The type of instance would fit their conformsTo property. This will be fairly minimal though. I think that's an important limit to our 'tool discovery' scope: we will merely link to these existing API specifications, not try to redo or reinvent them or convert all aspects of them. Anybody wanting to actually interface with the service (input parameters, output types, return codes, etc.) needs to dig deeper and parse the linked specification themselves.
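
A minimal sketch of what that automatic extraction could look like for OpenAPI documents, keeping the result deliberately small. The helper names `detect_spec_type` and `describe_webapi` are hypothetical, and the identifier URLs used for conformsTo are my own illustrative choice, not something the proposal prescribes. Detection relies on the top-level `openapi` field of OpenAPI 3.x documents and the `swagger: "2.0"` field of Swagger 2.0 documents.

```python
def detect_spec_type(spec):
    """Return an identifier for the specification a parsed document follows, or None."""
    if isinstance(spec.get("openapi"), str):
        # Assumption: use the versioned OpenAPI specification URL as identifier.
        return "https://spec.openapis.org/oas/v" + spec["openapi"]
    if spec.get("swagger") == "2.0":
        # Assumption: illustrative identifier for Swagger 2.0.
        return "https://swagger.io/specification/v2/"
    return None


def describe_webapi(endpoint_url, spec_url, spec):
    """Build a minimal WebAPI entry that only links to the existing specification."""
    entry = {
        "@type": "WebAPI",
        "url": endpoint_url,
        "documentation": spec_url,  # link only; the specification is not converted
    }
    conforms_to = detect_spec_type(spec)
    if conforms_to is not None:
        entry["conformsTo"] = conforms_to
    return entry
```

Nothing beyond the link and the specification type is copied over, in line with the scope limit described above.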

I must also add that describing web services is still relatively low on the priority list. Describing the schema:WebApplication (i.e. a web interface for human end-users) has higher priority.

From the perspective of the harvester and the metadata it produces, I see the source code metadata as the primary representation. This schema:SoftwareSourceCode will be linked to service instances (e.g. a schema:WebApplication, a schema:WebAPI or even a schema:WebPage) via the schema:targetProduct property (https://github.com/codemeta/codemeta/issues/271). As I envision it now, the tool store API (#34) will serve a whole bunch of JSON files (and also have a SPARQL endpoint), one per tool, each representing a software source code that links to all service instances (bottom up). I hope this makes some sense :)
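
To illustrate the shape of such a per-tool file, here is a small sketch. The function name `link_instances`, the tool name and all example.org URLs are made up for the example; only the use of SoftwareSourceCode and targetProduct follows what is described above.

```python
import json


def link_instances(source_metadata, instances):
    """Attach service instances to the source code description via targetProduct."""
    linked = dict(source_metadata)  # copy so the harvested input is left untouched
    linked["@type"] = "SoftwareSourceCode"
    linked["targetProduct"] = list(instances)
    return linked


# Hypothetical tool with one web application and one web API instance.
tool = link_instances(
    {
        "@context": "https://schema.org",
        "name": "exampletool",
        "codeRepository": "https://example.org/git/exampletool",
    },
    [
        {"@type": "WebApplication", "url": "https://example.org/exampletool/"},
        {"@type": "WebAPI", "url": "https://example.org/exampletool/api"},
    ],
)

# One JSON document per tool, as the tool store API would serve it.
document = json.dumps(tool, indent=2)
```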

CLARIAH / clariah-plus

Extract software metadata from the web (service endpoints and/or webpages) #92