fils / CDFRegistryWG

EarthCube CDF Registry Working Group on self hosted facility metadata via HTML5 microdata

EarthCube CDF Registry Working Group

TL;DR

The work of the Registry Working Group can be summed up quickly: use existing vocabularies such as schema.org and re3data terms to expose facility metadata using web architecture patterns. Leverage HTML5 microdata publishing, JSON-LD, and standard web architecture (hypermedia) to both expose and collect metadata.

About

The EarthCube Council of Data Facilities (CDF) formed the Registry Working Group to review alignment of existing approaches to research facility description and discovery. The involved parties include the EarthCube CDF, Coalition for Publishing Data in the Earth and Space Sciences (COPDESS) and the Registry of Research Data Repositories (re3data).

Documents

Repository structure

Simple Scenario

  1. A facility has metadata about itself as well as links to service description documents such as Swagger, OGC, or THREDDS.
  2. These are assembled into a JSON-LD document following schema.org patterns, with possible use of external vocabularies. The document is then placed into the facility landing page (or other designated page) via
    <script type="application/ld+json">
  3. Items that cannot be defined by schema.org can then be defined via an external vocabulary.
  4. The whitelist of sites/URLs is fed through something like https://github.com/fils/contextBuilder or DataONE tools. This example code will look for the schema.org JSON-LD packages defined in item 2. More advanced crawling solutions might use tools like https://github.com/anaskhan96/soup or https://github.com/PuerkitoBio/fetchbot.
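Steps 2 and 4 can be sketched in a few lines of Python. This is only an illustration and is not how contextBuilder itself works: the facility name, the example.org URL, and the helper names (`JSONLDExtractor`, `extract_jsonld`) are made up for the example, and the page is built in memory rather than fetched from a whitelist.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block rather than abort the harvest


def extract_jsonld(html):
    """Return every JSON-LD document embedded in an HTML string."""
    parser = JSONLDExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical facility description (step 2); all values are placeholders.
facility = {
    "@context": "https://schema.org/",
    "@type": "Organization",
    "name": "Example Data Facility",
    "url": "https://facility.example.org/",
}

# Embed the description in a landing page the way step 2 describes.
landing_page = (
    "<html><head><title>Example Data Facility</title>\n"
    '<script type="application/ld+json">\n'
    + json.dumps(facility, indent=2)
    + "\n</script>\n</head><body><h1>Example Data Facility</h1></body></html>"
)

# A harvester (step 4) would fetch each whitelisted URL and do this per page.
harvested = extract_jsonld(landing_page)
print(harvested[0]["name"])
```

A real harvester would loop over the whitelist, fetch each page, and pass the response body to `extract_jsonld`. Pages without an embedded package simply yield an empty list.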

After the JSON-LD has been read in, it could be converted to RDF for a triple store, or fed into whatever other data storage or indexing approach a harvesting group uses.
There is no blessed harvesting or presentation site. Any number of groups or organizations could harvest and provide access to this material.
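To give a feel for the JSON-LD to RDF step, here is a deliberately naive sketch. It assumes a single flat object whose inline `@context` declares only `@vocab`, with string values. Anything beyond that (remote contexts, nested objects, type coercion) needs a real JSON-LD processor. The function name `jsonld_to_ntriples` and the sample values are hypothetical, and the literal escaping only approximates N-Triples rules.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def jsonld_to_ntriples(doc, blank_node="_:facility"):
    """Flatten one simple JSON-LD object into N-Triples-style lines.

    Supported subset (an assumption of this sketch): an inline @context
    holding only @vocab, an optional @id, @type, and string or
    list-of-string property values.
    """
    context = doc.get("@context")
    if not isinstance(context, dict) or "@vocab" not in context:
        raise ValueError("sketch only handles an inline @context with @vocab")
    vocab = context["@vocab"]

    # Use the declared @id as the subject, else fall back to a blank node.
    subject = f"<{doc['@id']}>" if "@id" in doc else blank_node

    lines = []
    for key, value in doc.items():
        if key in ("@context", "@id"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if not isinstance(item, str):
                raise ValueError(f"unsupported value for {key!r}")
            if key == "@type":
                # Types expand against @vocab and become IRIs.
                lines.append(f"{subject} <{RDF_TYPE}> <{vocab}{item}> .")
            else:
                # With no type coercion in the context, strings are literals.
                literal = json.dumps(item, ensure_ascii=False)
                lines.append(f"{subject} <{vocab}{key}> {literal} .")
    return lines


# Hypothetical harvested document; all values are placeholders.
doc = {
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "https://facility.example.org/",
    "@type": "Organization",
    "name": "Example Data Facility",
}

for line in jsonld_to_ntriples(doc):
    print(line)
```

The resulting lines could be bulk-loaded into most triple stores. A harvesting group that prefers a document index could skip this step and index the JSON-LD directly.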

The following image gives a brief overview of how facilities might take their descriptor documents and metadata and expose this material up through a workflow to aggregation and interface clients.

Image of Flow

Errata

On ad hoc implementation

As noted, a test crawler, harvester, and indexer is being developed at contextBuilder (https://github.com/fils/contextBuilder). This is a simple (and not production-ready) application for harvesting from a whitelist and extracting the JSON-LD package. The next step will be to convert this JSON-LD to triples and move them into a standard triple store. A focused JSON-LD crawler is also in development at https://github.com/ESIPFed/snapHacks/tree/master/sh01-jsonldCrawl

On external vocabularies

The registryC5 file is testing some external vocabulary uses. It is valid JSON-LD, but Google will always throw an error since it doesn't see the external term as a property of a known schema.org class. This should be fine, and I have tested it, but the worry with Google is always that you will not know when their handling of this case changes. Their typical response has been, "try and get things you need in core schema.org".
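For illustration, a document mixing schema.org with an external vocabulary might look like the sketch below. This is not the content of the registryC5 file: the `ext` prefix, the `https://example.org/vocab#` namespace, the `ext:certificate` property, and the helper `external_properties` are all hypothetical stand-ins.

```python
import json

SCHEMA = "https://schema.org/"
EXT = "https://example.org/vocab#"  # hypothetical external namespace

# schema.org is the default vocabulary; "ext" is declared as a prefix so
# that "ext:certificate" expands to https://example.org/vocab#certificate.
facility = {
    "@context": {"@vocab": SCHEMA, "ext": EXT},
    "@type": "Organization",
    "name": "Example Data Facility",
    "ext:certificate": "Example Certification",
}


def external_properties(doc):
    """List the keys in doc that use a prefix declared in its @context.

    These are the properties a schema.org-only validator is likely to
    flag, even though the document is valid JSON-LD.
    """
    context = doc.get("@context")
    if not isinstance(context, dict):
        return []
    prefixes = {
        term
        for term, iri in context.items()
        if not term.startswith("@") and isinstance(iri, str)
    }
    return sorted(
        key for key in doc if ":" in key and key.split(":", 1)[0] in prefixes
    )


print(json.dumps(facility, indent=2))
print(external_properties(facility))
```

Listing the prefixed properties up front makes it easier to tell expected validator complaints apart from real mistakes in the markup.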