ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Provide a JSON-LD context file or files #311

Open cmungall opened 9 years ago

cmungall commented 9 years ago

A JSON-LD context file provides an unambiguous, machine-interpretable way to translate a JSON document to an RDF serialization. Even in cases where translation to RDF is not a desired goal, a JSON-LD context file can be a useful way of clarifying certain aspects of the GA4GH schema, such as how identifiers should be specified.
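To make the translation concrete, here is a minimal sketch of how a context turns a flat JSON object into RDF-style triples. It is not a conforming JSON-LD processor, and the key names and the choice of predicate IRIs in `CONTEXT` are assumptions for illustration, not part of any GA4GH schema.

```python
# Hypothetical context: maps JSON keys to RDF predicate IRIs.
# "@id" marks the key whose value identifies the subject.
CONTEXT = {
    "id": "@id",
    "name": "http://www.w3.org/2000/01/rdf-schema#label",
    "description": "http://purl.org/dc/terms/description",
}


def to_triples(doc, context):
    """Translate a flat JSON object into (subject, predicate, object) tuples."""
    # Find the key aliased to "@id"; fall back to the literal "@id" key.
    id_key = next((k for k, v in context.items() if v == "@id"), "@id")
    subject = doc[id_key]
    triples = []
    for key, value in doc.items():
        if key in (id_key, "@context"):
            continue
        predicate = context.get(key)
        if predicate is None:
            # Keys the context does not define carry no RDF meaning and are skipped.
            continue
        triples.append((subject, predicate, value))
    return triples


example = {"id": "http://example.org/variant/1", "name": "rs123", "extra": 1}
triples = to_triples(example, CONTEXT)
```

The `extra` key is silently dropped because the context does not define it, which is also how a real JSON-LD processor treats undefined terms when no default vocabulary is set.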

Broadly speaking, the JSON-LD context file would provide mappings to RDF URIs for the following

Note that I am proposing this be adopted in a non-invasive way. The JSON-LD context can be safely ignored by developers seeking to consume JSON, and the adoption of a JSON-LD schema should not affect modeling decisions in the Avro schemas. To make a JSON document JSON-LD, all that is necessary is to add the "@context" object in the header, but even this should not be required; rather, there should be a simple translation from the Avro-specified JSON to JSON-LD that adds this.
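The "simple translation" could be as small as the following sketch. The context URL is a placeholder (no such file is published), and `to_jsonld` is a hypothetical helper name.

```python
# Placeholder URL, assumed for illustration only.
CONTEXT_URL = "http://example.org/ga4gh/context.jsonld"


def to_jsonld(doc, context=CONTEXT_URL):
    """Return a copy of a plain JSON object with an "@context" entry added.

    The input is left untouched, so consumers that only want plain JSON
    are unaffected.
    """
    if "@context" in doc:
        return dict(doc)  # already JSON-LD; do not overwrite its context
    return {"@context": context, **doc}


plain = {"id": "variant-1", "start": 100}
linked = to_jsonld(plain)
```

Because the context is referenced by URL rather than embedded, the original payload stays byte-for-byte the same apart from the one added key.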

One area where a GA4GH context file would immediately provide some clarification is issue #165, where there is currently no consensus on the form of an ID that denotes an ontology class. If the GA4GH uses this OBO JSON-LD context or some subset of it, this would provide an unambiguous way of writing identifiers for any OBO library class.
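As a rough illustration of what a prefix-based context buys for ontology class IDs, the sketch below expands a compact identifier into a full OBO PURL. The two-entry prefix map is a small assumed subset in the style of the OBO context, and `expand_curie` is a hypothetical helper.

```python
# Assumed subset of an OBO-style prefix map; a real context lists many more.
OBO_PREFIXES = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def expand_curie(curie, prefixes=OBO_PREFIXES):
    """Expand a compact ID such as "GO:0008150" into a full IRI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + local


iri = expand_curie("GO:0008150")
# iri == "http://purl.obolibrary.org/obo/GO_0008150"
```

With a shared prefix map, "GO:0008150" has exactly one expansion, so clients no longer need to guess between competing ID spellings.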

The use of JSON-LD contexts was brought up by @tetron in the ever expanding discussion of #264, and may help clarify certain aspects of that discussion.

A JSON-LD context file that provides complete coverage over all keys used in the union of all avro modules would be a larger task. This need not all be done at once, and need not be done as an 'official' GA4GH project (although it would be better to avoid the situation where we have competing JSON-LD contexts).

tetron commented 9 years ago

An important thing to keep in mind: the json-ld context file needs to stay in sync with the avro schema that it maps from, which may be challenging if it has to be updated manually against an evolving spec.

For the common workflow language effort, which is associated with the ga4gh containers and workflows task team, we have been working on extending the avro schema language with annotations that enable automatic generation of the json-ld context and rdfs schema from the avro schema. Example schema with annotations:

https://github.com/common-workflow-language/common-workflow-language/blob/master/schemas/draft-2/cwl-avro.yml

The processing code is here:

https://github.com/common-workflow-language/common-workflow-language/tree/master/reference/cwltool/avro_ld
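The general idea of generating a context from annotations can be sketched in a few lines. This is not the avro_ld code: the `predicate` annotation key, the record and field names, and the example IRIs are all assumptions made for illustration.

```python
# Avro-style record definitions with a hypothetical "predicate" annotation
# on each field that should appear in the JSON-LD context.
ANNOTATED_SCHEMA = [
    {
        "type": "record",
        "name": "Variant",
        "fields": [
            {"name": "id", "type": "string", "predicate": "@id"},
            {"name": "start", "type": "long",
             "predicate": "http://example.org/vocab#start"},
            {"name": "notes", "type": "string"},  # no annotation: left out
        ],
    }
]


def context_from_schema(schema):
    """Collect annotated fields from every record into one JSON-LD context."""
    context = {}
    for record in schema:
        for field in record.get("fields", []):
            predicate = field.get("predicate")
            if predicate is not None:
                context[field["name"]] = predicate
    return {"@context": context}


generated = context_from_schema(ANNOTATED_SCHEMA)
```

Since the context is derived from the schema on every build, it cannot drift out of sync the way a hand-maintained file can.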

A few notes:

This is formatted using yaml instead of json for ease of writing inline documentation since yaml supports multiline string literals and plain json doesn't.

It supports data type definitions. I have not tried to use it with protocol definitions, but I don't expect much additional work would be required for the json notation.

It does not support the avro IDL syntax currently used to write most ga4gh schemas. I'm not sure how best to go about implementing that.

The processor also implements record subclassing, abstract types, template types with specialization, and documentation generation. Example:

http://common-workflow-language.github.io

I am considering splitting this out from CWL to its own project, however to make it useful to the rest of ga4gh it would need outside contributors because my time is pretty limited.

cmungall commented 9 years ago

@tetron - good point about synchronization. But we already have this when the schema changes but the implementation doesn't. It should be possible to automatically check the ld and the avro are in sync using some simple tooling.
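A sync check of this kind might look like the sketch below, which compares the field names of one Avro record (in its JSON form) against the terms of a context. The function name and the sample schema are hypothetical, and the check assumes a flat record with no nested types.

```python
def check_in_sync(avro_record, context):
    """Return (fields missing from the context, context terms with no field)."""
    fields = {f["name"] for f in avro_record.get("fields", [])}
    # JSON-LD keywords such as "@vocab" are not field mappings, so skip them.
    terms = {k for k in context if not k.startswith("@")}
    return sorted(fields - terms), sorted(terms - fields)


record = {
    "type": "record",
    "name": "Variant",
    "fields": [{"name": "id", "type": "string"},
               {"name": "start", "type": "long"}],
}
ctx = {"id": "@id", "end": "http://example.org/vocab#end"}

missing, stale = check_in_sync(record, ctx)
# missing == ["start"], stale == ["end"]
```

Run in continuous integration, a non-empty result on either side would flag the drift. Note that prefix definitions in the context (for example an ontology prefix) would show up as stale terms here and would need to be whitelisted.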

The idea of annotating Avro with additional information is an interesting one. This would require broad agreement and substantial changes across the GA4GH. I would prefer to discuss this in a separate ticket.

The proposal I have outlined would have virtually zero impact on existing schema development and implementation, and would be an optional add-on (the only area that might be impacted would be forcing the adoption of a set of standardized prefixes for identifiers, which I think would be a good thing).

This may be sub-optimal in the long run, and it may be better to eventually tightly couple the json-ld and the avro. But perhaps best to do things incrementally?

tetron commented 9 years ago

Actually, I disagree about having zero impact on existing schema development. To apply json-ld successfully, several details that affect the ability to capture all the semantics of idiomatic json have to be accounted for in schema design:

cmungall commented 9 years ago

@tetron good points