dcppc / crosscut-metadata


Looking for comments: implementing the DATS JSON schemas #1

Open owhite opened 6 years ago

owhite commented 6 years ago

The proposed cross-cut metadata model, aka DATS, is available as machine-readable JSON schemas. Instance files can be serialized as JSON, and linked data support is provided via one or several JSON-LD context files. We currently provide 2 distinct JSON-LD context files based on 2 complementary, community-driven vocabulary resources: (i) schema.org, and (ii) relevant OBO Foundry ontologies.
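To make the serialization idea concrete, here is a minimal, illustrative sketch of a DATS-style Dataset instance in Python. The property names follow the general DATS pattern, but the exact fields and the `@context` file name are placeholders, not the published schemas or URLs:

```python
import json

# Illustrative DATS-style Dataset instance. The @context value is a
# placeholder file name (assumed, not a real published context URL);
# the linked-data semantics come entirely from that attached context.
dataset = {
    "@context": "dataset_sdo_context.jsonld",  # schema.org-based context (assumed name)
    "@type": "Dataset",
    "identifier": {"identifier": "doi:10.0000/example", "identifierSource": "DOI"},
    "title": "Example cross-cut metadata instance",
}

# Instances are serialized as plain JSON; tools that ignore JSON-LD can
# still consume them as ordinary JSON documents.
serialized = json.dumps(dataset, indent=2)
print(serialized)
```

A plain-JSON consumer sees ordinary key/value pairs; a JSON-LD processor resolves the same keys to vocabulary terms via the context file.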

Justification: we will use these two resources because (a) they meet two different requirements: schema.org enables discoverability by major search engines, while OBO Foundry facilitates interoperability with many biomedical databases; and (b) no single vocabulary fulfills all the requirements of the metadata elements.

Background: Schema.org (http://schema.org) is a collaborative, community activity whose mission is to create, maintain, and promote schemas for structured data on the Internet: on web pages, in email messages, and beyond. It is sponsored by Google, Microsoft, Yahoo and Yandex, and is already used by over 10 million sites to mark up their web pages and email messages. Anchoring the cross-cut metadata model to such a (potentially) powerful vocabulary is very valuable and serves the specific discoverability scope; however, although it covers many topics, it is shallow when describing datasets, experiments, samples, etc.

The Oxford team has contributed suggestions to extend schema.org, but the process is very controlled and centrally coordinated, as one would expect given the scope, and additions appear to be prioritised according to their cross-domain applicability. The Oxford team also contributes (by participating in and leading) activities under the Bioschemas umbrella (http://bioschemas.org), which includes major data repositories, BD2K and ELIXIR resources, and is set to cover other digital objects beyond data. However, these ‘extensions’ may remain just that and will not necessarily be used by or included in the general schema.org vocabulary; the process is still unclear. Moreover, the scope of Bioschemas is also discoverability (especially if and when it becomes clear how these extensions will be added to or used by schema.org), and its vocabularies are not rich or deep.

Conclusion: the need to complement schema.org/Bioschemas with OBO Foundry ontologies remains, as the latter ensure compatibility with models such as Biolink. Relying on another framework also allows us to test the reactivity and responsiveness of the community when sending term requests as gaps are identified. This in turn lets us devise key performance indicators that could be used to select one resource over another.

Future intent: this initial choice of 2 frameworks is by no means final. In fact, more JSON-LD context files may be produced to support other needs found in clinical contexts (NCIT, LOINC, CDISC-RDF). Finally, one has to stress that these frameworks are not mutually exclusive and in fact ought to be used together to maximize their effects and respective values.

SusannaSansone commented 6 years ago

For those who want more info on DATS see https://github.com/datatagsuite

mfenner commented 6 years ago

Thanks @owhite, this should align very well with the work KC2 is doing on core metadata, which uses schema.org.

david4096 commented 6 years ago

I'm excited to see that JSON-LD approaches are being evaluated! Would you mind explaining what DATS offers (or restricts) over simply using JSON-LD with obo/bio/schema.org annotations?

I've seen the prefixcommons contexts, and it seems like simply using prefixes to annotate when available suits the goals of JSON-LD. To be standards compliant, one simply needs to provide a mapping between your ontology and an available ontology when JSON-LD has been used properly. Perhaps I'm confusing things, but I'm concerned that using DATS will bring us another data model when we are simply trying to provide an easy ramp to using standard translation techniques with JSON-LD. The use case described seems to be about JSON-LD, OBO Foundry, and schema.org.

Thanks for taking the time to explain this! I'd like to understand the benefits of using DATS over simply using proper JSON-LD context when available, and then providing mappings as necessary. For example, if your future intent is to have platforms easily extend to other contexts, how is enforcing a schema going to help that? We would like to offer ways to gradually improve the linkages in our APIs using proven standards and available JSON-LD contexts. There is some concern that curating to the DATS metadata model would immediately make a number of our data sharing goals intractable: too many human hours would be needed to map our metadata to this model.

Links: the schema.org JSON-LD context; the OBO context from prefixcommons.

agbeltran commented 6 years ago

The advantage of following a model such as DATS would be that the data will follow the same pattern, harmonizing the representation format and semantics (using the context files) and thus, allowing for validation, easy integration and querying in a unified way.

The need for multiple context files may arise due to the vocabularies not covering all the needs (e.g. schema.org does not cover specific biological/clinical terms, and that's why the proposal would be to use schema.org context files combined with e.g. OBO Foundry ontology context files, when necessary), but there might not be a need for many more context files. In any case, having the data represented in DATS would allow switching context (semantic mappings) easily.
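The "switching context" point can be sketched as follows: the instance body stays fixed, and only the `@context` reference changes to re-interpret the same data against a different vocabulary. Both context file names below are placeholders, not the actual DATS context files:

```python
# The same DATS-style instance under two interpretations: data is unchanged,
# only the @context pointer differs. File names are illustrative assumptions.
instance = {
    "@context": "sdo_context.jsonld",  # schema.org-based interpretation
    "@type": "Dataset",
    "title": "Example dataset",
}

# Re-interpret against OBO Foundry terms without touching the data itself.
obo_view = dict(instance)
obo_view["@context"] = "obo_context.jsonld"

assert obo_view["title"] == instance["title"]  # payload is identical
```

Because the shape of the document is fixed by the DATS schema, swapping contexts is a one-line change rather than a data migration.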

The OBO JSON-LD context file from prefixcommons defines the prefixes for the OBO Foundry ontologies, allowing data to be annotated with compact URIs that rely on these prefixes. The prefixcommons context files provide a way to map URIs to compact URIs and to harmonize identifiers. However, with context files alone it is not possible to harmonize the data representation, as people could use totally different patterns. Maybe you can expand on what you mean by "simply using proper JSON-LD context when available, and then providing mappings as necessary"?

cmungall commented 6 years ago

@agbeltran is correct, the OBO JSON-LD context defines prefixes such as GO, UBERON, etc. It is the canonical way to map from an OBO CURIE such as GO:0008150 to a URL.
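As a sketch of what that mapping does: a prefixcommons-style context is essentially a prefix-to-URI-stem table, and CURIE expansion is just concatenation. The two entries below follow the standard OBO PURL pattern:

```python
# Minimal prefix map in the style of the prefixcommons OBO context
# (only two entries shown; the real context covers all OBO Foundry ontologies).
OBO_CONTEXT = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand_curie(curie: str, context: dict) -> str:
    """Expand a CURIE like 'GO:0008150' to a full URI using the prefix map."""
    prefix, local_id = curie.split(":", 1)
    return context[prefix] + local_id

print(expand_curie("GO:0008150", OBO_CONTEXT))
# http://purl.obolibrary.org/obo/GO_0008150
```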

These are complementary with the DATS OBO contexts which provide a mapping from types in the DATS JSON Schema such as MolecularEntity to an OBO class, e.g.

https://github.com/datatagsuite/context/blob/master/obo/molecular_entity_obo_context.jsonld

I have a question re schema.org vs DATS OBO contexts. It seems the former is the default in the example files provided, and hence determines the canonical RDF interpretation of each document. Yet schema.org is not rich enough, and lacks mappings for derivedFrom and MolecularEntity. Additional granularity may be provided by Bioschemas, but this doesn't have the type richness of OBO (e.g. sequence feature types with the Sequence Ontology). How do I petition to have OBO be the default? Or perhaps there could be a hybrid, with schema.org used for people, organizations, etc. and OBO used for biological entities?
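The hybrid suggestion could look something like a single context that maps administrative types to schema.org and biological types to OBO classes. The specific term IRIs below are illustrative assumptions, not the actual DATS context files:

```python
# Hypothetical hybrid JSON-LD context (as a Python dict for illustration):
# schema.org for people/organizations, OBO Foundry classes for biological
# entities. The MolecularEntity mapping to ChEBI's 'molecular entity' class
# is an illustrative choice, not the published DATS mapping.
hybrid_context = {
    "@context": {
        "sdo": "http://schema.org/",
        "obo": "http://purl.obolibrary.org/obo/",
        "Person": "sdo:Person",
        "Organization": "sdo:Organization",
        "MolecularEntity": "obo:CHEBI_23367",
    }
}
```

Since JSON-LD terms are resolved independently, nothing prevents a single context from drawing on both vocabularies at once.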

proccaserra commented 6 years ago

@cmungall, we created both sets of context files as they may serve distinct purposes: schema.org for discovery and OBO Foundry for reuse and interoperability. It turns out that the first conversions to RDF were done with the schema.org context file, but that by no means excludes the use of the OBO Foundry based ones, quite the opposite. @agbeltran was looking at using both but ran into limitations caused by the rdflib library. We are still working on addressing this, and the main goal is to demonstrate the capability to use both (and possibly more), as well as to allow add-ons by third parties. While doing the mapping to OBO Foundry resources, I identified gaps, so I will be sending term requests to the relevant efforts. We can discuss this during the next OBO Foundry call.