ACCESS-NRI / schema

A place to draft and house schema
Apache License 2.0

Purpose, motivation and implementation #1

Open aidanheerdegen opened 1 year ago

aidanheerdegen commented 1 year ago

First up, props to @dougiesquire for starting the ball rolling.

What is the purpose of this repo?

Central location for all schema information for the ACCESS-NRI organisation.

Why?

Consistent approach across the ACCESS-NRI organisation

It improves productivity because there is a single source of truth for schema, and it lowers the barrier for newcomers to the subject area who are not in a position to create their own schema due to a lack of background knowledge.

It also naturally leads to interoperability: if everyone uses the same schema they re-use and connect with existing schema, and get that connectivity "for free".

Such interconnected schema enable building knowledge graphs. A knowledge graph, or semantic network, is a graph-based representation of the connections between the objects described by the schema. A knowledge graph makes it possible to traverse data in ways that were not anticipated when it was catalogued.

Knowledge graphs are a sort of ad hoc ontology.
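
As a hedged illustration of that "free" connectivity (assuming the rdflib Python library; the identifiers and names below are made up, not real ACCESS-NRI records), two independently published schema.org records can be merged into one graph and traversed with a single SPARQL query:

from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

SDO = Namespace("https://schema.org/")
g = Graph()

dataset = URIRef("https://example.org/dataset/1")  # made-up identifier
paper = URIRef("https://doi.org/10.1038/s41598-018-24483-z")

# Record 1: a dataset that cites a paper
g.add((dataset, RDF.type, SDO.Dataset))
g.add((dataset, SDO.name, Literal("Example toxicity dataset")))
g.add((dataset, SDO.citation, paper))

# Record 2: metadata about that paper, published separately
g.add((paper, RDF.type, SDO.CreativeWork))
g.add((paper, SDO.name, Literal("Towards a generalized toxicity prediction model ...")))

# Traverse the merged graph: which papers do which datasets cite?
query = """
PREFIX sdo: <https://schema.org/>
SELECT ?dataset ?paperName WHERE {
    ?dataset a sdo:Dataset ;
             sdo:citation ?paper .
    ?paper sdo:name ?paperName .
}
"""
for row in g.query(query):
    print(row.dataset, "cites", row.paperName)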

Discoverability

Adding schema to webpages in JSON-LD format promotes discovery: JSON-LD is the standard for semantic searching and cataloguing on the web.

This can lead to connections with other data providers, which adds value with little specific effort.
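
As a rough sketch of what that embedding looks like in practice (the record below is a placeholder, not a real ACCESS-NRI dataset), the JSON-LD is simply placed in the page inside a script tag of type application/ld+json:

import json

record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example ACCESS-NRI dataset",             # placeholder metadata
    "description": "Illustrative description only",   # placeholder metadata
    "url": "https://example.org/datasets/1",           # placeholder URL
}

# Crawlers pick the record up when it appears anywhere in the page's HTML
# inside a <script type="application/ld+json"> element.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(record, indent=2)
    + "\n</script>"
)
print(snippet)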

aidanheerdegen commented 1 year ago

How?

Format

The standard for schema on the web is RDF; it is what schema.org and Bioschemas use. Bioschemas is probably the one we should follow most closely.

An example schema is the Bioschemas Dataset profile, and an example record is:

{
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "http://purl.org/dc/terms/conformsTo": { "@type": "CreativeWork", "@id": "https://bioschemas.org/profiles/Dataset/1.0-RELEASE" },
    "@id": "https://doi.org/10.5281/zenodo.5743204",
    "identifier": "10.5281/zenodo.5743204",
    "name": "RDF version of the data from Choi, JS. et al. Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources (2018)",
    "description": "This is an RDFied version of the dataset published in Choi, JS., Ha, M.K., Trinh, T.X. et al. Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources. Sci Rep 8, 6110 (2018). The original dataset publication DOI: https://doi.org/10.1038/s41598-018-24483-z. The Original publication authors: Jang-Sik Choi, My Kieu Ha, Tung Xuan Trinh, Tae Hyun Yoon & Hyung-Gi Byun",
    "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
    "url": "https://zenodo.org/record/5743204",
    "keywords": "oxide, nanomaterial, toxicity, prediction",
    "creator": [
      {
        "@type": "Organization",
        "name": "NanoSolveIT"
      }
    ],
    "datePublished": "2021-11-30",
    "citation": { "@type": "CreativeWork", "@id": "https://doi.org/10.1038/s41598-018-24483-z", "name": "Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources" }
  }
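
To make the point that a record like this is just an RDF graph in JSON-LD serialisation, here is a minimal, self-contained sketch (assuming rdflib version 6 or later, which bundles a JSON-LD parser) that loads a trimmed-down copy of the record and prints the resulting triples. An inline @vocab context stands in for the remote schema.org context so the example runs offline:

import json
from rdflib import Graph

record = {
    "@context": {"@vocab": "https://schema.org/"},  # inline stand-in for "https://schema.org/"
    "@type": "Dataset",
    "@id": "https://doi.org/10.5281/zenodo.5743204",
    "name": "RDF version of the data from Choi, JS. et al. (2018)",
    "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
}

g = Graph()
g.parse(data=json.dumps(record), format="json-ld")

# Each key/value pair becomes a (subject, predicate, object) triple.
for s, p, o in g:
    print(s, p, o)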

Other use cases

This is all very well, but how does this map to relational databases like the ones typically used for data indexing?

In a general sense, not very well. However, the reverse mapping, from an SQL DB to RDF, is more straightforward.

If we're working mostly in the relational DB/SQL space, and so want the RDF mapping for interoperability with the wider world, then that will limit how complex we let the schemas become. Alternatively, we could have a strict hierarchy of schema: a tighter definition at the bottom that is interoperable with SQL, and higher-level schema with more freedom that allow for more connectivity.
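
As a hedged sketch of that reverse SQL-to-RDF direction (the table name, column names and mapping choices here are illustrative assumptions, not an ACCESS-NRI design), each row of a relational table can be turned into a schema.org node, with the table mapping to a class and columns mapping to properties:

import sqlite3
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

SDO = Namespace("https://schema.org/")

# A toy in-memory table standing in for a data-indexing database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (id TEXT, name TEXT, license TEXT)")
conn.execute(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    ("https://example.org/dataset/1", "Example dataset",
     "https://creativecommons.org/licenses/by/4.0/"),
)

g = Graph()
for row_id, name, licence in conn.execute("SELECT id, name, license FROM datasets"):
    subject = URIRef(row_id)                       # primary key -> node identifier
    g.add((subject, RDF.type, SDO.Dataset))        # table -> class
    g.add((subject, SDO.name, Literal(name)))      # columns -> properties
    g.add((subject, SDO.license, URIRef(licence)))

print(g.serialize(format="turtle"))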

dougiesquire commented 1 year ago

Thanks for providing these details and context @aidanheerdegen

(Possibly) relevant climate-data examples