datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

Accept and render JSON-LD metadata #341

Open jsheunis opened 1 year ago

jsheunis commented 1 year ago

It would be ideal if datalad's whole metadata handling and rendering stack could work with JSON-LD data in a seamless way.

An example use-case is the tabby-to-catalog pipeline: if we have JSON-LD records coming out of tabby files, how do we handle these records in order to have the rich semantic information rendered sensibly in a catalog?

Let's use the current catalog schema, here a part of the specific schema for a dataset, as an example to work from:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://datalad.org/catalog.dataset.schema.json",
  "title": "dataset",
  "description": "A dataset in a DataLad Catalog",
  "type": "object",
  "properties": {
    "type": {
      "description": "The type of node",
      "title": "Type",
      "type": "string",
      "pattern": "dataset"
    },
    "dataset_id": {
      "description": "The dataset ID",
      "title": "Dataset ID",
      "type": "string"
    },
    "dataset_version": {
      "description": "The dataset VERSION",
      "title": "Dataset version",
      "type": "string"
    },
    ...
}

The dataset schema has several properties that can be contained in an incoming metadata record, of which only a few are required. Some properties expect specific fields and formats (e.g. author, which should have givenName, familyName, etc.), while others accept generic key-value pairs (e.g. additional_display and top_display).

How should we approach updates to such a schema in order to allow JSON-LD data through?

Idea 1

If we want to define catalog schema terms ourselves, i.e. turn it into a semantic schema, we could for example add a definition field to each property which would contain a definition URL from some accessible ontology.
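
A minimal sketch of Idea 1, assuming a hypothetical `definition` keyword on each property. The keyword name and the schema.org URLs are illustrative assumptions, not part of the current catalog schema. As far as I know, JSON Schema validators treat keywords they do not recognise as annotations, so adding such a field should not change validation behaviour, but that would need checking against the validator the catalog uses.

```python
# Hypothetical schema fragment: the "definition" keyword and the URLs
# below are assumptions for illustration, not part of the current schema.
dataset_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://datalad.org/catalog.dataset.schema.json",
    "title": "dataset",
    "type": "object",
    "properties": {
        "type": {
            "description": "The type of node",
            "title": "Type",
            "type": "string",
            "pattern": "dataset",
        },
        "dataset_id": {
            "description": "The dataset ID",
            "title": "Dataset ID",
            "type": "string",
            "definition": "https://schema.org/identifier",
        },
        "dataset_version": {
            "description": "The dataset version",
            "title": "Dataset version",
            "type": "string",
            "definition": "https://schema.org/version",
        },
    },
}


def collect_definitions(schema):
    """Map each property name to its definition URL, skipping properties without one."""
    return {
        name: spec["definition"]
        for name, spec in schema.get("properties", {}).items()
        if "definition" in spec
    }


definitions = collect_definitions(dataset_schema)
```

A rendering component could use such a map to attach a definition link to each displayed field.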

Idea 2

We could add another generic key-value property to the schema, something like semantic_metadata, which would allow for passing the term definitions along with the key and value, for multiple records. This could then get a dedicated display area in a catalog's dataset page.
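
A rough sketch of what Idea 2 could look like in a metadata record. The entry fields `key`, `value` and `definition` are hypothetical names chosen for illustration, and the ID and version values are placeholders.

```python
# Hypothetical shape for a "semantic_metadata" property: every entry
# carries its key, its value, and a term definition (an IRI).
record = {
    "type": "dataset",
    "dataset_id": "example-id",            # placeholder value
    "dataset_version": "example-version",  # placeholder value
    "semantic_metadata": [
        {
            "key": "homepage",
            "value": "https://github.com/allisonhorst/palmerpenguins",
            "definition": "https://schema.org/mainEntityOfPage",
        },
        {
            "key": "Used for",
            "value": "Testing effort for DBI backends",
            "definition": "http://www.w3.org/ns/prov#hadUsage",
        },
    ],
}


def display_rows(item):
    """Return (key, value, definition) tuples for a dedicated display area."""
    return [
        (entry["key"], entry["value"], entry.get("definition", ""))
        for entry in item.get("semantic_metadata", [])
    ]


rows = display_rows(record)
```

Keeping the definition next to each key and value means the display area needs no separate lookup step.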

Idea 3

Perhaps the catalog should somehow evolve into something that can just interpret and render any JSON-LD document, or at least those adhering to some convention as described by a given context. This whole concept needs further exploration, and possibly a paradigm shift in terms of how metadata is added to a catalog and rendered by it.
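
To make "interpreting a document via its context" a bit more concrete, here is a deliberately simplified sketch that resolves the keys of a flat document to full IRIs. It is not the JSON-LD expansion algorithm: it only handles plain string mappings, `{"@id": ...}` mappings and `prefix:suffix` terms, and ignores values, nesting, `@vocab`, remote contexts and everything else. A real implementation would rely on a conforming JSON-LD processor. The function names and the example document are hypothetical.

```python
def expand_term(term, context):
    """Resolve a term to an IRI using only simple mappings and prefixes."""
    if term.startswith("@"):
        return term  # JSON-LD keywords are left untouched
    mapped = context.get(term)
    if isinstance(mapped, dict):
        mapped = mapped.get("@id")
    if isinstance(mapped, str):
        term = mapped
    prefix, sep, suffix = term.partition(":")
    # "https://..." is already an IRI, so do not treat "https" as a prefix.
    if sep and not suffix.startswith("//") and isinstance(context.get(prefix), str):
        return context[prefix] + suffix
    return term


def expand_keys(document):
    """Return a copy of a flat document with its keys replaced by full IRIs."""
    context = document.get("@context", {})
    return {
        expand_term(key, context): value
        for key, value in document.items()
        if key != "@context"
    }


# Hypothetical example document.
doc = {
    "@context": {
        "schema": "https://schema.org/",
        "name": "schema:name",
        "homepage": {"@id": "schema:mainEntityOfPage"},
    },
    "name": "palmerpenguins",
    "homepage": "https://github.com/allisonhorst/palmerpenguins",
}

expanded = expand_keys(doc)
```

Once keys are full IRIs, a catalog could choose a rendering component per IRI instead of per schema-specific field name.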

jsheunis commented 1 year ago

Another idea:

The schema can accept additional fields without failing to validate. This means that metadata containing new fields can be passed to the catalog without having to update the catalog schema. If e.g. an extra property additional_display_definitions is passed, where this is an object providing semantic definitions of the keys and values in additional_display, these definitions could be rendered in the catalog as extra information.

For example, let's say we have an additional_display property of a dataset-level metadata item:

"additional_display": [
        {
            "name": "SFB1451",
            "icon": "fa-solid fa-flask",
            "content": {
                "homepage": "https://github.com/allisonhorst/palmerpenguins",
                "CRC project": "INF",
                "data controller": {
                    "email": "ahorst@ucsb.edu",
                    "name": "Allison Horst"
                },
                "sample (organism)": [
                    "Adelie penguin (Pygoscelis adeliae; NCBITaxon_9238)",
                    "Gentoo penguin (Pygoscelis papua; NCBITaxon_30457)",
                    "chinstrap penguin (Pygoscelis antarcticus; NCBITaxon_79643)"
                ],
                "sample (organism part)": "body proper (UBERON_0013702)",
                "Used for": "Testing effort for DBI backends \u2014 The dataset is used as example data for testing data base backend features in automated tests (https://dbitest.r-dbi.org/)"
            }
        }
    ],

then the agent creating this metadata item could also include another property additional_display_definitions:

"additional_display_definitions": [
        {
            "name": "SFB1451",
            "keys": {
                "homepage": "https://schema.org/mainEntityOfPage",
                "CRC project": "",
                "data controller": {
                    "self": "https://w3id.org/dpv#hasDataController",
                    "email": "https://schema.org/email",
                    "name": "https://schema.org/name"
                },
                "sample (organism)": "",
                "sample (organism part)": "",
                "Used for": "http://www.w3.org/ns/prov#hadUsage"
            },
            "values": {
                "homepage": {},
                "CRC project": {},
                "data controller": {
                    "email": {},
                    "name": {}
                },
                "sample (organism)": {},
                "sample (organism part)": {},
                "Used for": {}
            }
        }
    ]

Both keys and values are included above, since either or both could have semantic definitions. But it's not expected that both or either would always be provided or necessary.
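
A minimal sketch of how a consumer could pair the two properties, using a shortened version of the example above. The function name `key_definitions` and the lookup shape (tab name plus key path) are assumptions for illustration. Empty strings are treated as "no definition", and the `self` entry is taken as the definition of a nested key itself.

```python
additional_display = [
    {
        "name": "SFB1451",
        "content": {
            "homepage": "https://github.com/allisonhorst/palmerpenguins",
            "CRC project": "INF",
            "data controller": {
                "email": "ahorst@ucsb.edu",
                "name": "Allison Horst",
            },
        },
    }
]

additional_display_definitions = [
    {
        "name": "SFB1451",
        "keys": {
            "homepage": "https://schema.org/mainEntityOfPage",
            "CRC project": "",
            "data controller": {
                "self": "https://w3id.org/dpv#hasDataController",
                "email": "https://schema.org/email",
                "name": "https://schema.org/name",
            },
        },
    }
]


def key_definitions(display, definitions):
    """Map (tab name, key path) to a definition IRI, or None if undefined."""
    keys_by_tab = {d["name"]: d.get("keys", {}) for d in definitions}
    result = {}

    def walk(tab, content, defs, path):
        for key, value in content.items():
            definition = defs.get(key)
            # For nested objects, "self" holds the definition of the key itself.
            own = definition.get("self") if isinstance(definition, dict) else definition
            result[(tab, path + (key,))] = own or None
            if isinstance(value, dict):
                nested = definition if isinstance(definition, dict) else {}
                walk(tab, value, nested, path + (key,))

    for tab in display:
        walk(tab["name"], tab.get("content", {}), keys_by_tab.get(tab["name"], {}), ())
    return result


lookup = key_definitions(additional_display, additional_display_definitions)
```

One caveat of this layout: a content key that is literally named "self" would clash with the reserved `self` entry.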

jsheunis commented 1 year ago

Relatedly, the whole context of the original metadata item before it was translated into the catalog schema could also be passed to the catalog as part of the metadata record, e.g. in the property @context. AFAIK this isn't a reserved keyword/property in jsonschema.

mslw commented 1 year ago

[Idea 2] We could add another generic key-value property to the schema, something like semantic_metadata, which would allow for passing the term definitions along with the key and value, for multiple records. This could then get a dedicated display area in a catalog's dataset page.

I think "sample (organism)" and similar are natural candidates for this approach. Here, we were starting with either a code (NCBITaxon:9237) or an equivalent IRI (http://purl.obolibrary.org/obo/NCBITaxon_9237).

AFAIK, the current "additional display" can only show strings (or repr of an object, which is still a string). I found no way to display a URL.

FTR, "Adelie penguin (Pygoscelis adeliae; NCBITaxon_9238)" was created from "NCBITaxon:9238" by a basic request-response API query to the Ontology Lookup Service while translating incoming data to the catalog schema. I think the catalog should not do any such queries, but it could allow displaying URLs as links.

Maybe we could have a way to pass a (specifically formatted?) object containing text and url to additional display that would be rendered as a hyperlink?
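
A sketch of that idea, assuming a hypothetical object shape with exactly the keys `text` and `url`. It is written in Python only for illustration; the actual rendering would live in the catalog's front-end code. Non-http(s) targets fall back to plain text so that an arbitrary scheme cannot end up in an href.

```python
from html import escape
from urllib.parse import urlparse


def render_value(value):
    """Render an additional_display value as an HTML fragment."""
    if isinstance(value, dict) and {"text", "url"} <= value.keys():
        text = escape(str(value["text"]))
        url = str(value["url"])
        # Only link http(s) targets; anything else is shown as plain text.
        if urlparse(url).scheme in ("http", "https"):
            return '<a href="{}">{}</a>'.format(escape(url, quote=True), text)
        return text
    # Strings and all other values keep the current behaviour: shown as text.
    return escape(str(value))


link = render_value(
    {
        "text": "Adelie penguin (Pygoscelis adeliae)",
        "url": "http://purl.obolibrary.org/obo/NCBITaxon_9238",
    }
)
plain = render_value("INF")
```

The translation step would still do the ontology lookup; the catalog would only need to recognise the object shape.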

jsheunis commented 1 year ago

AFAIK, the current "additional display" can only show strings (or repr of an object, which is still a string). I found no way to display a URL.

Some of this is now addressed with https://github.com/datalad/datalad-catalog/pull/347. Although indeed, URLs will still not render as links. I agree we should get that to work.

There are two ways of approaching it:

  1. Rendering based on semantic info: IMO this is the long-term ideal. The catalog should be able to derive the data type/format of any value from its semantic information, and should have standard components for rendering the relevant data types/formats.
  2. Rendering based on some assumptions and checks: this would be the "hacky" way, or at least the best possible way in the absence of extra information about a value. E.g. check if the field name contains url or if the actual value contains http[s]://, and if so assume that it should be rendered as a link.
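
The second approach could be sketched roughly as follows. The function name is hypothetical, and one deliberate deviation is made: the value check requires the whole value to be a URL rather than merely to contain http[s]://, so that a sentence with an embedded URL is not turned into one big link.

```python
import re

# The whole (stripped) value must be a single http(s) URL without spaces.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def looks_like_link(field_name, value):
    """Guess, without semantic information, whether a value is a link."""
    if not isinstance(value, str) or not value.strip():
        return False
    name_hint = "url" in field_name.lower()
    value_hint = URL_PATTERN.fullmatch(value.strip()) is not None
    return name_hint or value_hint


checks = {
    "homepage": looks_like_link(
        "homepage", "https://github.com/allisonhorst/palmerpenguins"
    ),
    "CRC project": looks_like_link("CRC project", "INF"),
    "Used for": looks_like_link(
        "Used for", "Testing effort for DBI backends (https://dbitest.r-dbi.org/)"
    ),
}
```

The field-name hint alone can misfire when the value is not actually a URL, which is part of why this remains the "hacky" option.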

jsheunis commented 1 year ago

I do however think it might be a nifty feature for a user to be able to query the definition of a term instead of (or in addition to) being presented with an uninformative link. This doesn't have to happen automatically, rather after specific user input, e.g. clicking on an info icon.

jsheunis commented 1 year ago

The following issue cannot be transferred here, but is highly applicable in that it started discussing thoughts of catalog rendering from a semantic data view: https://github.com/psychoinformatics-de/sfb1451-projects-catalog/issues/46

jsheunis commented 1 year ago

Relevant issue related to additional_display rendering, when additional display has semantic information included: https://github.com/sfb1451/tabby-utils/issues/14

jsheunis commented 5 months ago

Relevant issue related to additional_display rendering, when additional display has semantic information included: https://github.com/sfb1451/tabby-utils/issues/14

This has since been incorporated into the main branch of this repo. In addition, I think it would be useful for existing catalog instances to have a new and separate field in the dataset schema, something like "semantic properties", somewhat similar to the "relation" field used in datalad-concepts. This would be a good place for any property of the dataset that can be expressed semantically and is typically some relation, e.g. "sameas", "homepage", "study field".