datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

NF: implement design changes to separate "catalog" from "frontend/client" #253

Open jsheunis opened 1 year ago

jsheunis commented 1 year ago

The context

Let's consider the role of datalad-catalog in the context of data discoverability. It currently has two main functions: harmonizing structured metadata into catalog entries, and rendering those entries in a browsable, user-friendly frontend.

Even though data discoverability via a browsable, user-friendly catalog is an important use for harmonized metadata, it is not necessarily the only use case, and the harmonization step should be able to happen irrespective of which frontend or client ingests the catalog entries. For example, the harmonized metadata could be fed into a queryable graph database.

On the front-end side, an important challenge that datalad-catalog needs to be able to deal with in this context is the need for different access URLs for different metadata entries. There are multiple datasets and files in a catalog, and in order to allow varying degrees of access permissions (per user-dataset or user-file combination) these entries might be hosted at different locations, each with its own set of access requirements.

The current state

The two functions, data harmonization and the browsable frontend, are quite interdependent. E.g.:

What needs to happen

jsheunis commented 1 year ago

Ideas for updating the frontend functionality.

Current implementation

A "catalog" contains all the HTML and JS assets for the frontend as well as the catalog entries that need to be rendered. The catalog assumes all JSON blobs of the metadata entries are located locally in the metadata directory relative to the index.html file. Metadata entries are organized in a hierarchy of nodes; a node can be of type dataset or directory. Here's an example of the hierarchy and a directory node displayed:

Screenshot

In the hierarchy, each node location is identified by 1) the dataset_id, 2) the dataset_version, and 3) a hash of dataset_id + dataset_version (plus the node_path in the case of a directory-type node). The entry blob (fetched via a GET request) lives in a file with (part of) that hash in its filename.
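The hash-based lookup described above can be sketched as follows. Note that the join character, the hash algorithm (md5), and the exact directory layout here are assumptions for illustration, not necessarily datalad-catalog's actual scheme:

```python
import hashlib
from pathlib import PurePosixPath

def node_file_path(dataset_id, dataset_version, node_path=None):
    """Illustrative sketch: derive the on-disk location of a node's
    metadata blob from its identifying fields.

    NOTE: the join character, hash algorithm (md5), and path layout
    are assumptions; the real scheme may differ.
    """
    parts = [dataset_id, dataset_version]
    if node_path is not None:  # directory-type nodes also include their path
        parts.append(str(node_path))
    digest = hashlib.md5("-".join(parts).encode()).hexdigest()
    # Entries live under metadata/<id>/<version>/, with (part of) the
    # hash making up the filename
    return PurePosixPath("metadata", dataset_id, dataset_version,
                         digest[:3], f"{digest[3:]}.json")
```

The key property is that the location is fully determined by the identifying fields, so the frontend can compute it client-side without any server-side lookup.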

The workflow is:

  1. user clicks on a dataset in a catalog frontend, or clicks on a directory in the file tree of a dataset
  2. javascript calculates relative location of node file
  3. blob is fetched from file
  4. blob is rendered in frontend

For purposes of rendering content in the frontend, there might also be a config.json file on the dataset level (at the location my_catalog/metadata/dataset_id/dataset_version/config.json). There is always a config.json on the catalog level (at the location my_catalog/config.json). An inheritance principle between catalog-level and dataset-level config already exists and works: any setting not specified on the dataset level during catalog entry generation gets inherited from the catalog level.
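The inheritance between the two config levels amounts to a fallback merge, which can be pictured as below. The config keys used in the usage example are hypothetical; the real config fields may differ:

```python
def effective_config(catalog_config, dataset_config=None):
    """Sketch of the config inheritance principle: any setting not
    specified at the dataset level falls back to the catalog-level
    value. Keys in examples are hypothetical, not real config fields.
    """
    merged = dict(catalog_config)  # start from catalog-level defaults
    if dataset_config:
        # dataset-level values override, unset (None) values do not
        merged.update(
            {k: v for k, v in dataset_config.items() if v is not None}
        )
    return merged
```

For example, `effective_config({"logo": "default.svg", "link_color": "blue"}, {"link_color": "red"})` keeps the catalog-level logo while taking the dataset-level link color.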

New implementation options for front-end

Let's first write down a few assumptions about the future setup:

Option 1

The first idea is to make as few changes as possible to the current structure. Therefore:

This means that a whole metadata directory containing all node files will still be hosted with the catalog. Conceptually this is similar to the git-annex way of replacing file content with symlinks, although here a file is replaced with another file that contains a URL pointing to the actual file content.

The workflow is:

  1. user clicks on a dataset in a catalog frontend, or clicks on a directory in the file tree of a dataset
  2. javascript calculates relative location of node file
  3. URL is fetched from file
  4. blob is fetched from URL (some caching or other mechanism could be used to avoid fetching the same content multiple times)
  5. blob is rendered in frontend

The implication for catalog generation:

Option 2

The second idea assumes the granularity of catalog entry distribution and access is at the dataset-level:

This means that the metadata directory that is hosted alongside the catalog only contains config files per dataset id and version.

The workflow is:

  1. user clicks on a dataset in a catalog frontend
  2. javascript fetches the dataset entry's base URL from the config that is hosted with the HTML
  3. the dataset-node-file URL is encoded and blob is fetched
  4. blob is rendered in frontend
  5. user clicks on a child node in dataset file tree in frontend
  6. repeat steps 1 through 4

(again, some caching or other mechanism could be used to avoid fetching the same content multiple times)
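Steps 2 and 3 of this workflow amount to joining a per-dataset base URL with an encoded node identifier. A minimal sketch, assuming a hypothetical `metadata_base_url` config field and an invented URL layout:

```python
from urllib.parse import quote

def node_url(dataset_config, dataset_id, dataset_version, node_path=None):
    """Option 2 sketch: resolve a node's remote URL from the
    dataset-level config hosted with the HTML. 'metadata_base_url'
    and the <base>/<encoded-id>.json layout are assumptions for
    illustration, not the actual scheme.
    """
    base = dataset_config["metadata_base_url"].rstrip("/")
    node_id = f"{dataset_id}/{dataset_version}"
    if node_path is not None:  # child nodes in the dataset file tree
        node_id += f"/{node_path}"
    # encode the whole identifier (incl. slashes) into one path segment
    return f"{base}/{quote(node_id, safe='')}.json"
```

The point of this option is that only the small per-dataset config needs to be hosted with the catalog; all node blobs can live behind the dataset's own base URL, with its own access requirements.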

The implication for catalog generation:

Option 3

???

Notes on the options

A note on the access granularity

jsheunis commented 3 months ago

This issue served its purpose as a discussion starter. The whole idea of separating the "catalog" (of metadata) from the "frontend" (e.g. browser-based viewer) is part of the design process of the continued work in https://github.com/psychoinformatics-de/shacl-vue and https://github.com/datalad/datalad-catalog/tree/revolution.