datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

NF: implement design changes to separate "catalog" from "frontend/client" #253

Open jsheunis opened 1 year ago

jsheunis commented 1 year ago

The context

Let's consider the role of datalad-catalog in the context of data discoverability. It currently has two main functions: harmonizing structured metadata into catalog entries, and rendering those entries in a browsable, user-friendly frontend.

Even though data discoverability via a browsable, user-friendly catalog is an important use for harmonized metadata, it is not necessarily the only use case, and the harmonization step should be able to happen irrespective of which frontend or client ingests the catalog entries. For example, the harmonized metadata could be fed into a queryable graph database.

On the front-end side, an important challenge that datalad-catalog needs to be able to deal with in this context is the need for different access URLs for different metadata entries. There are multiple datasets and files in a catalog, and in order to allow varying degrees of access permissions (per user-dataset or user-file combination) these entries might be hosted at different locations, each with its own set of access requirements.

The current state

The two functions, data harmonization and the browsable frontend, are quite interdependent. E.g.:

What needs to happen

jsheunis commented 1 year ago

Ideas for updating the frontend functionality.

Current implementation

A "catalog" contains all the HTML and JS assets for the frontend as well as the catalog entries that need to be rendered. The catalog assumes all JSON blobs of the metadata entries are located locally in the metadata directory relative to the index.html file. Metadata entries are organized in a hierarchy of nodes; a node can be of type dataset or directory. Here's an example of the hierarchy and a directory node displayed:

Screenshot

In the hierarchy, each node location is identified by 1) the dataset_id, 2) the dataset_version, and 3) a hash of dataset_id + dataset_version (plus the node_path in the case of a directory-type node). The entry blob (fetched via a GET request) lives in a file with (part of) that hash in its filename.
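The hash-based lookup described above can be sketched as follows. Note that the join character, the hash algorithm (md5), and the exact directory layout here are assumptions for illustration, not necessarily datalad-catalog's actual scheme:

```python
import hashlib
from pathlib import PurePosixPath

def node_file_path(dataset_id, dataset_version, node_path=None):
    """Illustrative sketch: derive the on-disk location of a node's
    metadata blob from its identifying fields.

    NOTE: the join character, hash algorithm (md5), and path layout
    are assumptions; the real scheme may differ.
    """
    parts = [dataset_id, dataset_version]
    if node_path is not None:  # directory-type nodes also include their path
        parts.append(str(node_path))
    digest = hashlib.md5("-".join(parts).encode()).hexdigest()
    # Entries live under metadata/<id>/<version>/, with (part of) the
    # hash making up the filename
    return PurePosixPath("metadata", dataset_id, dataset_version,
                         digest[:3], f"{digest[3:]}.json")
```

The key property is that the location is fully determined by the identifying fields, so the frontend can compute it client-side without any server-side lookup.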

The workflow is:

  1. user clicks on a dataset in a catalog frontend, or clicks on a directory in the file tree of a dataset
  2. javascript calculates relative location of node file
  3. blob is fetched from file
  4. blob is rendered in frontend

For purposes of rendering content in the frontend, there might also be a config.json file on the dataset level (at the location my_catalog/metadata/dataset_id/dataset_version/config.json). There is always a config.json on the catalog level (at the location my_catalog/config.json). An inheritance principle between catalog-level and dataset-level config already exists and works: any setting not specified on the dataset level during catalog entry generation gets inherited from the catalog level.
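The inheritance between the two config levels amounts to a fallback merge, which can be pictured as below. The config keys used in the usage example are hypothetical; the real config fields may differ:

```python
def effective_config(catalog_config, dataset_config=None):
    """Sketch of the config inheritance principle: any setting not
    specified at the dataset level falls back to the catalog-level
    value. Keys in examples are hypothetical, not real config fields.
    """
    merged = dict(catalog_config)  # start from catalog-level defaults
    if dataset_config:
        # dataset-level values override, unset (None) values do not
        merged.update(
            {k: v for k, v in dataset_config.items() if v is not None}
        )
    return merged
```

For example, `effective_config({"logo": "default.svg", "link_color": "blue"}, {"link_color": "red"})` keeps the catalog-level logo while taking the dataset-level link color.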

New implementation options for front-end

Let's first write down a few assumptions about the future setup:

Option 1

The first idea is to make as few changes as possible to the current structure. Therefore:

This means that a whole metadata directory containing all node files will still be hosted with the catalog. Conceptually this is similar to the git-annex way of replacing file content with symlinks, although here a file is replaced with another file that contains a URL pointing to the actual file content.

The workflow is:

  1. user clicks on a dataset in a catalog frontend, or clicks on a directory in the file tree of a dataset
  2. javascript calculates relative location of node file
  3. URL is fetched from file
  4. blob is fetched from URL (some caching or other mechanism could be used to avoid fetching the same content multiple times)
  5. blob is rendered in frontend

The implication for catalog generation:

Option 2

The second idea assumes the granularity of catalog entry distribution and access is at the dataset-level:

This means that the metadata directory that is hosted alongside the catalog only contains config files per dataset id and version.

The workflow is:

  1. user clicks on a dataset in a catalog frontend
  2. javascript fetches the dataset entry's base URL from the config that is hosted with the HTML
  3. the dataset-node-file URL is encoded and blob is fetched
  4. blob is rendered in frontend
  5. user clicks on a child node in dataset file tree in frontend
  6. repeat steps 1 through 4

(again, some caching or other mechanism could be used to avoid fetching the same content multiple times)
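Steps 2 and 3 of this workflow amount to joining a per-dataset base URL with an encoded node identifier. A minimal sketch, assuming a hypothetical `metadata_base_url` config field and an invented URL layout:

```python
from urllib.parse import quote

def node_url(dataset_config, dataset_id, dataset_version, node_path=None):
    """Option 2 sketch: resolve a node's remote URL from the
    dataset-level config hosted with the HTML. 'metadata_base_url'
    and the <base>/<encoded-id>.json layout are assumptions for
    illustration, not the actual scheme.
    """
    base = dataset_config["metadata_base_url"].rstrip("/")
    node_id = f"{dataset_id}/{dataset_version}"
    if node_path is not None:  # child nodes in the dataset file tree
        node_id += f"/{node_path}"
    # encode the whole identifier (incl. slashes) into one path segment
    return f"{base}/{quote(node_id, safe='')}.json"
```

The point of this option is that only the small per-dataset config needs to be hosted with the catalog; all node blobs can live behind the dataset's own base URL, with its own access requirements.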

The implication for catalog generation:

Option 3

???

Notes on the options

A note on the access granularity

jsheunis commented 3 months ago

This issue served its purpose as a discussion starter. The whole idea of separating the "catalog" (of metadata) from the "frontend" (e.g. browser-based viewer) is part of the design process of the continued work in https://github.com/psychoinformatics-de/shacl-vue and https://github.com/datalad/datalad-catalog/tree/revolution.