datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

Ability to identify dataset based on ID and alias only #423

Closed jsheunis closed 2 months ago

jsheunis commented 4 months ago

Things that would be valuable:

jsheunis commented 4 months ago

Regarding the identification based on a dataset name

What we do not want is a catalog-level mapping between names and id+versions that would need to be maintained centrally.

I'm thinking of a function that adds the name (or an md5sum of the name string) as a new subdirectory in the metadata directory, and then a single file inside this new directory. The file would contain a json object of the same base format as those for datasets, directories and files, but then a new property would identify it as an object indicating that a route change is needed to the associated id and version (also contained in the object).

It could be exposed as a cli command, but could also run internally whenever a new dataset is added to a catalog.
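A minimal sketch of such a function, in Python. The function name `add_alias_redirect`, the `"type": "redirect"` marker property, and the choice to name the file after the md5 sum of the alias are all assumptions for illustration, not existing datalad-catalog API:

```python
import hashlib
import json
from pathlib import Path


def add_alias_redirect(metadata_dir, alias, dataset_id, dataset_version=None):
    """Write a redirect object for `alias` into the metadata directory.

    Hypothetical helper: creates <metadata_dir>/<alias>/<md5 of alias>.json.
    """
    # Assumption: the single file is named after the md5 sum of the alias.
    file_name = hashlib.md5(alias.encode("utf-8")).hexdigest() + ".json"
    alias_dir = Path(metadata_dir) / alias
    alias_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: a "type" property marks the object as a route change.
    redirect = {"type": "redirect", "dataset_id": dataset_id}
    if dataset_version is not None:
        redirect["dataset_version"] = dataset_version

    target = alias_dir / file_name
    target.write_text(json.dumps(redirect, indent=4), encoding="utf-8")
    return target
```

Because the function only touches its own subdirectory, it could run from a cli command or be called internally when a dataset is added, without any central name mapping.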

Regarding the identification based on a dataset ID alone

The current routing code would need to be updated to recognize that a version was not included in the URL parameters, and to handle the subsequent logic accordingly. That logic would need to distinguish between a dataset name and a dataset ID as the URL parameter. A name would follow the logic proposed above. An ID would trigger a lookup in the dataset ID directory to check whether a config file is included. The suggestion here is to keep track of version numbers and dates inside the dataset-specific config file. This would require adapting the current config logic, specifically to check whether a config file is supposed to exist and whether it still needs to be created.

jsheunis commented 4 months ago

Some further thoughts re using an alias to identify the dataset:

jsheunis commented 4 months ago

A challenge that is not yet solved is how to distinguish between an alias and a dataset-id. At the moment, if the URL contains a single parameter and nothing else, it is assumed to be an alias and handled accordingly. But that assumption is incorrect if we also want to allow the "concept link" for a dataset.

I think I have a solution for this. When considering URL parameters <dataset-id>/<dataset-version>, we can:

Then, when adding a new dataset version entry, the approach outlined above for an alias should always be followed, i.e. also for a new dataset-id, so that a dataset-id metadata file and an alias metadata file both point to a specific dataset-id and a specific dataset-version. In addition, the dataset-id metadata file can be used to track other (prior) versions of the same dataset.

For example:

Add a dataset entry with "dataset_id": "1234", "dataset_version": "abcd", which will create the following tree:

.
└── 1234
    ├── 81dc9bdb52d04dc20036dbd8313ed055.json
    ├── abcd
    │   └── 80c
    │       └── 581361fdf77d39fbb5cbd288641d1.json
    └── config.json

The structure of the dataset concept file would be something like:

{
    "type": "redirect",
    "dataset_id": "1234",
    "dataset_version": "abcd"
}

So, when the dataset-id URL parameter alone is entered into the browser, the app will calculate its md5 sum and read the content of 1234/81dc9bdb52d04dc20036dbd8313ed055.json and then reroute to 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json.

If both URL parameters are entered, the app will route directly to 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json
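The two routing cases above could be sketched as follows, in Python for brevity (the actual routing would live in the browser app). The helper names `concept_file` and `resolve` are hypothetical. The sketch returns the id and version to route to, rather than the final metadata file path, since how that path is derived is not spelled out here:

```python
import hashlib
import json
from pathlib import Path


def concept_file(metadata_dir, dataset_id):
    """Path of the dataset concept file: <dataset_id>/<md5 of dataset_id>.json."""
    name = hashlib.md5(dataset_id.encode("utf-8")).hexdigest() + ".json"
    return Path(metadata_dir) / dataset_id / name


def resolve(metadata_dir, dataset_id, dataset_version=None):
    """Return the (dataset_id, dataset_version) pair the app should route to."""
    if dataset_version is not None:
        # Both URL parameters were given: route directly, no lookup needed.
        return dataset_id, dataset_version

    # Only the dataset-id was given: read the concept file and follow it.
    path = concept_file(metadata_dir, dataset_id)
    obj = json.loads(path.read_text(encoding="utf-8"))
    if obj.get("type") != "redirect":
        raise ValueError(f"{path} is not a redirect object")
    return obj["dataset_id"], obj["dataset_version"]
```

For the example tree, `concept_file(".", "1234")` yields `1234/81dc9bdb52d04dc20036dbd8313ed055.json`, matching the file name shown above.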

Lastly, adding an alias for a specific dataset should add a similar structure. E.g. adding an alias mydataset for the dataset 1234 should add the following to the existing tree:

.
├── 1234
│   ├── 81dc9bdb52d04dc20036dbd8313ed055.json
│   ├── abcd
│   │   └── 80c
│   │       └── 581361fdf77d39fbb5cbd288641d1.json
│   └── config.json
└── mydataset
    └── 6e2253b987ae8237e472adebf3218366.json

with the content of the mydataset/6e2253b987ae8237e472adebf3218366.json file being something like:

{
    "type": "redirect",
    "dataset_id": "1234"
}

This will redirect to the dataset-id folder and read the content of 1234/81dc9bdb52d04dc20036dbd8313ed055.json which will then redirect to the dataset at id AND version 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json.
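The alias-to-id-to-version chain can be sketched as a small loop that keeps following redirect objects until one names a version. This is illustrative Python only: `redirect_file` and `follow_redirects` are hypothetical names, and the sketch assumes every redirect file is named after the md5 sum of its parent directory name (true for the dataset-id file in the example, assumed for the alias file):

```python
import hashlib
import json
from pathlib import Path


def redirect_file(metadata_dir, key):
    """Path of the redirect file for an alias or dataset-id: <key>/<md5 of key>.json."""
    name = hashlib.md5(key.encode("utf-8")).hexdigest() + ".json"
    return Path(metadata_dir) / key / name


def follow_redirects(metadata_dir, key, max_hops=5):
    """Follow redirects from `key` until an object names a dataset_version."""
    for _ in range(max_hops):
        path = redirect_file(metadata_dir, key)
        obj = json.loads(path.read_text(encoding="utf-8"))
        if obj.get("type") != "redirect":
            raise ValueError(f"{path} is not a redirect object")
        if "dataset_version" in obj:
            # Reached the dataset-id file, which knows the version.
            return obj["dataset_id"], obj["dataset_version"]
        # Alias file: only the dataset-id is known, so hop to its file.
        key = obj["dataset_id"]
    raise RuntimeError("too many redirects")
```

The `max_hops` guard is a defensive choice so that a misconfigured alias pointing at another alias cannot loop forever.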

In this way, the latest version of a dataset will be tracked in the dataset-id metadata file, and NOT in the dataset alias file. On the point of latest version, it might make sense to have dataset_version (in the redirect file) be an array/object of versions and last-updated-at datetimes, so that the code can determine automatically which one is the latest. For example:

{
    "type": "redirect",
    "dataset_id": "1234",
    "dataset_version":  {
       "abcd": "2022-10-10T14:48:00",
       "efgh": "2023-07-29T22:55:00"
    }
}
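With that shape, picking the latest version is a one-line comparison of the datetimes. A small Python sketch, assuming the values are ISO 8601 strings as in the example (`latest_version` is a hypothetical helper name):

```python
from datetime import datetime


def latest_version(redirect):
    """Return the version whose last-updated-at datetime is most recent."""
    versions = redirect["dataset_version"]
    # Parse the ISO 8601 strings so the comparison is on real datetimes.
    return max(versions, key=lambda v: datetime.fromisoformat(versions[v]))


redirect = {
    "type": "redirect",
    "dataset_id": "1234",
    "dataset_version": {
        "abcd": "2022-10-10T14:48:00",
        "efgh": "2023-07-29T22:55:00",
    },
}
print(latest_version(redirect))  # prints "efgh"
```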

Notes:

jsheunis commented 4 months ago
  • A question that surfaces now is whether the dataset-version should still be considered a URL parameter, or whether it should rather be re-implemented as a VueJS component property, i.e. something that is derived from the data and passed to the route component when its lifecycle begins. Requires some tests and more thought...

Some points in favor of keeping it as a URL parameter: