Closed jsheunis closed 2 months ago
What we do not want is a catalog-level mapping between names and id+versions that would need to be maintained centrally.
I'm thinking of a function that adds the name (or an md5sum of the name string) as a new subdirectory in the metadata directory, and then a single file inside this new directory. The file would contain a JSON object of the same base format as those for datasets, directories and files, but a new property would identify it as an object indicating that a route change is needed to the associated id and version (also contained in the object).
It could be exposed as a CLI command, but could also run internally whenever a new dataset is added to a catalog.
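A minimal sketch of such a function, assuming the md5sum of the name is used for the file name and that the new identifying property is called "type" with the value "redirect". The function name add_alias and its parameters are hypothetical and not part of any existing API:

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_alias(metadata_dir: Path, alias: str, dataset_id: str, dataset_version: str) -> Path:
    """Write <metadata_dir>/<alias>/<md5(alias)>.json containing a redirect object."""
    alias_dir = metadata_dir / alias
    alias_dir.mkdir(parents=True, exist_ok=True)
    file_name = hashlib.md5(alias.encode("utf-8")).hexdigest() + ".json"
    redirect = {
        "type": "redirect",  # assumed name for the new identifying property
        "dataset_id": dataset_id,
        "dataset_version": dataset_version,
    }
    target = alias_dir / file_name
    target.write_text(json.dumps(redirect, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = add_alias(Path(tmp), "mydataset", "1234", "abcd")
        print(path.relative_to(tmp))
        print(path.read_text(encoding="utf-8"))
```

Because the file lives next to the rest of the metadata, no central name-to-id mapping has to be maintained.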
The current routing code would need to be updated to recognize that a version was not included in the URL parameters, and handle the subsequent logic accordingly. That logic would need to distinguish between a dataset name as URL parameter and a dataset id as URL parameter. The former would follow the logic proposed above. The latter would look in the dataset ID directory to see whether a config file is included. The suggestion here is to keep track of version numbers and dates inside the dataset-specific config file. This would require adapting the current config logic, specifically to check whether a config file is supposed to exist and whether it still needs to be created.
Some further thoughts re using an alias to identify the dataset:
- <dataset-id>/<dataset-version> format in the URL itself.
- A challenge that is not yet solved is how to distinguish between an alias and a dataset-id. At the moment, if the URL contains the <dataset-id> parameter and nothing else, it is assumed to be an alias and handled accordingly. But that assumption is incorrect if we also want to allow the "concept link" for a dataset.
I think I have a solution for this. When considering URL parameters <dataset-id>/<dataset-version>, we can:
- always require the <dataset-id> part
- make the <dataset-version> optional
Then when adding a new dataset version entry, the approach outlined above for an alias should always be followed, i.e. also for a new dataset-id, so that a dataset-id metadata file as well as an alias metadata file would both point to a specific dataset id and specific dataset-version. In addition, the dataset-id metadata file can be used to track other (prior) versions of the same dataset.
For example:
Add a dataset entry with "dataset_id": "1234", "dataset_version": "abcd", which will create the following tree:
.
└── 1234
    ├── 81dc9bdb52d04dc20036dbd8313ed055.json
    ├── abcd
    │   └── 80c
    │       └── 581361fdf77d39fbb5cbd288641d1.json
    └── config.json
- 1234/config.json is existing functionality.
- 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json is existing functionality and has the following structure: <dataset-id>/<dataset-version>/<first-3-chars-of-md5sum-of-dataset_id_dash_dataset_version>/<rest-of-the-chars-of-md5sum-of-dataset_id_dash_dataset_version.json>
- 1234/81dc9bdb52d04dc20036dbd8313ed055.json is new functionality explained above. This is the "dataset-id metadata file" or "dataset concept file". There is no split into a folder-with-first-3-chars and then a shortened file name, as the folder might create conflicts with the other 2nd level folders that are reserved for full dataset-versions (abcd in this example).
The structure of the dataset concept file would be something like:
{
  "type": "redirect",
  "dataset_id": "1234",
  "dataset_version": "abcd"
}
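The two path conventions above can be sketched as follows. This is only an illustration: it assumes the hashed string for a versioned dataset is the id and version joined by a dash (as the placeholder name dataset_id_dash_dataset_version suggests), and the helper names are hypothetical:

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex md5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def version_file_path(dataset_id: str, dataset_version: str) -> str:
    """Existing layout: <id>/<version>/<first 3 md5 chars>/<remaining md5 chars>.json"""
    digest = md5_hex(f"{dataset_id}-{dataset_version}")  # assumed "<id>-<version>" input
    return f"{dataset_id}/{dataset_version}/{digest[:3]}/{digest[3:]}.json"


def concept_file_path(dataset_id: str) -> str:
    """Proposed layout: <id>/<full md5 of id>.json, with no 3-char subfolder."""
    return f"{dataset_id}/{md5_hex(dataset_id)}.json"


if __name__ == "__main__":
    print(concept_file_path("1234"))
    print(version_file_path("1234", "abcd"))
```

Keeping the concept file unsplit means the only second-level folders under a dataset id remain the dataset versions.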
So, when the dataset-id URL parameter alone is entered into the browser, the app will calculate its md5 sum, read the content of 1234/81dc9bdb52d04dc20036dbd8313ed055.json, and then reroute to 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json.
If both URL parameters are entered, the app will route directly to 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json.
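The routing decision described here could look roughly like the sketch below. It is written in Python purely for illustration (the actual catalog routing lives in the browser app), the function name resolve is hypothetical, and the "<id>-<version>" hash input is an assumption:

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def resolve(metadata_dir: Path, dataset_id: str, dataset_version: Optional[str] = None) -> Path:
    """Return the metadata file to load for the given URL parameters."""
    if dataset_version is None:
        # Only the dataset-id was given: read the concept file and follow it.
        concept_file = metadata_dir / dataset_id / f"{md5_hex(dataset_id)}.json"
        redirect = json.loads(concept_file.read_text(encoding="utf-8"))
        if redirect.get("type") != "redirect":
            raise ValueError(f"{concept_file} is not a redirect object")
        dataset_id = redirect["dataset_id"]
        dataset_version = redirect["dataset_version"]
    # Both parameters are known: route directly to the versioned file.
    digest = md5_hex(f"{dataset_id}-{dataset_version}")  # assumed hash input
    return metadata_dir / dataset_id / dataset_version / digest[:3] / f"{digest[3:]}.json"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        concept = root / "1234" / f"{md5_hex('1234')}.json"
        concept.parent.mkdir(parents=True)
        concept.write_text(
            json.dumps({"type": "redirect", "dataset_id": "1234", "dataset_version": "abcd"}),
            encoding="utf-8",
        )
        print(resolve(root, "1234").relative_to(root))
        print(resolve(root, "1234", "abcd").relative_to(root))
```

Both calls in the usage example end up at the same versioned file, which is the behaviour the two cases above describe.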
Lastly, adding an alias for a specific dataset should add a similar structure. E.g. adding an alias mydataset for the dataset 1234 should add the following to the existing tree:
.
├── 1234
│   ├── 81dc9bdb52d04dc20036dbd8313ed055.json
│   ├── abcd
│   │   └── 80c
│   │       └── 581361fdf77d39fbb5cbd288641d1.json
│   └── config.json
└── mydataset
    └── 6e2253b987ae8237e472adebf3218366.json
with the content of the mydataset/6e2253b987ae8237e472adebf3218366.json file being something like:
{
  "type": "redirect",
  "dataset_id": "1234"
}
This will redirect to the dataset-id folder and read the content of 1234/81dc9bdb52d04dc20036dbd8313ed055.json, which will then redirect to the dataset at id AND version: 1234/abcd/80c/581361fdf77d39fbb5cbd288641d1.json.
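The two-step redirect (alias, then dataset-id, then version) can be sketched as a small loop. This is an illustration only: the metadata files are modelled as an in-memory dict keyed by relative path, and the function names and the max_hops guard against circular aliases are hypothetical additions:

```python
import hashlib
from typing import Dict, Tuple


def redirect_key(name: str) -> str:
    """Relative path of the redirect file for an alias or dataset-id."""
    return f"{name}/{hashlib.md5(name.encode('utf-8')).hexdigest()}.json"


def follow_redirects(files: Dict[str, dict], name: str, max_hops: int = 5) -> Tuple[str, str]:
    """Follow redirect objects until one names both a dataset id and a version."""
    current = name
    for _ in range(max_hops):
        obj = files.get(redirect_key(current))
        if obj is None or obj.get("type") != "redirect":
            raise KeyError(f"no redirect file for {current!r}")
        if "dataset_version" in obj:
            # Reached a dataset concept file: id and version are both known.
            return obj["dataset_id"], obj["dataset_version"]
        # Alias file: it only names the dataset id, so hop to its concept file.
        current = obj["dataset_id"]
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    files = {
        redirect_key("mydataset"): {"type": "redirect", "dataset_id": "1234"},
        redirect_key("1234"): {"type": "redirect", "dataset_id": "1234", "dataset_version": "abcd"},
    }
    print(follow_redirects(files, "mydataset"))
    print(follow_redirects(files, "1234"))
```

Since the alias file never stores a version, updating the dataset concept file is enough to move every alias to a new version.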
In this way, the latest version for a dataset will be considered in the dataset-id metadata file, and NOT in the dataset alias file. On the point of latest version, it might make sense to have dataset_version (in the redirect file) be an array/object of versions and last-updated-at datetimes, so that the code can determine automatically which one is the latest. For example:
{
  "type": "redirect",
  "dataset_id": "1234",
  "dataset_version": {
    "abcd": "2022-10-10T14:48:00",
    "efgh": "2023-07-29T22:55:00"
  }
}
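Determining the latest version from such an object could be as simple as the sketch below. It assumes the datetimes are ISO 8601 strings as in the example, and the function name latest_version is hypothetical. A plain string value for dataset_version is passed through unchanged so the simpler redirect format still works:

```python
from datetime import datetime


def latest_version(redirect: dict) -> str:
    """Pick the version with the most recent last-updated-at datetime."""
    versions = redirect["dataset_version"]
    if isinstance(versions, str):
        # Simple format: a single version string, nothing to compare.
        return versions
    return max(versions, key=lambda v: datetime.fromisoformat(versions[v]))


if __name__ == "__main__":
    concept = {
        "type": "redirect",
        "dataset_id": "1234",
        "dataset_version": {
            "abcd": "2022-10-10T14:48:00",
            "efgh": "2023-07-29T22:55:00",
        },
    }
    print(latest_version(concept))  # efgh
```

Parsing the strings into datetime objects, rather than comparing them as text, keeps the comparison correct even if the stored format later gains fractional seconds.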
Notes:
- Rerouting should use route.replace() (or whatever the correct function is) so that it doesn't clog the route history like route.push() would.
- A question that surfaces now is whether the dataset-version should still be considered a URL parameter, or whether it should rather be re-implemented as a VueJS component property, i.e. something that is derived from the data and passed to the route component when its lifecycle begins. Requires some tests and more thought...
Some points in favor of keeping it as a URL parameter:
- Subdatasets link to a specific version (dataset-id/dataset-version) if they are contained within the same catalog. It wouldn't make sense to just navigate to the subdataset's concept page and let the logic determine the latest version.
Things that would be valuable:
- <catalog-url>/dataset/<dataset-id>. This can be entered into the browser and the dataset page should select and render the latest version known to the catalog.
- <catalog-url>/dataset/<dataset-id>/<dataset-version>. Additionally, one can add functionality for a user to be able to select any available version for the same dataset (where the dataset page currently shows the version number, there would be e.g. a dropdown link).
- <catalog-url>/dataset/<dataset-name> can be entered into the browser and the dataset page should then render the associated dataset version <catalog-url>/dataset/<dataset-id>/<dataset-version>