Update pipeline - Githubissues

Which steps should be executed when a new dataset is added to the catalog, or when an existing entry is updated?

The initial idea was described here: https://github.com/psychoinformatics-de/org/issues/264#issuecomment-1811489517

Any new dataset added to the catalog should first be structured as a datalad dataset (most likely manually by us, if not by an automated process), which is then added as a subdataset to abcdj-super.

Updates can then follow these steps:

1. Add dataset as subdataset to abcdj-super, save.
2. Run a check on abcdj-super to detect the newly added subdataset
3. Run a job on the newly added subdataset after cloning:
   - metalad_core extraction and translation to get the id and version info
   - run any other extractor + translation based on some prior specification or detection of specific content
   - metadata record generation via tabby-utils if there are tabby files detected in the dataset
   - (file level metadata record generation?)
   - add all generated metadata records to the catalog (datalad catalog-add --catalog docs ...)
4. Run a job on abcdj-super, which now has a version bump after the subdataset addition.
   - re-extract basic metadata including studyminimeta (if decided) and metalad_core (for submodules), and translate to catalog schema
   - add the updated metadata records to the catalog
   - reset the catalog homepage to the same id but new version
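
The steps above could be sketched as an ordered command plan. This is only a sketch under assumptions, not the pipeline's actual implementation: `build_update_plan`, its parameters and the record file names are hypothetical, translation to the catalog schema is left as a placeholder comment because no translation command is specified here, and the `datalad` options shown should be checked against the installed datalad, datalad-metalad and datalad-catalog versions.

```python
from typing import List


def build_update_plan(
    super_path: str,
    sub_url: str,
    sub_name: str,
    catalog: str,
    sub_records: str,
    super_records: str,
    super_id: str,
    super_version: str,
) -> List[List[str]]:
    """Return the update steps as argv lists; nothing is executed here."""
    sub_path = f"{super_path}/{sub_name}"
    return [
        # Step 1: add the dataset as a subdataset of the superdataset, then save.
        ["datalad", "clone", "--dataset", super_path, sub_url, sub_path],
        ["datalad", "save", "--dataset", super_path, "--message", f"Add {sub_name}"],
        # Step 2: list subdatasets so that the new one can be detected.
        ["datalad", "subdatasets", "--dataset", super_path],
        # Step 3: extract core metadata (id and version) from the subdataset.
        ["datalad", "meta-extract", "--dataset", sub_path, "metalad_core"],
        # Placeholder: translation to the catalog schema happens here and is
        # assumed to write the hypothetical file `sub_records`.
        ["datalad", "catalog-add", "--catalog", catalog, "--metadata", sub_records],
        # Step 4: re-extract for the superdataset, which now has a new version.
        ["datalad", "meta-extract", "--dataset", super_path, "metalad_core"],
        # Placeholder: translation is assumed to write `super_records`.
        ["datalad", "catalog-add", "--catalog", catalog, "--metadata", super_records],
        # Point the catalog homepage at the same id but the new version.
        [
            "datalad", "catalog-set", "--catalog", catalog,
            "--dataset-id", super_id, "--dataset-version", super_version, "home",
        ],
    ]
```

Each entry could then be run in order with `subprocess.run(cmd, check=True)`, so that a failing step stops the update before the catalog homepage is reset.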

On reconsideration, not every new dataset entry added to the catalog needs to be structured as a datalad dataset first. A counterexample is an existing datalad dataset for which an entry should be added to the catalog, but which will not be added as a subdataset to the abcdj-data superdataset, for whatever reason the data controllers decide. They will provide tabby files (containing the correct datalad dataset id and version), but if these files are added to a new datalad dataset, that dataset will get a completely different id and version. Whether or not the tabby files are added as part of a subdataset, the important part is that we capture the correct dataset metadata and provide the necessary linkage to the superdataset.

The latter can be done by:

- providing the abcdj-data superdataset metadata as tabby files
- for tabby files describing a new dataset that are added directly to the superdataset, a subdataset entry has to be made in the superdataset's tabby metadata to add linkage
- tabby and other metadata of the superdataset should be merged prior to adding its entry to the catalog.
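
The linkage and merge described above could look roughly like the following. It is a sketch under assumptions: records are plain dictionaries, the field names `dataset_id`, `dataset_version`, `dataset_path` and `subdatasets` are assumed to follow the catalog schema, and `merge_records` and `add_subdataset_link` are hypothetical helpers. Which source wins on conflicting keys is not decided in this issue; here the second record wins.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_records(base: Dict[str, Any], extra: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two metadata records describing the same dataset.

    Values from `extra` win, except for lists, which are concatenated while
    skipping items that are already present in `base`.
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        else:
            merged[key] = deepcopy(value)
    return merged


def add_subdataset_link(
    super_record: Dict[str, Any], sub_id: str, sub_version: str, sub_path: str
) -> Dict[str, Any]:
    """Return a copy of the superdataset record that links to one subdataset."""
    linked = deepcopy(super_record)
    entry = {
        "dataset_id": sub_id,
        "dataset_version": sub_version,
        "dataset_path": sub_path,
    }
    subdatasets = linked.setdefault("subdatasets", [])
    if entry not in subdatasets:  # avoid duplicate links on repeated runs
        subdatasets.append(entry)
    return linked
```

Both helpers return copies, so the original records stay untouched and the update can be repeated safely.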

There could be two options for metadata submission:

1. either restrict the format to text-based metadata files (tabby format for now) that are added to abcdj-data, likely in some human-identifiable subdirectory (e.g. abcdj-data/mydataset/mydataset_tabby.tsv)
2. or allow both text-based metadata files and subdatasets to be added to abcdj-data

I will describe the catalog update steps for option 2.

First some design decisions:

- no dependence on datalad-metalad
- no use of datalad-catalog's Translator class (to translate from metalad output to catalog schema)
- decide on a number of supported "metadata extractors" (scripts or pipelines to get a specific format/type of metadata into the catalog schema):
  - catalog_core.py
  - tabby-utils, including getting metadata from tabby files as well as generating a tabby-compatible file listing from data tree
  - a script based on bids_dataset extractor? (to be done, if so decided)
  - a script based on datacite_gin extractor? (to be done, if so decided)

And then the update steps:

First check if changes have been made to abcdj-data in the form of:
- a new subdataset or updated subdataset
- new tabby files or updated tabby files
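
The check above could be sketched as a classification of diff results. This assumes the input is a list of dictionaries shaped like `datalad diff` result records (with `path`, `type` and `state` keys), and that tabby files can be recognised by a `.tsv` suffix, as in the `mydataset_tabby.tsv` example. `classify_changes` is a hypothetical helper, not part of any existing tool.

```python
from typing import Dict, Iterable, List


def classify_changes(diff_results: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Split added or modified paths into subdatasets and tabby files."""
    changes: Dict[str, List[str]] = {"subdatasets": [], "tabby_files": []}
    for res in diff_results:
        # Only new or updated content triggers a catalog update.
        if res.get("state") not in ("added", "modified"):
            continue
        path = res["path"]
        if res.get("type") == "dataset":
            changes["subdatasets"].append(path)
        # Annexed files show up as symlinks, so both types are accepted.
        elif res.get("type") in ("file", "symlink") and path.endswith(".tsv"):
            changes["tabby_files"].append(path)
    return changes
```

The two resulting lists map onto the two cases handled below: one run of the subdataset steps per entry in `subdatasets`, and one run of the tabby steps per group of files in `tabby_files`.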
For a new subdataset:
1. Clone subdataset and ensure availability of tabby files if they exist
2. catalog_core for subdataset, new entry directly into catalog
3. load_tabby files from default location (if they exist) directly into catalog (use subdataset id and version, i.e. not from tabby files)
4. If files are to be added: run the tabby-utils script to generate a file listing (consider creating a catalog_filelist script to have something that is tabby-independent), add entry to catalog
5. load_tabby for superdataset, new entry directly into catalog (use subdataset id and version, i.e. not from tabby files)
6. catalog_core for superdataset, new entry directly into catalog
7. Set new superdataset id and version in catalog
For new tabby files:
1. ensure local availability of tabby files
2. load_tabby files, new entry directly into catalog
3. Get id and version of new entry, add new subdataset entry to dataset table of superdataset tabby file
4. load_tabby for superdataset, new entry directly into catalog (use subdataset id and version, i.e. not from tabby files)
5. catalog_core for superdataset, new entry directly into catalog
6. Set new superdataset id and version in catalog
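
Several steps above note that the id and version should come from the datalad dataset rather than from the tabby files. A minimal sketch of that override follows, assuming records are plain dictionaries with `dataset_id` and `dataset_version` keys. `read_dataset_identity` and `with_dataset_identity` are hypothetical helpers; the first reads the id from `.datalad/config` and uses the current commit as the version.

```python
import subprocess
from typing import Any, Dict, Tuple


def read_dataset_identity(dataset_path: str) -> Tuple[str, str]:
    """Read the datalad dataset id and the current commit of a local clone."""
    dataset_id = subprocess.run(
        ["git", "-C", dataset_path, "config", "--file", ".datalad/config",
         "datalad.dataset.id"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    dataset_version = subprocess.run(
        ["git", "-C", dataset_path, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return dataset_id, dataset_version


def with_dataset_identity(
    record: Dict[str, Any], dataset_id: str, dataset_version: str
) -> Dict[str, Any]:
    """Return a copy of `record` whose id and version are replaced."""
    updated = dict(record)
    updated["dataset_id"] = dataset_id
    updated["dataset_version"] = dataset_version
    return updated
```

A record loaded from tabby files would be passed through `with_dataset_identity` before it is added to the catalog, so that the catalog entry always matches the actual dataset state.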

abcd-j / data-catalog

Update pipeline #2
