abcd-j / data-catalog

https://data.abcd-j.de

Update pipeline #2

Open jsheunis opened 9 months ago

jsheunis commented 9 months ago

Which steps should be executed when a new dataset is added to the catalog, or when an existing entry is updated?

The initial idea was described here: https://github.com/psychoinformatics-de/org/issues/264#issuecomment-1811489517

Any new dataset added to the catalog should first be structured as a datalad dataset (most likely manually by us, if not by an automated process), which is then added as a subdataset to abcdj-super.

Updates can then follow these steps:

  1. Add dataset as subdataset to abcdj-super, save.
  2. Run a check on abcdj-super to detect the newly added subdataset
  3. Run a job on the newly added subdataset after cloning:
    • metalad_core extraction and translation to get the id and version info
    • run any other extractor + translation based on some prior specification or detection of specific content
    • metadata record generation via tabby-utils if there are tabby files detected in the dataset
    • (file level metadata record generation?)
    • add all generated metadata records to the catalog (datalad catalog-add --catalog docs ...)
  4. Run a job on abcdj-super, which now has a version bump after the subdataset addition.
    • re-extract basic metadata including studyminimeta (if decided) and metalad_core (for submodules), and translate to catalog schema
    • add the updated metadata records to the catalog
    • reset the catalog homepage to the same id but new version
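
To make step 2 a bit more concrete, here is a minimal sketch of how newly added subdatasets could be detected by comparing the `.gitmodules` content of abcdj-super before and after an update. It is only an illustration, not what the pipeline actually does: the function names are hypothetical, and getting the two versions of the file (e.g. via `git show <rev>:.gitmodules`) is left out.

```python
import re

# Matches the "path = ..." entries of a .gitmodules file, one per subdataset.
_PATH_LINE = re.compile(r"^[ \t]*path[ \t]*=[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def subdataset_paths(gitmodules_text: str) -> set[str]:
    """Return the subdataset paths registered in a .gitmodules file."""
    return set(_PATH_LINE.findall(gitmodules_text))


def added_subdatasets(old_gitmodules: str, new_gitmodules: str) -> list[str]:
    """Return paths present in the new .gitmodules but not in the old one."""
    return sorted(subdataset_paths(new_gitmodules) - subdataset_paths(old_gitmodules))


# Hypothetical example content; "studyA" and "studyB" are made-up names.
old = '[submodule "studyA"]\n\tpath = studyA\n\turl = ./studyA\n'
new = old + '[submodule "studyB"]\n\tpath = studyB\n\turl = ./studyB\n'

print(added_subdatasets(old, new))
```

Each path returned this way would then be cloned and passed to the job in step 3.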

On reconsideration, a new dataset entry added to the catalog does not necessarily have to be structured as a datalad dataset first. A counterexample is an existing datalad dataset for which an entry should be added to the catalog, but which will not be added as a subdataset to the abcdj-data superdataset, for whatever reason the data controllers decide. They will provide tabby files (containing the correct datalad dataset id and version), but if these are added to a new datalad dataset, that dataset will get a completely different id and version. Whether or not the tabby files are added as part of a subdataset, the important part is that we capture the correct dataset metadata and provide the necessary linkage to the superdataset.

The latter can be done by:

There could be two options for metadata submission:

  1. either restrict the format to text-based metadata files (tabby format for now) that are added to abcdj-data, likely in some human-identifiable subdirectory (e.g. abcdj-data/mydataset/mydataset_tabby.tsv)
  2. or allow both text-based metadata files and subdatasets to be added to abcdj-data

I will describe the catalog update steps for option 2.

First some design decisions:

And then the update steps:

  1. First check if changes have been made to abcdj-data in the form of:
    • a new subdataset or updated subdataset
    • new tabby files or updated tabby files
  2. For a new subdataset:
    1. Clone subdataset and ensure availability of tabby files if they exist
    2. catalog_core for subdataset, new entry directly into catalog
    3. load_tabby files from default location (if they exist) directly into catalog (use subdataset id and version, i.e. not from tabby files)
    4. If file-level entries are to be added: run the tabby-utils script to generate a file listing (consider creating a catalog_filelist script to have something that is tabby-independent), and add the entry to the catalog
    5. load_tabby for superdataset, new entry directly into catalog (use subdataset id and version, i.e. not from tabby files)
    6. catalog_core for superdataset, new entry directly into catalog
    7. Set new superdataset id and version in catalog
  3. For new tabby files:
    1. ensure local availability of tabby files
    2. load_tabby files, new entry directly into catalog
    3. Get id and version of new entry, add new subdataset entry to dataset table of superdataset tabby file
    4. load_tabby for superdataset, new entry directly into catalog (use subdataset id and version, i.e. not from tabby files)
    5. catalog_core for superdataset, new entry directly into catalog
    6. Set new superdataset id and version in catalog
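
As a rough sketch of step 1 above, the changed paths in abcdj-data could be split into the two cases like this. Everything here is an assumption for illustration: the `_tabby.tsv` suffix is taken from the example path `abcdj-data/mydataset/mydataset_tabby.tsv`, the names `Changes` and `classify_changes` are hypothetical, and the list of changed paths is expected to come from somewhere else (e.g. a diff of the superdataset between two versions).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumption: tabby files are recognised by this suffix, following the
# example path abcdj-data/mydataset/mydataset_tabby.tsv.
TABBY_SUFFIX = "_tabby.tsv"


@dataclass(frozen=True)
class Changes:
    subdatasets: tuple[str, ...]  # new or updated subdatasets
    tabby_files: tuple[str, ...]  # new or updated tabby files in the superdataset


def classify_changes(changed_paths, subdataset_paths):
    """Split changed paths (relative to the superdataset root) into the two cases."""
    known = {PurePosixPath(p) for p in subdataset_paths}
    subdatasets, tabby_files = set(), set()
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if path in known:
            # The recorded subdataset version changed, or the subdataset is new.
            subdatasets.add(str(path))
        elif path.name.endswith(TABBY_SUFFIX):
            tabby_files.add(str(path))
        # Anything else (README, .gitmodules, ...) does not trigger an update.
    return Changes(tuple(sorted(subdatasets)), tuple(sorted(tabby_files)))


# Hypothetical example; "studyA" and "studyB" are made-up subdataset names.
changes = classify_changes(
    ["studyA", "mydataset/mydataset_tabby.tsv", "README.md", ".gitmodules"],
    ["studyA", "studyB"],
)
print(changes)
```

Entries in `subdatasets` would go through the steps under 2, and entries in `tabby_files` through the steps under 3.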