Which steps should be executed when a new dataset should be added to the catalog, or when an update is made to an existing entry?

The initial idea was described here: https://github.com/psychoinformatics-de/org/issues/264#issuecomment-1811489517

In short: any new dataset added to the catalog should first be structured as a DataLad dataset (most likely manually by us, if not by an automated process), which is then added as a subdataset to `abcdj-super`.
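For the common case, this initial structuring and registration could look roughly like the following sketch, using the DataLad Python API; all paths and names are purely illustrative:

```python
import datalad.api as dl

# turn an existing data directory into a DataLad dataset;
# this is where the dataset id gets minted and versioning starts
dl.create(path="/data/mydataset", force=True)
ds = dl.Dataset("/data/mydataset")
ds.save(message="Structure mydataset as a DataLad dataset")

# register it as a subdataset of abcdj-super and save the addition
superds = dl.Dataset("/data/abcdj-super")
dl.clone(source=ds.path, path=str(superds.pathobj / "mydataset"), dataset=superds)
superds.save(message="Add mydataset as a subdataset")
```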
Updates can then follow these steps:
1. Add the dataset as a subdataset to `abcdj-super`, and save.
2. Run a check on `abcdj-super` to detect the newly added subdataset.
3. Run a job on the newly added subdataset after cloning it (see the sketch after these steps):
   - `metalad_core` extraction and translation, to get the id and version info
   - run any other extractor + translation, based on some prior specification or on detection of specific content
   - metadata record generation via tabby-utils, if tabby files are detected in the dataset
   - (file-level metadata record generation?)
   - add all generated metadata records to the catalog (`datalad catalog-add --catalog docs ...`)
4. Run a job on `abcdj-super`, which now has a version bump after the subdataset addition:
   - re-extract basic metadata, including `studyminimeta` (if so decided) and `metalad_core` (for submodules), and translate to the catalog schema
   - add the updated metadata records to the catalog
   - reset the catalog homepage to the same id but the new version
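The per-subdataset job in step 3 could be glued together roughly like this. It is only a sketch: the `metalad_core` extraction call follows the documented `datalad meta-extract` usage, but the exact flags of `catalog-translate` and `catalog-add` should be checked against the installed datalad-catalog version, and the `docs` catalog location is simply taken from the command above.

```python
import subprocess
from pathlib import Path

def run(cmd):
    """Run a command and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def subdataset_job(sub_url: str, workdir: Path, catalog: Path = Path("docs")):
    # clone the newly added subdataset into a working location
    sub = workdir / "subds"
    run(["datalad", "clone", sub_url, str(sub)])

    # metalad_core extraction: a metadata record carrying the dataset id + version
    core = workdir / "core.jsonl"
    core.write_text(run(["datalad", "meta-extract", "-d", str(sub), "metalad_core"]))

    # translate the metalad output to the catalog schema
    # (call signature assumed; check `datalad catalog-translate --help`)
    translated = workdir / "core_catalog.jsonl"
    translated.write_text(run(["datalad", "catalog-translate", str(core)]))

    # add the generated record(s) to the catalog, as in the note above
    # (`--metadata` flag assumed)
    run(["datalad", "catalog-add", "--catalog", str(catalog), "--metadata", str(translated)])
```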
On reconsideration, a new dataset entry added to the catalog should not necessarily first be structured as a DataLad dataset. A counter-example: there is an existing DataLad dataset for which an entry should be added to the catalog, but which will not be added as a subdataset to the `abcdj-data` superdataset, for whatever reason the data controllers decide. They will provide tabby files (containing the correct DataLad dataset id and version), but if these were added to a new DataLad dataset, that would create a completely different id and version. The important part, whether the tabby files are added as part of a subdataset or not, is that we capture the correct dataset metadata and provide the necessary linkage to the superdataset.
The latter can be done by:

- providing the `abcdj-data` superdataset metadata as tabby files
- for tabby files describing a new dataset that are added directly to the superdataset: a subdataset entry has to be made in the superdataset's tabby metadata to add the linkage
- tabby and other metadata of the superdataset should be merged prior to adding its entry to the catalog
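Whichever route the metadata takes, the linkage that eventually has to land in the catalog is a subdataset pointer inside the superdataset's entry. A minimal sketch of what such a pointer amounts to, with field names assumed from the catalog schema (to be checked with `datalad catalog-validate`):

```python
# assumed shape of one subdataset pointer inside the superdataset's catalog entry
subdataset_pointer = {
    "dataset_id": "<id of the newly described dataset>",
    "dataset_version": "<its version>",
    "dataset_path": "mydataset",  # location relative to the superdataset root
}

# the superdataset entry lists all such pointers under a "subdatasets" key
superdataset_entry_fragment = {"subdatasets": [subdataset_pointer]}
```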
There could be two options for metadata submission:

- either restrict the format to text-based metadata files (tabby format for now) that are added to `abcdj-data`, likely in some human-identifiable subdirectory (e.g. `abcdj-data/mydataset/mydataset_tabby.tsv`)
- or allow both text-based metadata files and subdatasets to be added to `abcdj-data`

I will describe catalog update steps for option 2.
First, some design decisions:

- no dependence on `datalad-metalad`
- no use of `datalad-catalog`'s `Translator` class (to translate from metalad output to the catalog schema)
- decide on a number of supported "metadata extractors" (scripts or pipelines that get a specific format/type of metadata into the catalog schema):
  - `catalog_core.py`, to generate a core catalog entry for a (sub)dataset (sketched below)
  - `tabby-utils`, including getting metadata from tabby files as well as generating a tabby-compatible file listing from the data tree
  - a script based on the `bids_dataset` extractor? (to be done, if so decided)
  - a script based on the `datacite_gin` extractor? (to be done, if so decided)
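To make the "no metalad, no Translator class" decision concrete: each of these scripts would emit catalog-schema records directly. A minimal sketch of what `catalog_core.py` could boil down to, using the DataLad Python API for id and version; the record fields beyond `dataset_id`/`dataset_version` are assumptions against the catalog schema and should be validated with `datalad catalog-validate`:

```python
import json
import datalad.api as dl

def core_record(dataset_path: str) -> dict:
    """Build a minimal catalog-schema entry for a local DataLad dataset."""
    ds = dl.Dataset(dataset_path)
    return {
        "type": "dataset",
        "dataset_id": ds.id,                      # the DataLad dataset id
        "dataset_version": ds.repo.get_hexsha(),  # current commit as the version
        "name": ds.pathobj.name,
        # provenance of this record; structure assumed from the catalog schema
        "metadata_sources": {
            "sources": [{"source_name": "catalog_core", "source_version": "0.1.0"}],
        },
    }

if __name__ == "__main__":
    import sys
    # print one JSON record, ready to be fed to `datalad catalog-add --catalog docs ...`
    print(json.dumps(core_record(sys.argv[1])))
```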
And then the update steps. First check if changes have been made to `abcdj-data` (a possible implementation of this check is sketched below), in the form of:

- a new subdataset or an updated subdataset
- new tabby files or updated tabby files
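One way to implement that check, assuming the update job remembers the last superdataset revision it processed; plain `git` is enough here, and the tabby file naming pattern is only an assumption:

```python
import subprocess

def git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True).stdout

def detect_changes(superds: str, last_rev: str, new_rev: str = "HEAD"):
    """Split changes in abcdj-data into subdataset changes and tabby file changes."""
    changed = git(superds, "diff", "--name-only", last_rev, new_rev).splitlines()
    # subdatasets are recorded as gitlinks (mode 160000) in the superdataset's tree
    gitlinks = {
        line.split("\t", 1)[1]
        for line in git(superds, "ls-tree", "-r", new_rev).splitlines()
        if line.startswith("160000")
    }
    sub_changes = [p for p in changed if p in gitlinks or p == ".gitmodules"]
    # tabby files identified by naming convention (assumed pattern)
    tabby_changes = [p for p in changed if p.endswith(".tsv") and "tabby" in p]
    return sub_changes, tabby_changes
```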
For a new subdataset (the whole sequence is sketched after this list):

1. Clone the subdataset and ensure availability of tabby files, if they exist.
2. `catalog_core` for the subdataset; new entry directly into the catalog.
3. `load_tabby` files from the default location (if they exist) directly into the catalog (use the subdataset's id and version, i.e. not the ones from the tabby files).
4. If files are to be added: run the `tabby-utils` script to generate a file listing (consider creating a `catalog_filelist` script to have something that is tabby-independent), and add the entry to the catalog.
5. `load_tabby` for the superdataset; new entry directly into the catalog (use the superdataset's own id and version, i.e. not the ones from the tabby files).
6. `catalog_core` for the superdataset; new entry directly into the catalog.
7. Set the new superdataset id and version in the catalog.
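Glued together, the new-subdataset branch could look roughly like this. The helper scripts (`catalog_core.py`, a tabby-to-catalog script, `catalog_filelist.py`) are the planned components from the design decisions above, and their command line interfaces are invented here for illustration; the default tabby location and the final `catalog-set` step are likewise assumptions.

```python
import subprocess
import tempfile
from pathlib import Path

CATALOG = "docs"  # catalog location, as in `datalad catalog-add --catalog docs ...`

def sh(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def handle_new_subdataset(sub_url: str, superds: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        sub = Path(tmp) / "subds"
        # 1. clone the new subdataset
        sh("datalad", "clone", sub_url, str(sub))
        # 2. core entry for the subdataset, straight into the catalog
        #    (hypothetical CLI of the planned catalog_core.py script)
        sh("python", "catalog_core.py", "--dataset", str(sub), "--catalog", CATALOG)
        # 3. tabby metadata, if present in the (assumed) default location;
        #    id/version come from the subdataset itself, not from the tabby files
        tabby = sub / ".datalad" / "tabby"
        if tabby.exists():
            sh("datalad", "get", "-d", str(sub), str(tabby))
            sh("python", "tabby_to_catalog.py", "--tabby", str(tabby),
               "--dataset", str(sub), "--catalog", CATALOG)
        # 4. optional file listing (hypothetical tabby-independent helper)
        sh("python", "catalog_filelist.py", "--dataset", str(sub), "--catalog", CATALOG)
        # 5.-6. refresh the superdataset entry from its tabby metadata and its core info
        sh("python", "tabby_to_catalog.py",
           "--tabby", str(Path(superds) / ".datalad" / "tabby"),
           "--dataset", superds, "--catalog", CATALOG)
        sh("python", "catalog_core.py", "--dataset", superds, "--catalog", CATALOG)
        # 7. finally, point the catalog homepage at the superdataset id and its new
        #    version (datalad-catalog provides `catalog-set` for this; flags omitted)
```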
For new tabby files (see the sketch after this list):

1. Ensure local availability of the tabby files.
2. `load_tabby` the files; new entry directly into the catalog.
3. Get the id and version of the new entry, and add a new subdataset entry to the `dataset` table of the superdataset's tabby file.
4. `load_tabby` for the superdataset; new entry directly into the catalog (use the superdataset's own id and version, i.e. not the ones from the tabby files).
5. `catalog_core` for the superdataset; new entry directly into the catalog.
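For the tabby route, the core of the tabby-to-catalog script could be a thin wrapper around `load_tabby` from the `datalad-tabby` extension. The property names read from the tabby record and the catalog field layout are simplified assumptions; the id and version are passed in by the caller (the subdataset's own values in the subdataset route, or the values determined for the new entry in the tabby-only route):

```python
import json
from pathlib import Path

from datalad_tabby.io import load_tabby  # tabby reader from the datalad-tabby extension

def tabby_to_catalog_record(tabby_dataset_sheet: Path,
                            dataset_id: str,
                            dataset_version: str) -> dict:
    """Build a (simplified) catalog dataset entry from a tabby 'dataset' sheet."""
    tabby = load_tabby(tabby_dataset_sheet)
    return {
        "type": "dataset",
        "dataset_id": dataset_id,            # supplied by the caller,
        "dataset_version": dataset_version,  # not read from the tabby files
        # illustrative property mappings only; the real tabby keys / JSON-LD terms
        # and the full mapping live in the tabby-utils script
        "name": tabby.get("title", ""),
        "description": tabby.get("description", ""),
        "metadata_sources": {
            "sources": [{"source_name": "tabby", "source_version": "0.1.0"}],
        },
    }

# the resulting record can be written to a JSON file and added with
# `datalad catalog-add --catalog docs ...`, as above
```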