Closed jsheunis closed 9 months ago
One thing that I am still uncertain about is how to incorporate functionality from `tabby-utils`. Some of it is more generic functionality (e.g. loading and related helpers), and some of it is `sfb1451`-specific (e.g. translation from the `sfb1451` tabby convention to the catalog schema). I will still look at the code in more detail, but @mslw if you have thoughts about this please share.
Regarding tabby-utils, I hope that you will find the code reasonably modular (at least it was in my mind...), but I have to admit that some expectations about the input, which may or may not hold true for abcdj, are expressed in code. This was, after all, sfb1451/tabby-utils (with datalad-tabby being the general purpose thing). For abcd-j, I'm afraid an independent fork, or maybe a rewrite with some borrowing, may be the most optimal course - but I'd be really happy if you were able to look into the code and form your own impression.
This has been addressed by "a rewrite with some borrowing", although "a lot" rather than "some".
In https://github.com/abcd-j/data-catalog/issues/2 I proposed design decisions:
and also:
The goals can be summarized as follows:
- `datalad-catalog`: generate the type of metadata (dataset- and file-level) that a catalog would often need (given the current tech stack and its future outlook)
- `abcd-j/data-catalog/code`: generate metadata that is more specific to the abcd-j catalog

Metadata that a catalog would often need (goal 2)
For datalad datasets:
- `metalad_core` extractor using `datalad-metalad` -> now implemented as a script in `datalad-catalog` that operates on a datalad dataset and outputs a catalog-ready record: https://github.com/datalad/datalad-catalog/blob/abcdj/datalad_catalog/extractors/catalog_core.py
- `metalad_core` extractor and `meta_conduct` using `datalad-metalad` -> the idea is to have a `catalog_filelist.py` extractor script that uses `datalad status` output and produces a catalog-ready list of files

For non-datalad datasets:
- `catalog_filelist.py` script, but functionality should be added to tell the script to operate on a non-datalad directory (either a gitworktree or just a standard filesystem)

Metadata that is more specific to the abcd-j catalog (goal 3)
Tabby tabby tabby:
- This is `abcd-j`-specific because it will use an `abcd-j`-specific tabby convention (and in the more general case: an `abcd-j`-specific schema implemented with LinkML), and this influences how the tabby-based metadata is translated into the catalog schema (for now while the catalog [...])
- [...] `datalad-catalog`, i.e. relevant functionality will move there.
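
Coming back to the file-list idea under goal 2: for the plain-filesystem case (no datalad dataset, so no `datalad status` output to build on), a minimal sketch could look like the following. This is not the actual `catalog_filelist.py`. The function name `list_files`, the record keys, and the choice to skip `.git` and `.datalad` are all assumptions made for illustration and have not been checked against the catalog schema.

```python
import os
from pathlib import Path

# Names that are skipped so that a git worktree is treated like a plain directory.
# (Assumption: version-control internals should not show up in the catalog.)
SKIP = {".git", ".datalad"}


def list_files(root, dataset_id, dataset_version):
    """Yield one catalog-style record per regular file found under `root`.

    The record keys are hypothetical and only meant to show the shape
    of a possible file-level record.
    """
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place and walk the rest in sorted order.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP)
        for name in sorted(filenames):
            if name in SKIP:
                # In a linked worktree or submodule, `.git` is a file.
                continue
            full = Path(dirpath) / name
            if not full.is_file():
                # Ignore broken symlinks and other non-regular entries.
                continue
            yield {
                "type": "file",
                "dataset_id": dataset_id,
                "dataset_version": dataset_version,
                "path": full.relative_to(root).as_posix(),
                "contentbytesize": full.stat().st_size,
            }
```

Each record could then be written out as one JSON line (e.g. with `json.dumps`), so the datalad and non-datalad code paths would only differ in how the list of paths is obtained.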