abcd-j / data-catalog

https://data.abcd-j.de

How to structure scripts for catalog entry generation #3

Closed jsheunis closed 9 months ago

jsheunis commented 1 year ago

In https://github.com/abcd-j/data-catalog/issues/2 I proposed design decisions:

  • no dependence on datalad-metalad
  • no use of datalad-catalog's Translator class (to translate from metalad output to catalog schema)
  • decide on a number of supported "metadata extractors" (scripts or pipelines to get a specific format/type of metadata into the catalog schema):
    • catalog_core.py
    • tabby-utils, including getting metadata from tabby files as well as generating a tabby-compatible file listing from a data tree
    • a script based on bids_dataset extractor? (to be done, if so decided)
    • a script based on datacite_gin extractor? (to be done, if so decided)

and also:

consider creating a catalog_filelist script to have something that is tabby-independent
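A tabby-independent file listing could be sketched roughly as below. This is only an illustration, not an existing script: the function names `list_files` and `to_json_lines` are hypothetical, and the record keys (`type`, `dataset_id`, `dataset_version`, `path`, `contentbytesize`) are assumed to match what the catalog schema expects for file-level records, which should be checked against the schema itself.

```python
import json
import os
from pathlib import Path

# Directories that hold version-control internals rather than data (assumption)
SKIP_DIRS = {".git", ".datalad"}


def list_files(root, dataset_id, dataset_version):
    """Yield one file-level record per file found under `root`.

    The record keys are assumptions about the catalog schema, not a
    confirmed mapping.
    """
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place and sort for reproducible output
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            record = {
                "type": "file",
                "dataset_id": dataset_id,
                "dataset_version": dataset_version,
                "path": path.relative_to(root).as_posix(),
            }
            try:
                record["contentbytesize"] = path.stat().st_size
            except OSError:
                # E.g. a symlink whose target is not present locally:
                # leave the size out instead of failing
                pass
            yield record


def to_json_lines(records):
    """Serialize records as JSON lines, one record per line."""
    return "\n".join(json.dumps(r) for r in records)
```

Calling `print(to_json_lines(list_files("path/to/data", "my-id", "my-version")))` would then emit one JSON object per file, with the dataset id and version supplied by the caller rather than read from any tabby file.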

The goals can be summarized as follows:

  1. minimize dependencies as much as possible (especially with regard to tools that will be deprecated or changed in the future)
  2. include functionality within datalad-catalog to generate the type of metadata (dataset- and file-level) that a catalog would often need (given the current tech stack and its future outlook)
  3. include functionality within abcd-j/data-catalog/code that will generate metadata that is more specific to the abcd-j catalog

Metadata that a catalog would often need (goal 2)

For datalad datasets:

For non-datalad datasets:

Metadata that is more specific to the abcd-j catalog (goal 3)

Tabby tabby tabby:

jsheunis commented 1 year ago

One thing that I am still uncertain about is how to incorporate functionality from tabby-utils. Some of it is more generic functionality (e.g. loading and related helpers), and some of it is sfb1451-specific (e.g. translation from the sfb1451 tabby convention to the catalog schema). I will still look at the code in more detail, but @mslw if you have thoughts about this please share.
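To make the split concrete, here is a minimal sketch of keeping the generic part (loading) apart from the convention-specific part (translation). It does not use the tabby-utils or datalad-tabby API; `load_key_value_table`, `translate`, and `EXAMPLE_FIELD_MAP` are hypothetical names, the input is assumed to be a plain tab-separated key/value table, and the target keys are only assumed to resemble catalog schema fields.

```python
import csv
import io


def load_key_value_table(text):
    """Generic part: parse a tab-separated key/value table into a dict.

    Each row is `key<TAB>value[<TAB>value...]`. Empty rows and rows whose
    first cell starts with '#' are skipped. A key with exactly one value maps
    to a string, any other number of values maps to a list.
    """
    record = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip() or row[0].startswith("#"):
            continue
        key = row[0].strip()
        values = [v for v in row[1:] if v != ""]
        record[key] = values[0] if len(values) == 1 else values
    return record


# Convention-specific part: a hypothetical mapping from source keys to
# target keys. A different project would supply a different mapping.
EXAMPLE_FIELD_MAP = {
    "title": "name",
    "description": "description",
    "keyword": "keywords",
}


def translate(record, field_map):
    """Rename known keys according to `field_map`; drop unknown keys."""
    return {
        target: record[source]
        for source, target in field_map.items()
        if source in record
    }
```

With this shape, only the mapping (and any project-specific post-processing) would need to change between sfb1451 and abcd-j, while the loader could be shared.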

mslw commented 1 year ago

Regarding tabby-utils, I hope that you will find the code reasonably modular (at least it was in my mind...), but I have to admit that some expectations about the input, which may or may not hold true for abcd-j, are expressed in code. This was, after all, sfb1451/tabby-utils (with datalad-tabby being the general purpose thing). For abcd-j, I'm afraid an independent fork, or maybe a rewrite with some borrowing, may be the best course - but I'd be really happy if you were able to look into the code and form your own impression.

jsheunis commented 9 months ago

This has been addressed by "a rewrite with some borrowing", although "a lot" rather than "some".