datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

A catalog metadata source format to support automatic ingestion #482

Open jsheunis opened 5 days ago

jsheunis commented 5 days ago

Context: https://github.com/psychoinformatics-de/org/pull/310

There are currently multiple catalog instances in production (ABCD-J, SFB1451, demo catalog, Public nEUro) that have heterogeneous maintenance workflows, i.e. different ways of providing and transforming metadata into a state that existing datalad-catalog commands can handle. This is not ideal.

To improve this situation, we can create, document, and publish a specification for a datalad-catalog compatible collection of dataset records in a well-defined format.

This will:

After initial discussion, the following structure was produced:

- catalog.json (should this be versioned? e.g., `config/v1/...`)
- records/
  - <name-id>/
    - config.json
    - <version-id>/
      - ...<format-id>...
      - ...<format-id>...

These would be standalone "dataset-version" metadata records living in the presented structure on a file system, with a top-level configuration that supports per-catalog customizations. Metadata records may be in various formats (e.g. ScientificDataset YAML, and tabby XLSX), i.e. the specification relates to structure and not to file format or content.
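To make the proposed layout concrete, here is a minimal Python sketch that walks such a directory and lists every "dataset-version" metadata file it finds. It is not part of datalad-catalog: the function names `iter_records` and `load_json_if_present` are hypothetical, and the code assumes only what the structure above states (a top-level `catalog.json`, a `records/` directory, one subdirectory per name-id holding a `config.json`, and one subdirectory per version-id holding metadata files in any format).

```python
import json
from pathlib import Path


def iter_records(root):
    """Yield (name_id, version_id, metadata_file) for each file found under
    records/<name-id>/<version-id>/ in the proposed layout.

    Hypothetical helper for illustration; not a datalad-catalog API.
    """
    records_dir = Path(root) / "records"
    if not records_dir.is_dir():
        return
    # Each directory directly under records/ is treated as a <name-id>.
    for name_dir in sorted(p for p in records_dir.iterdir() if p.is_dir()):
        # Each directory under a <name-id> is treated as a <version-id>;
        # the per-dataset config.json is a file, so it is skipped here.
        for version_dir in sorted(p for p in name_dir.iterdir() if p.is_dir()):
            # Files are yielded regardless of format (YAML, XLSX, ...),
            # since the structure does not constrain file format.
            for metadata_file in sorted(
                p for p in version_dir.iterdir() if p.is_file()
            ):
                yield name_dir.name, version_dir.name, metadata_file


def load_json_if_present(path):
    """Return parsed JSON from path, or None if the file does not exist."""
    path = Path(path)
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return None
```

A consumer could call `load_json_if_present(root / "catalog.json")` for the catalog-level configuration, then loop over `iter_records(root)` and dispatch each file to a format-specific reader. How files map to format-ids is left open here, as it is in the proposal.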

TODO

jsheunis commented 4 days ago

Been reading through the existing documentation and I think the best candidate for placing this new addition would be the Pipeline Description section, which describes a functioning but outdated view of generating a catalog entry from a datalad dataset using metalad and catalog translators. I think that whole page can be rewritten with the focus being the content proposed in the current issue.

Afterwards, we should also revamp/update the Metadata and datalad-catalog page to bring it in line with the metadata source description.