datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

Instructions for creating a catalog from files with catalog-ready metadata #311

Closed: jsheunis closed 1 month ago

jsheunis commented 1 year ago
  1. Install datalad-catalog from the main branch
  2. Create an empty catalog: datalad catalog create -c <path-to-catalog-dir>
    • this uses the default catalog config shipped with the package, which may not be ideal for a specific catalog but should be fine as a starting point
  3. Create a dataset-level metadata file in JSON Lines format (one JSON object per line)
  4. Create a file-level metadata file in JSON Lines format
    • each JSON line can be in the following format: (link to be posted)
    • minimum required fields: type, dataset_id, dataset_version, path
    • type must be "file"
    • path is path of file relative to parent dataset
  5. Add dataset metadata to catalog: datalad catalog add -c <path-to-catalog-dir> -m <path-to-dataset-metadata-file>
  6. Add file metadata to catalog: datalad catalog add -c <path-to-catalog-dir> -m <path-to-file-metadata-file>
  7. Set catalog superdataset, i.e. homepage: datalad catalog set-super -c <path-to-catalog-dir> -i <id-of-super> -v <version-of-super>
  8. Publish the catalog somewhere.
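Steps 3 and 4 could be sketched in Python roughly as follows. This is a minimal sketch, not part of datalad-catalog: the helper name `to_jsonl` and the output file names are assumptions, and the example records only carry fields mentioned in this thread (the full schema is not given here).

```python
import json
from pathlib import Path


def to_jsonl(records):
    """Serialize an iterable of dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Illustrative records; the IDs and version strings are placeholders.
dataset_records = [
    {"type": "dataset", "dataset_id": "1234", "dataset_version": "latest", "name": "Demo"},
]
file_records = [
    # Carries exactly the four minimum required fields for a file record.
    {"type": "file", "dataset_id": "1234", "dataset_version": "latest", "path": "myfile.txt"},
]

if __name__ == "__main__":
    # Hypothetical file names; pass these to `datalad catalog add -m ...`.
    Path("dataset_metadata.jsonl").write_text(to_jsonl(dataset_records), encoding="utf-8")
    Path("file_metadata.jsonl").write_text(to_jsonl(file_records), encoding="utf-8")
```

Keeping each record on a single line matters here, since the file is read line by line rather than as one JSON document.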

jsheunis commented 1 year ago

Demo:

Example content of dataset_metadata.jsonl:

{ "type": "dataset", "dataset_id": "1234", "dataset_version": "latest", "name": "Demo", "description": "This is a dataset description", "authors": [ { "name": "Stephan Heunis" }, { "name": "Michael Hanke" } ], "keywords": [ "minimal", "example", "catalog", "from", "metadata" ], "subdatasets": [ { "dataset_id": "5678", "dataset_version": "latest", "dataset_path": "mysubdataset" } ], "top_display": [ { "name": "Storage", "value": "7PB" }, { "name": "Source", "value": "Open" } ] }
{ "type": "dataset", "dataset_id": "5678", "dataset_version": "latest", "name": "Demo subdataset", "description": "This is a SUBdataset description", "authors": [ { "name": "Stephan Heunis 2" }, { "name": "Michael Hanke 2" } ], "keywords": [ "subdubdub"] }
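In the demo, the subdataset listed under "subdatasets" also has its own dataset record. A quick sanity check for that pairing could look like the sketch below. It assumes that every referenced subdataset should have a record in the same file, which this thread does not state as a requirement; `missing_subdatasets` is a hypothetical helper, not a datalad-catalog function.

```python
import json


def missing_subdatasets(jsonl_text):
    """Return (dataset_id, dataset_version) pairs that are referenced as
    subdatasets but have no dataset record of their own in the same text."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    known = {
        (r["dataset_id"], r["dataset_version"])
        for r in records
        if r.get("type") == "dataset"
    }
    missing = []
    for r in records:
        for sub in r.get("subdatasets", []):
            key = (sub["dataset_id"], sub["dataset_version"])
            if key not in known:
                missing.append(key)
    return missing


# Trimmed-down version of the two demo records.
example = "\n".join([
    '{"type": "dataset", "dataset_id": "1234", "dataset_version": "latest", '
    '"subdatasets": [{"dataset_id": "5678", "dataset_version": "latest", '
    '"dataset_path": "mysubdataset"}]}',
    '{"type": "dataset", "dataset_id": "5678", "dataset_version": "latest"}',
])
print(missing_subdatasets(example))  # prints []
```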

Example content of file_metadata.jsonl:

{ "type": "file", "dataset_id": "1234", "dataset_version": "latest", "path": "myfile.txt", "contentbytesize": 12345, "url": "https://github.com/"}
{ "type": "file", "dataset_id": "1234", "dataset_version": "latest", "path": "subdir/my2ndfile.txt", "contentbytesize": 99345, "url": "https://github.com/"}
{ "type": "file", "dataset_id": "5678", "dataset_version": "latest", "path": "mysubdatasetfile.txt", "contentbytesize": 666666, "url": "https://github.com/"}
{ "type": "file", "dataset_id": "5678", "dataset_version": "latest", "path": "subbydirry/fubar.txt", "contentbytesize": 11111111, "url": "https://github.com/"}
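Before adding file metadata, each line can be checked against the minimum required fields named in the instructions (type, dataset_id, dataset_version, path, with type set to file). The following is a minimal sketch; `check_file_record` is a hypothetical helper and does not replace whatever validation the catalog itself performs.

```python
import json

REQUIRED_FILE_FIELDS = ("type", "dataset_id", "dataset_version", "path")


def check_file_record(line):
    """Return a list of problems found in one JSON line (empty if none)."""
    record = json.loads(line)
    problems = [
        f"missing field: {name}" for name in REQUIRED_FILE_FIELDS if name not in record
    ]
    # A missing type is already reported above; only flag a wrong value here.
    if record.get("type", "file") != "file":
        problems.append(f"type is {record['type']!r}, expected 'file'")
    return problems


line = (
    '{ "type": "file", "dataset_id": "1234", "dataset_version": "latest", '
    '"path": "myfile.txt", "contentbytesize": 12345, "url": "https://github.com/"}'
)
print(check_file_record(line))  # prints []
```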

Run the commands:

> datalad catalog create -c Desktop/mycatalog
catalog_create(ok): Desktop/mycatalog [Catalog successfully created at: Desktop/mycatalog]

> datalad catalog add -c Desktop/mycatalog -m Desktop/dataset_metadata.jsonl
catalog_add(ok): Desktop/mycatalog [Metadata items successfully added to catalog]

> datalad catalog add -c Desktop/mycatalog -m Desktop/file_metadata.jsonl
catalog_add(ok): Desktop/mycatalog [Metadata items successfully added to catalog]

> datalad catalog set-super -c Desktop/mycatalog -i 1234 -v latest
catalog_set_super(ok): /Users/jsheunis [Superdataset successfully set for catalog]

> datalad catalog serve -c Desktop/mycatalog
...
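The command sequence above can also be scripted. The sketch below only assembles the argument lists using the subcommands and flags exactly as written in this thread; note that the closing comment says these commands are outdated relative to main, so treat them as an assumption to verify. `catalog_commands` is a hypothetical helper, and `serve` is left out because it keeps running.

```python
import shlex


def catalog_commands(catalog_dir, dataset_meta, file_meta, super_id, super_version):
    """Build argv lists for the demo sequence: create, add (twice), set-super."""
    return [
        ["datalad", "catalog", "create", "-c", catalog_dir],
        ["datalad", "catalog", "add", "-c", catalog_dir, "-m", dataset_meta],
        ["datalad", "catalog", "add", "-c", catalog_dir, "-m", file_meta],
        ["datalad", "catalog", "set-super", "-c", catalog_dir,
         "-i", super_id, "-v", super_version],
    ]


commands = catalog_commands(
    "Desktop/mycatalog",
    "Desktop/dataset_metadata.jsonl",
    "Desktop/file_metadata.jsonl",
    "1234",
    "latest",
)
for argv in commands:
    # Only prints each command; to execute a step, pass argv to
    # subprocess.run(argv, check=True) after `import subprocess`.
    print(shlex.join(argv))
```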

Resulting catalog:

https://github.com/datalad/datalad-catalog/assets/10141237/5d7df906-8b10-4eb8-a964-372c3e0bfd12

Comments:

mih commented 1 year ago

"path is path of file relative to parent dataset"

We should clarify which path convention this has to follow. I assume POSIX.
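If POSIX is indeed the intended convention (an assumption at this point, not confirmed in this thread), paths produced on Windows could be normalized before they are written to the metadata file. A minimal sketch using only the standard library follows; both helper names are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_posix(relpath):
    """Convert a relative path written with either separator style to POSIX form."""
    # PureWindowsPath accepts both "\\" and "/" as separators.
    return PureWindowsPath(relpath).as_posix()


def is_relative_posix(path):
    """True if `path` contains no backslashes and is not absolute in POSIX terms."""
    return "\\" not in path and not PurePosixPath(path).is_absolute()


print(to_posix("subdir\\my2ndfile.txt"))  # prints subdir/my2ndfile.txt
```

Treating every backslash as a separator is a simplification: a backslash is a legal filename character on POSIX systems, so this conversion is only safe for paths known to come from Windows.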

mih commented 1 year ago

I can confirm that this works for me.

jsheunis commented 1 month ago

This issue served its purpose a while back already. Compared to the current state in main, the commands are now outdated. Closing.