Add RO-Crate metadata to notebooks and repositories

wragge commented 1 year ago

As part of the HASS Community Data Lab, I'm adding RO-Crate metadata to notebooks and repositories to enable information to be harvested and used to populate a tools registry.

The basic idea is that RO-Crate metadata would be saved in a notebook's metadata. Then, when notebooks are pushed to a repository, an action would run to extract the metadata from the notebooks and save it into an RO-Crate JSON file. Some work on this has already been done as part of the GLAM Workbench repository template (thanks to ATAP).

Because I'm going to be working with existing repositories, rather than starting new ones, I'm going to have to add the necessary scripts to each repository, and find a (not too painful) way of adding the metadata to existing notebooks.

What metadata?

Metadata describing a repository, based on schema.org and RO-Crate spec:

@id
identifier - Zenodo DOI
@type (is this a RepositoryCollection?)
name
description
documentation -- link to GW section
version
license
url -- GitHub

Metadata describing a notebook, based on schema.org and RO-Crate spec:

@type: ["File", "SoftwareSourceCode"] (or SoftwareApplication? or SoftwareWorkflow?)
name -- title of notebook
creators
description
programmingLanguage
runtimePlatform
softwareRequirements -- list ids of packages imported/used
codeRepository -- link to GitHub
documentation -- link to GW page
encodingFormat: "application/x-ipynb+json"
input -- source of data
conformsTo - ?
about -- subjects?
keywords -- align with tags in GW?
license

I think the about and keywords properties will take the most thought, as they will be important access points in the context of a tool registry. Do I need to use or develop a controlled list?

Writing metadata to notebooks

Notebooks are just JSON, so I could just read, edit, and write them as JSON files, but it might be safer to use nbformat to ensure that everything conforms with the notebook file format.

Adding metadata (Jupyter Book)
notebook file format -- metadata
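
A minimal sketch of the nbformat approach, assuming the RO-Crate properties are stored under a notebook-level `rocrate` metadata key. The function names are hypothetical, and nbformat is a third-party package that is imported only where it is needed:

```python
def set_rocrate_metadata(nb, rocrate):
    """Store RO-Crate properties under the notebook-level 'rocrate' metadata key.

    Works on a plain dict loaded with json or on an nbformat NotebookNode.
    """
    nb.setdefault("metadata", {})["rocrate"] = rocrate
    return nb


def add_metadata_to_file(path, rocrate):
    """Read a notebook, add the metadata, validate it, and write it back."""
    import nbformat  # third-party: pip install nbformat

    nb = nbformat.read(path, as_version=4)
    set_rocrate_metadata(nb, rocrate)
    nbformat.validate(nb)  # raises if the notebook no longer matches the schema
    nbformat.write(nb, path)
```

Validating before writing is the main gain over editing the JSON by hand: a notebook that no longer matches the file format is rejected instead of being saved.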

I think I'll probably create a script to add basic metadata to notebooks, then I'll manually edit as required. Fields I could automatically populate:

@type
name -- from the title of the notebook
creators -- start with me
description -- extract first para after title?
programmingLanguage -- all Python
runtimePlatform -- can I get this from pyenv?
softwareRequirements -- get a list of Python imports, then need to map these to ids?
codeRepository -- get from git
documentation -- in most cases the path will be the same as the file title
encodingFormat: "application/x-ipynb+json"
input -- mostly Trove API?
conformsTo - ?
license - all MIT
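
Some of these fields could be derived from the notebook itself. The sketch below is one possible approach, not a finished script: it assumes the title is the first level-1 Markdown heading, the description is the first paragraph after it, and the requirements can be approximated by the top-level modules imported in code cells. The function names are hypothetical.

```python
import ast


def cell_source(cell):
    """Return a cell's source as one string (raw JSON may store a list of lines)."""
    source = cell.get("source", "")
    return source if isinstance(source, str) else "".join(source)


def title_and_description(nb):
    """Guess the title (first level-1 heading) and description (first paragraph after it)."""
    markdown = "\n\n".join(
        cell_source(c) for c in nb.get("cells", []) if c.get("cell_type") == "markdown"
    )
    title, paragraph = None, []
    for line in markdown.splitlines():
        stripped = line.strip()
        if title is None:
            if stripped.startswith("# "):
                title = stripped[2:].strip()
            continue
        if stripped.startswith("#"):
            if paragraph:
                break  # a new heading ends the paragraph
            continue  # skip headings that come before any paragraph text
        if stripped:
            paragraph.append(stripped)
        elif paragraph:
            break  # a blank line ends the paragraph
    return title, " ".join(paragraph) or None


def imported_packages(nb):
    """List the top-level modules imported in a notebook's code cells."""
    packages = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line
            for line in cell_source(cell).splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # skip cells that still cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                packages.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return sorted(packages)
```

The import list includes standard library modules and gives import names rather than package ids, so it would still need filtering and mapping before it could fill softwareRequirements.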

Finish off RO-Crate action

See the pull request from ATAP on the repo template -- adjust and finish off.

Later on...

Once this is done, I should change the way the documentation pages are generated to pull as much as possible from the RO-Crate metadata, so I'm not managing the same info in different places.

wragge commented 1 year ago

Make use of Tadirah for tags/subjects: https://tadirah.info/

wragge commented 1 year ago

Things to do on the RO-Crate maker GitHub action:

add top level name and description from metadata file
add top level links to GitHub repo (url) and GW section (documentation)
add broader range of props from nbs
expect nb metadata to be namespaced under rocrate
capture relationships between nbs and datasets
use nbformat to edit nb metadata

wragge commented 1 year ago

https://www.researchobject.org/ro-crate/1.1/appendix/jsonld.html

Multiple values and references can be represented using JSON arrays, as exemplified in hasPart above; however as the RO-Crate JSON-LD is in compacted form, any single-element arrays like "author": [{"@id": "#alice"}] SHOULD be unpacked to a single value like "author": {"@id": "#alice"}.
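
A small sketch of that rule applied to one entity, assuming entities are handled as plain Python dicts. The function name is hypothetical:

```python
def unpack_single_values(entity):
    """Replace single-element lists with their only element, leaving other values alone."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in entity.items()
    }
```

For example, an entity with "author": [{"@id": "#alice"}] comes back with "author": {"@id": "#alice"}, while a hasPart list with two or more references is left as a list.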

wragge commented 1 year ago

OK, after much to and fro I think I'm going to take the following approach (slightly different from the ATAP pull request).

on CookieCutter initialisation, create a basic rocrate file in the new repository with the project name, description, and creator details from the cookiecutter config file
have an update_crate.py file in the scripts directory of the repository which will gather info from notebooks and add to the crate

So update_crate.py would be run locally before any changes get pushed, not in a GitHub action on push. This enables me to make manual changes to the crate. Also update_crate.py will update, rather than replace, existing entities. This means I can automatically populate the crate with details from nbs, then enrich as necessary without losing any of these manual changes. I think this best suits my workflow.
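
A rough sketch of the update-rather-than-replace idea, not the actual update_crate.py. It assumes the crate is handled as a plain dict in the RO-Crate 1.1 JSON-LD layout, that entities are matched on `@id`, and that harvested values overwrite same-named properties while properties added by hand are kept. The function names are hypothetical, and the skeleton leaves out other properties a complete crate would need.

```python
def new_crate(name, description):
    """Build a skeleton RO-Crate 1.1 structure: metadata descriptor plus root dataset."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {"@id": "./", "@type": "Dataset", "name": name, "description": description},
        ],
    }


def upsert_entity(crate, harvested):
    """Update the entity with the same @id in place, or append it if it is new."""
    for entity in crate["@graph"]:
        if entity.get("@id") == harvested["@id"]:
            # Only the harvested properties change; manual additions survive.
            entity.update(harvested)
            return entity
    crate["@graph"].append(dict(harvested))
    return crate["@graph"][-1]
```

Whether a harvested value should win over a manually edited value for the same property is a design choice. The sketch lets the harvested value win, so manual enrichment would be safest in properties the script does not populate.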

wragge commented 10 months ago

Added to GW Repository Template: https://github.com/GLAM-Workbench/glam-workbench-template
