Coleridge-Initiative / RCDatasets

Creative Commons Zero v1.0 Universal

RCDatasets

This repo provides the datasets.json file, used as "ground truth" for the knowledge graph work in ADRF and Rich Context.

For a diagram of how this dataset list fits within the overall ETL workflow used to update the knowledge graph, see the OmniGraffle source at docs/kg_etl_workflow.graffle in this repo.

Managing Updates

Having a separate repo helps us manage changes carefully. This is metadata, not data, and it serves as the basis for linking. Any change must therefore be audited, to avoid breaking links in the graph downstream of an update.

Consequently, each update must be handled through a pull request and audited in a code review.

  1. work in a separate branch and update from master
  2. look for other PRs (work in progress) and note the IDs used
  3. request a range of up to 5 IDs on the rich_context channel on Slack
  4. make edits in your branch
  5. confirm through unit tests: python test.py

At that point, create a PR and have someone else on the team review it.

Also, don't commit code here except for consistency checks used on the dataset list itself.
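As an illustration of the kind of consistency check that belongs here, the following is a minimal sketch that flags IDs used by more than one record. It is not the contents of test.py. It assumes that datasets.json holds a JSON array of objects and that each record carries its ID in a field named "id"; the field name, the helper names, and the sample records are all hypothetical.

```python
import json
from collections import Counter


def load_records(path="datasets.json"):
    """Load the dataset list (assumes the file is a JSON array of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_duplicate_ids(records, id_field="id"):
    """Return a sorted list of ID values that appear on more than one record.

    The field name "id" is an assumption; pass id_field to override it.
    """
    counts = Counter(rec[id_field] for rec in records if id_field in rec)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample records, not taken from datasets.json.
sample = [
    {"id": "dataset-001", "title": "Example A"},
    {"id": "dataset-002", "title": "Example B"},
    {"id": "dataset-001", "title": "Example C"},
]

print(find_duplicate_ids(sample))
```

Against the real file, the same check would be run as find_duplicate_ids(load_records()), and an empty result would mean no ID collisions were found.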

Required Fields

At a minimum, each record in the datasets.json file must have these required fields:

For the names, use what the data provider shows on their web page and try to be as concise as possible.
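A required-field check could be sketched as below. The field names in REQUIRED are placeholders chosen for illustration, not the repo's actual list of required fields, and missing_fields is a hypothetical helper rather than something defined in test.py.

```python
def missing_fields(record, required):
    """Return the required field names that are absent or empty in a record."""
    return [name for name in required if not record.get(name)]


# Placeholder field names; substitute the repo's real required fields.
REQUIRED = ("id", "provider", "title")

# Hypothetical record, not taken from datasets.json.
record = {"id": "dataset-001", "title": "Example A"}

print(missing_fields(record, REQUIRED))
```

Treating empty strings as missing is a design choice in this sketch; a stricter or looser rule may suit the real data better.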

When adding records:

Other fields that may be included:

To Do

Quality checks on dataset entries

Additions to test.py

Enrich datasets.json with additional metadata

The datasets enumerated in datasets.json may have additional metadata, which would be given to us by the data provider or client using the dataset.

These fields might include (but are not limited to):