HDRUK / schemata

HDR UK Schemas
https://hdruk.github.io/schemata/
Apache License 2.0
hdruk metadata schema


HDR UK Schemata - Dataset V2.1

1. HDR UK Dataset Schema - YAML - JSON

The latest specification that datasets must meet to be onboarded onto the Gateway is published in this repository and comprises the following:

2. Dataset Properties Breakdown

Below is a breakdown of the HDR UK V2 Dataset Schema by its properties and sub-properties, as defined in the JSON Schema. Each property from 1-7 has its own schema, which describes its sub-properties, including the data type of each and whether it is required.

0. Metadata: Properties generated when a dataset is entered into the system.

1. summary: Summary metadata must be completed by Data Custodians onboarding metadata into the Innovation Gateway MVP.

2. documentation: Documentation can include a rich text description of the dataset or links to media such as documents, images, presentations, videos or links to data dictionaries, profiles or dashboards. Organisations are required to confirm that they have permission to distribute any additional media.

3. coverage: This information includes attributes for geographical and temporal coverage, cohort details etc. to enable a deeper understanding of the dataset content so that researchers can make decisions about the relevance of the underlying data.

4. provenance: Provenance information allows researchers to understand data within the context of its origins and can be an indicator of quality, authenticity and timeliness.

5. accessibility: Accessibility information allows researchers to understand access, usage, limitations, formats, standards and linkage or interoperability with toolsets.

6. enrichmentAndLinkage: This section includes information about related datasets that may have previously been linked, and indicates whether there is an opportunity to link to other datasets in the future. If a dataset has been enriched, or if derivations, scores and existing tools are available, this section allows providers to indicate this to researchers.

7. observations: Multiple observations about the dataset may be provided and users are expected to provide at least one observation (1..*). We will be supporting the schema.org observation model (https://schema.org/Observation) with default values. Users will be encouraged to provide their own statistical populations as the project progresses.

8. structuralMetadata: Descriptions and details about the tables and columns within a dataset.
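
As a rough illustration of the breakdown above, the sketch below checks a metadata record for the top-level sections named in the list and for the "at least one observation" expectation. The section names come from the list; the helper names (`missing_sections`, `has_observation`), the sample record and the assumption that `observations` is a list are illustrative only. This is not the official validator: the JSON Schema in this repository remains the authority on which fields are required.

```python
from __future__ import annotations

# Top-level sections 1-8, named as in the breakdown above.
SECTIONS = [
    "summary",
    "documentation",
    "coverage",
    "provenance",
    "accessibility",
    "enrichmentAndLinkage",
    "observations",
    "structuralMetadata",
]


def missing_sections(dataset: dict) -> list[str]:
    """Return the listed section names that are absent from the record."""
    return [name for name in SECTIONS if name not in dataset]


def has_observation(dataset: dict) -> bool:
    """Check the 1..* expectation, assuming observations is stored as a list."""
    observations = dataset.get("observations")
    return isinstance(observations, list) and len(observations) >= 1


# Hypothetical, heavily truncated record used only to exercise the helpers.
example = {
    "summary": {},
    "documentation": {},
    "observations": [{}],
}

print(missing_sections(example))
print(has_observation(example))
```

A check like this only confirms that the sections exist; validating the sub-properties and their data types would still need the schema itself.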

3. Metadata Quality Scoring

Once a dataset is onboarded onto the Gateway, a quality check is run on its corresponding JSON schema to produce a weighted quality score. The score is based on two measures: weighted field completeness and weighted field error percentage. The weight of each field can be found here (https://github.com/HDRUK/datasets/tree/master/config/weights) and details of the quality score calculation can be found here (https://github.com/HDRUK/datasets/tree/master/reports#how-scores-are-calculated).
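
To make the two measures more concrete, here is a minimal sketch of how a weighted completeness and a weighted error percentage could be computed. It assumes a flat record and a simple weight-sum ratio. The field names, weights and function names are hypothetical, and the way the Gateway combines the two measures into the final score is deliberately left out; the linked report is the authoritative description.

```python
from __future__ import annotations


def is_populated(value) -> bool:
    """Treat None and empty strings, lists and dicts as missing (an assumption)."""
    return value not in (None, "", [], {})


def weighted_completeness(record: dict, weights: dict) -> float:
    """Percentage of the total weight carried by populated fields."""
    total = sum(weights.values())
    filled = sum(w for field, w in weights.items() if is_populated(record.get(field)))
    return 100.0 * filled / total


def weighted_error_percent(error_fields: set, weights: dict) -> float:
    """Percentage of the total weight carried by fields that failed validation."""
    total = sum(weights.values())
    failed = sum(w for field, w in weights.items() if field in error_fields)
    return 100.0 * failed / total


# Hypothetical weights and record; the real weights live in the linked config.
weights = {"title": 4, "abstract": 4, "keywords": 2}
record = {"title": "Example dataset", "abstract": "", "keywords": ["demo"]}

completeness = weighted_completeness(record, weights)
errors = weighted_error_percent({"keywords"}, weights)
print(completeness, errors)
```

Under this sketch, leaving a heavily weighted field empty lowers completeness more than leaving a lightly weighted one empty, which is the point of weighting the fields.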

Based on the weighted quality score, a dataset is given a medallion rating as follows: