All datasets in PMLB are accompanied by a metadata.yaml file that describes various characteristics of the dataset. We'd like to add a schema specification that can be used to validate the metadata files for each dataset.
The (current) template used to create metadata.yaml is as follows:
# Created by [your name and/or contact info]
dataset: # required, dataset name
description: # required, dataset description
source: # required, link to the source from where dataset was retrieved
publication: # optional, study that generated the dataset (doi, pmid, pmcid, or url)
task: # required, classification or regression
keywords: # descriptive terms for the dataset, e.g., bioinformatics, images, economics, etc.
  - keyword1 # replace this
  - keyword2 # replace this as well
target:
  type:
  description: # required, describe the endpoint/outcome (and unit if exists)
  code: # optional but recommended, coding information, e.g., 'Control' = 0, 'Case' = 1
features: # list of features in the dataset
  - name: # required, name of feature
    type: # required, either continuous, nominal or ordinal
    description: # optional but recommended, what the feature measures/indicates, unit
    code: # optional, coding information, e.g., 'Control' = 0, 'Case' = 1
    transform: # optional, any transformation performed on the feature, e.g., log scaled
  - name:
    type:
    description:
    code:
    transform:
The current plan is to create the schema as a JSON Schema, which is (almost) fully compatible with YAML documents, as described in https://json-schema-everywhere.github.io/yaml and https://stackoverflow.com/a/44837391/1730417.
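To make the JSON Schema idea concrete, here is a rough sketch of what such a schema might look like for the fields in the metadata.yaml template. It is not the actual PMLB schema. The field names and the required/optional split are taken from the template comments; the names `METADATA_SCHEMA`, `FEATURE_SCHEMA`, and `check`, the string types, and the example values are assumptions made for illustration. The `check` function is a hand-rolled stand-in that understands only a few JSON Schema keywords (`type`, `enum`, `required`, `properties`, `items`), so the example runs with the standard library alone. A real setup would presumably parse the YAML file first and hand the result to a full JSON Schema validator.

```python
import json

# Hypothetical schema for one entry under "features:". Required keys and the
# allowed "type" values follow the template comments; the rest is assumed.
FEATURE_SCHEMA = {
    "type": "object",
    "required": ["name", "type"],
    "properties": {
        "name": {"type": "string"},
        "type": {"enum": ["continuous", "nominal", "ordinal"]},
        "description": {"type": "string"},
    },
}

# Hypothetical top-level schema. Only keys marked "required" in the template
# are listed under "required".
METADATA_SCHEMA = {
    "type": "object",
    "required": ["dataset", "description", "source", "task"],
    "properties": {
        "dataset": {"type": "string"},
        "description": {"type": "string"},
        "source": {"type": "string"},
        "publication": {"type": "string"},
        "task": {"enum": ["classification", "regression"]},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "target": {
            "type": "object",
            "required": ["description"],
            "properties": {"description": {"type": "string"}},
        },
        "features": {"type": "array", "items": FEATURE_SCHEMA},
    },
}

_TYPES = {"object": dict, "array": list, "string": str}


def check(instance, schema, path="metadata"):
    """Return a list of error strings for a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected and not isinstance(instance, _TYPES[expected]):
        return [f"{path}: expected {expected}"]
    errors = []
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{path}: {instance!r} not in {schema['enum']}")
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required key '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(check(instance[key], sub, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(check(item, schema["items"], f"{path}[{i}]"))
    return errors


# Made-up metadata, shaped like the dict a YAML parser would return.
example = {
    "dataset": "example_dataset",
    "description": "A made-up dataset used only to exercise the schema.",
    "source": "https://example.org/data",
    "task": "classification",
    "keywords": ["bioinformatics"],
    "target": {"description": "Case/control status"},
    "features": [{"name": "age", "type": "continuous"}],
}

# A broken copy: invalid task value and no "source" key.
bad = dict(example, task="clustering")
del bad["source"]

# The schema is plain JSON-compatible data, so it can be stored as a .json file.
SCHEMA_JSON = json.dumps(METADATA_SCHEMA, indent=2)

for error in check(bad, METADATA_SCHEMA):
    print(error)
```

Because the schema is ordinary JSON, the same document could be reused by validators in other languages, which is one argument for JSON Schema over an ad hoc check script.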