EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

Add JSON Schema file to validate `metadata.yaml` #108

Closed: JDRomano2 closed this issue 4 years ago

JDRomano2 commented 4 years ago

Every dataset in PMLB is accompanied by a `metadata.yaml` file that describes the dataset: its source, task, target, and features. We'd like to add a schema specification against which each dataset's metadata file can be validated.

The current plan is to write the schema as a JSON Schema, which is (almost) fully compatible with YAML documents, as described in https://json-schema-everywhere.github.io/yaml and https://stackoverflow.com/a/44837391/1730417.
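
To illustrate where the "almost" can matter, here is a minimal sketch, assuming the PyYAML and `jsonschema` Python packages (neither is named in this issue). The one-field schema and the `retrieved` key are hypothetical and are not part of the PMLB template. The point is that the validator runs on the parsed YAML, and YAML can produce values that are not JSON types, such as dates:

```python
import yaml
from jsonschema import Draft7Validator

# Hypothetical one-field schema, for illustration only.
schema = {
    "type": "object",
    "properties": {"retrieved": {"type": "string"}},
}
validator = Draft7Validator(schema)

# PyYAML resolves an unquoted ISO date to datetime.date, which is not a JSON string.
unquoted = yaml.safe_load("retrieved: 2020-01-15")
# Quoting the scalar keeps it a plain string.
quoted = yaml.safe_load('retrieved: "2020-01-15"')

unquoted_ok = validator.is_valid(unquoted)
quoted_ok = validator.is_valid(quoted)

print(type(unquoted["retrieved"]).__name__, unquoted_ok)  # date False
print(type(quoted["retrieved"]).__name__, quoted_ok)      # str True
```

For plain strings, numbers, lists, and mappings, the parsed YAML is indistinguishable from parsed JSON, so the schema applies unchanged.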

The (current) template used to create metadata.yaml is as follows:

# Created by [your name and/or contact info]
dataset: # required, dataset name
description: # required, dataset description
source: # required, link to the source from where dataset was retrieved
publication: # optional, study that generated the dataset (doi, pmid, pmcid, or url)
task: # required, classification or regression
keywords: # descriptive terms for the dataset, e.g., bioinformatics, images, economics, etc.
  - keyword1 # replace this
  - keyword2 # replace this as well
target:
  type:
  description: # required, describe the endpoint/outcome (and unit if exists)
  code: # optional but recommended, coding information, e.g., 'Control' = 0, 'Case' = 1
features: # list of features in the dataset
  - name: # required, name of feature
    type: # required, either continuous, nominal or ordinal
    description: # optional but recommended, what the feature measures/indicates, unit
    code: # optional, coding information, e.g., 'Control' = 0, 'Case' = 1
    transform: # optional, any transformation performed on the feature, e.g., log scaled
  - name:
    type:
    description:
    code:
    transform:
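
As a starting point for discussion, the template above could be translated into a schema roughly as follows. This is a sketch, not the final schema: it assumes the PyYAML and `jsonschema` Python packages, draft-07 of JSON Schema, and plain strings for every free-text field. It treats `target` as required because its `description` is marked required, and leaves `features` and `keywords` optional because the template does not say either way. The names `METADATA_SCHEMA` and `metadata_errors` are made up for this example.

```python
import yaml
from jsonschema import Draft7Validator

# Sketch of a schema derived from the template comments (assumptions noted above).
METADATA_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["dataset", "description", "source", "task", "target"],
    "properties": {
        "dataset": {"type": "string"},
        "description": {"type": "string"},
        "source": {"type": "string"},
        "publication": {"type": "string"},
        "task": {"enum": ["classification", "regression"]},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "target": {
            "type": "object",
            "required": ["description"],
            "properties": {
                "type": {"type": "string"},
                "description": {"type": "string"},
                "code": {"type": "string"},
            },
        },
        "features": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "type"],
                "properties": {
                    "name": {"type": "string"},
                    "type": {"enum": ["continuous", "nominal", "ordinal"]},
                    "description": {"type": "string"},
                    "code": {"type": "string"},
                    "transform": {"type": "string"},
                },
            },
        },
    },
}


def metadata_errors(yaml_text):
    """Return a sorted list of 'path: message' strings; empty if the document is valid."""
    doc = yaml.safe_load(yaml_text)
    validator = Draft7Validator(METADATA_SCHEMA)
    errors = []
    for err in validator.iter_errors(doc):
        # absolute_path holds the keys/indices leading to the offending value.
        path = "/".join(str(part) for part in err.absolute_path) or "<root>"
        errors.append(f"{path}: {err.message}")
    return sorted(errors)


# Made-up metadata used only to exercise the sketch.
EXAMPLE = """
dataset: example
description: A made-up dataset
source: https://example.org/data
task: classification
target:
  description: case/control status
features:
  - name: age
    type: continuous
"""

print(metadata_errors(EXAMPLE))
```

One thing to decide before adopting something like this: a key left blank in the template (for example `publication:` with no value) parses as null, which fails a `"type": "string"` check. Either unused optional keys are deleted from each file, or the schema allows null for them.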