Discussion: how to describe the distribution of a training dataset

The ELIXIR Machine Learning Focus Group (including the task force on synthetic data) and NFDI4DataScience (and possibly the RDA FAIR4ML IG) are interested in using metadata to describe the distribution of a dataset for ML training purposes (including the DOME recommendations for Data).

During the BioHackathon, the subject was discussed for DOME and Synthetic Data. The current suggestion is to use variableMeasured in combination with PropertyValue for any distribution or subset of interest in the Dataset, for example attributes/features, classes (if intended for classification training), data points under each class, or the biological sex of the samples. For instance:

Data splits: [{unitText: "Training", referenceValue: {unitText: "Positive", value: 40000}, measurementTechnique: "Splits"}, {unitText: "Validation", referenceValue: {unitText: "Positive", value: 5000}, measurementTechnique: "Splits"}]
- Note: the reference value refers to the classes defined (if available)
Data classes: [{unitText: "Positive", value: 75000, measurementTechnique: "Classes"}, {unitText: "Negative", value: 15000, measurementTechnique: "Classes"}]
- Note: the full size/number of records would be needed to detect, e.g., overlaps
Biological sex: {unitText: "Biological sex", propertyID: "http://purl.obolibrary.org/obo/PATO_0000047", value: "female"} or {unitText: "Biological sex", propertyID: "http://purl.obolibrary.org/obo/PATO_0000047", referenceValue: {unitText: "Female", value: 30000}}
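
To make the suggestion easier to discuss, here is a minimal Python sketch that assembles the three examples into one JSON-LD Dataset. The property names (unitText, referenceValue, measurementTechnique, propertyID) are copied from the examples as written. The helper `property_value`, the dataset name, the `@context`, and the choice to type the nested reference value as a PropertyValue are assumptions made for illustration, not part of any agreed profile.

```python
import json

# IRI taken from the biological sex example in this discussion.
PATO_SEX = "http://purl.obolibrary.org/obo/PATO_0000047"


def property_value(unit_text, technique=None, value=None,
                   reference=None, property_id=None):
    """Build a PropertyValue-style dict, leaving out fields that are not set.

    Hypothetical helper: the key names follow the examples in this thread.
    """
    node = {"@type": "PropertyValue", "unitText": unit_text}
    if property_id is not None:
        node["propertyID"] = property_id
    if value is not None:
        node["value"] = value
    if reference is not None:
        node["referenceValue"] = reference
    if technique is not None:
        node["measurementTechnique"] = technique
    return node


# Data splits: each split points at the class it counts via referenceValue.
splits = [
    property_value("Training", "Splits",
                   reference=property_value("Positive", value=40000)),
    property_value("Validation", "Splits",
                   reference=property_value("Positive", value=5000)),
]

# Data classes: number of data points per class.
classes = [
    property_value("Positive", "Classes", value=75000),
    property_value("Negative", "Classes", value=15000),
]

# Biological sex: the counted variant (second form in the example).
sex = property_value("Biological sex", property_id=PATO_SEX,
                     reference=property_value("Female", value=30000))

dataset = {
    "@context": "https://schema.org",      # assumed context
    "@type": "Dataset",
    "name": "Example training dataset",    # hypothetical name
    "variableMeasured": splits + classes + [sex],
}

print(json.dumps(dataset, indent=2))
```

All entries share one variableMeasured list here, so measurementTechnique is the only thing that separates splits from classes. Whether that is enough, or whether a dedicated grouping is needed, is part of the open question.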

Note: measurementTechnique, unitText, value, and propertyID could come from a controlled vocabulary, e.g., a DefinedTerm, which is not currently supported. A discussion about extending the coverage of DefinedTerm is ongoing.
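
As a talking point only, the sketch below shows roughly what a DefinedTerm-backed entry might look like if that extension were adopted. It is hypothetical: placing a DefinedTerm where unitText currently holds free text is exactly the unsupported case mentioned in the note. The helper `defined_term` and the CURIE-style termCode are assumptions for illustration.

```python
import json


def defined_term(name, url, term_code=None):
    """Build a DefinedTerm-style dict (hypothetical helper)."""
    node = {"@type": "DefinedTerm", "name": name, "url": url}
    if term_code is not None:
        node["termCode"] = term_code
    return node


# Term built from the PATO IRI already used in the biological sex example.
sex_term = defined_term(
    "Biological sex",
    "http://purl.obolibrary.org/obo/PATO_0000047",
    term_code="PATO:0000047",  # assumed CURIE form of the same IRI
)

entry = {
    "@type": "PropertyValue",
    # Hypothetical: a DefinedTerm in place of the free-text unitText.
    "unitText": sex_term,
    "referenceValue": {"unitText": "Female", "value": 30000},
}

print(json.dumps(entry, indent=2))
```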

Please share your thoughts.
