BioSchemas / specifications

Issue tracker, technical wiki, and example markup
https://bioschemas.org

Discussion: how to describe distribution of a training dataset #631

Open ljgarcia opened 1 year ago

ljgarcia commented 1 year ago

The ELIXIR Machine Learning Focus Group (including the task force on synthetic data) and NFDI4DataScience (and possibly the RDA FAIR4ML IG) are interested in using metadata to describe the distribution of a dataset for ML training purposes (including the DOME recommendations for Data).

During the BioHackathon, the subject was discussed for DOME and Synthetic Data. The current suggestion is to use variableMeasured in combination with PropertyValue for any distributions/subsets of interest in this Dataset, for example attributes/features, classes (if intended for classification training), data points under each class, or the biological sex of the samples. For instance
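A minimal sketch of what such markup might look like, built in Python and serialized as JSON-LD. This is not the example from the BioHackathon discussion: the dataset name, the counts, the unit strings, and the `property_value` helper are all hypothetical and only illustrate the variableMeasured + PropertyValue pattern described above.

```python
import json


def property_value(name, value, unit_text=None, technique=None):
    """Build a schema.org PropertyValue dict (hypothetical helper)."""
    pv = {"@type": "PropertyValue", "name": name, "value": value}
    if unit_text is not None:
        pv["unitText"] = unit_text
    if technique is not None:
        pv["measurementTechnique"] = technique
    return pv


# All names and numbers below are made up for illustration.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example training dataset",
    "variableMeasured": [
        # Number of attributes/features in the dataset
        property_value("features", 20, unit_text="count"),
        # Data points under each class (classification training)
        property_value("class: positive", 600, unit_text="data points"),
        property_value("class: negative", 400, unit_text="data points"),
        # Distribution by biological sex of the samples
        property_value("biological sex: female", 520, unit_text="samples"),
        property_value("biological sex: male", 480, unit_text="samples"),
    ],
}

print(json.dumps(dataset, indent=2))
```

Plain strings are used for name and unitText here because, as noted below, controlled-vocabulary terms are not yet supported for these properties.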

Note: the measurementTechnique, unitText, value, and propertyID could come from a controlled vocabulary, e.g., a DefinedTerm, which is not currently supported. A discussion about extending the coverage of DefinedTerm is ongoing.

Please share your thoughts.