Thought 1: Use Ontologies and a Triple Store DB
Example: Semantic concept schema of the linear mixed model of experimental observations (https://doi.org/10.1038/s41597-020-0409-7)
"In this paper, we propose a semantic model for the statistical analysis of datasets by linear mixed models. We tie together disparate statistical concepts in an interdisciplinary context through the application of ontologies, in particular the Statistics Ontology (StatO), to produce FaIR data summaries. "
Here the STATO ontology (STATistical Methods Ontology, http://stato-ontology.org/) is extended.
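As a rough illustration (not taken from the paper), the metadata of a calibration run could be stored as RDF triples and pushed into a triple store. The namespace and property names below are invented for this sketch and are not actual STATO terms:

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Hypothetical vocabulary for our calibration metadata (not STATO itself)
CAL = Namespace("https://example.org/calibration#")

g = Graph()
g.bind("cal", CAL)

run = URIRef("https://example.org/runs/run_017")
g.add((run, RDF.type, CAL.CalibrationRun))
g.add((run, CAL.modelName, Literal("linear_elastic_beam")))
g.add((run, CAL.usedPrior, Literal("lognormal(E, mu=30 GPa, sd=5 GPa)")))
g.add((run, CAL.usedSensor, Literal("strain_gauge_03")))

# Turtle serialization that could be loaded into any triple store
print(g.serialize(format="turtle"))
```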
Example: Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering (DOI: 10.1109/WORKS49585.2019.00006)
"If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stackholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. "
Here the data are represented using W3C PROV / PROV-ML.
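A minimal sketch of how one calibration run could be described with W3C PROV, using the Python `prov` package; all identifiers are made up for illustration:

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "https://example.org/calibration/")

# Entities: experimental data and prior go in, posterior samples come out
data = doc.entity("ex:experiment_042_measurements")
prior = doc.entity("ex:prior_youngs_modulus")
posterior = doc.entity("ex:posterior_samples_run_017")

# The calibration itself is an activity carried out by some agent
calibration = doc.activity("ex:calibration_run_017")
analyst = doc.agent("ex:analyst")

doc.used(calibration, data)
doc.used(calibration, prior)
doc.wasGeneratedBy(posterior, calibration)
doc.wasAssociatedWith(calibration, analyst)

# Human-readable PROV-N; doc.serialize() gives PROV-JSON instead
print(doc.get_provn())
```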
Thought 2: Using a Pipeline Framework
I would argue that we might need both approaches. The first one allows us to store general metadata so that we can query for the specific calibration process we performed (assuming we have done many different calibrations, e.g. using different priors, different sensors, different queries/experiments); we need a unique description to distinguish all of those, both to put them back into the database and to be able to query them afterwards. However, I think it is simply not possible to include everything; in particular, storing just the query is not sufficient, since the database itself might change and would then return different results. Therefore we would also have to store the complete process in a workflow system that allows us to document every single input/output of the whole workflow (and even within the modules of the workflow).
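To make the second point a bit more concrete, here is a minimal, framework-independent sketch of what capturing the inputs/outputs of a single workflow step could look like. Every step records content hashes of its input and output files plus its parameters, so a run can still be identified even if the underlying database changes later (file names and parameters below are invented):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a file so that changed inputs can be detected later."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_step(name, inputs, outputs, params):
    """Build a provenance record for one workflow step (illustrative only)."""
    return {
        "step": name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
        "inputs": {str(p): file_hash(p) for p in inputs},
        "outputs": {str(p): file_hash(p) for p in outputs},
    }


# Example usage after a calibration step has written its results:
record = record_step(
    "bayesian_calibration",
    inputs=[Path("measurements.csv"), Path("prior.json")],
    outputs=[Path("posterior_samples.csv")],
    params={"sampler": "MCMC", "n_samples": 10000},
)
Path("provenance.json").write_text(json.dumps(record, indent=2))
```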
Maybe we should first ask ourselves what information we would like to query afterwards. I created an entry in the wiki as a basis for our discussion today: https://github.com/BAMresearch/ModelCalibration/wiki
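For instance, if every run is stored as triples as in the sketch under Thought 1, the question "which calibrations used this prior and this sensor?" could be answered with a SPARQL query (the vocabulary and the file name are again invented):

```python
from rdflib import Graph

g = Graph()
g.parse("calibration_runs.ttl", format="turtle")  # hypothetical dump of all stored runs

query = """
PREFIX cal: <https://example.org/calibration#>
SELECT ?run ?model WHERE {
    ?run a cal:CalibrationRun ;
         cal:usedPrior  "lognormal(E, mu=30 GPa, sd=5 GPa)" ;
         cal:usedSensor "strain_gauge_03" ;
         cal:modelName  ?model .
}
"""
for row in g.query(query):
    print(row.run, row.model)
```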
Outdated due to the change in focus towards lebedigital.
The issue is how to document/store information about the parametrization process of the computer model. There are different types of information: