brandomr opened 1 year ago
During the TA1 working group there were several comments:

- It would be nice to be able to try different embedding models. This is fairly easy, but I can make it even easier.
- We need benchmarks; we could think of this as a genuinely difficult problem.
- Is the grounding even good, and if so, how much does it help? And if so, can we get at the grounding?
- An HMI workflow linking grounding usage back to the grounding team would help them.
- More complicated model testing.
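On the "try different embedding models" point, one way to make swapping easy is to put the embedding backends behind a registry keyed by model name. A minimal sketch — the registry, function names, and the toy embedder are all hypothetical, not an existing TDS/SKEMA API; in practice each entry would wrap e.g. a sentence-transformers model:

```python
from typing import Callable, Dict, List

# Hypothetical registry of embedding backends; each maps a list of texts
# to a list of vectors. Real entries would wrap actual embedding models.
EMBEDDERS: Dict[str, Callable[[List[str]], List[List[float]]]] = {}

def register_embedder(name: str):
    """Decorator that registers an embedding function under a model name."""
    def wrap(fn):
        EMBEDDERS[name] = fn
        return fn
    return wrap

@register_embedder("toy-hash")
def toy_hash_embedder(texts: List[str]) -> List[List[float]]:
    # Deterministic toy embedding: character-frequency vector over a-z,
    # just so the sketch is runnable without heavy dependencies.
    vecs = []
    for t in texts:
        v = [0.0] * 26
        for ch in t.lower():
            if "a" <= ch <= "z":
                v[ord(ch) - 97] += 1.0
        vecs.append(v)
    return vecs

def embed(texts: List[str], model: str = "toy-hash") -> List[List[float]]:
    # Swapping embedding models becomes a one-argument change.
    return EMBEDDERS[model](texts)
```

With this shape, trying a different embedding model is just `embed(texts, model="some-other-model")` once that backend is registered.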
Implementing this as an endpoint requires generating embeddings over models and datasets, which will first be addressed by this TDS issue; this work is currently blocked on that.
## Challenge
How can we automatically align models to datasets? Specifically, how can we most effectively align elements of a model to features within datasets?
Currently, models and datasets are profiled separately by MIT and SKEMA. Both datasets and models end up having (optional) `groundings`, which, for each feature of the data or model, tie it to an element in the TA2 Domain Knowledge Graph (DKG). As far as I know, the DKG code lives here.

For example, a model may have a compartment called `infected`, which is grounded to some DKG term. Let's say there is a dataset that has the feature `infections`, which is grounded to a different DKG term. There is no intersection between these groundings, but clearly there is a relationship between the `infected` compartment in the model and the `infections`
feature in the dataset. This makes it potentially challenging to identify relevant data to use for model calibration/simulation, since for calibration you must match data to specific model compartments/elements.
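The mismatch can be made concrete with a toy example. The identifier strings below are placeholders, not real DKG groundings; the point is only that disjoint identifier sets can still belong to clearly related names:

```python
from difflib import SequenceMatcher

# Illustrative groundings: each name maps to a set of DKG-style ontology
# identifiers. The identifier strings are made-up placeholders.
model_grounding = {"infected": {"ido:0000000"}}
dataset_grounding = {"infections": {"ncit:C0000000"}}

# The identifier sets do not intersect...
shared = model_grounding["infected"] & dataset_grounding["infections"]

# ...yet the names are obviously related; even naive string
# similarity picks this up.
name_similarity = SequenceMatcher(None, "infected", "infections").ratio()
```

Here `shared` is empty while `name_similarity` is well above 0.5, which is exactly the gap an alignment service would need to bridge.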
## Potential Solutions

1. A vector/semantic search endpoint inside TDS over model and dataset content (not just papers).
2. An `/align_data_to_model` endpoint which, for a given `model_id`, attempts to find relevant data features on a model element to data feature basis. For example, an SIR model's `susceptible`, `infected`, and `recovered` compartments would be automatically matched and ranked against features (potentially from multiple datasets) based on groundings or whatever other information we can efficiently use.

The first approach will fit best inside TDS and is something we may want to do anyway. Vector/semantic search over content besides papers seems quite useful; we could even support semantic code search, which would be potentially very useful.

The second approach will fit best inside this repository, since it mirrors some of the existing endpoints (e.g. aligning a model to its paper).
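A rough sketch of what the `/align_data_to_model` matching logic might look like: rank every dataset feature against each model element by cosine similarity of name embeddings. The embedding function here is a trivial stand-in, and `align_data_to_model` is a hypothetical helper — a real implementation would use the groundings plus a proper embedding model:

```python
import math
from typing import Dict, List, Tuple

def toy_embed(text: str) -> List[float]:
    # Stand-in embedding: character-frequency vector over a-z.
    v = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - 97] += 1.0
    return v

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def align_data_to_model(
    model_elements: List[str], data_features: List[str]
) -> Dict[str, List[Tuple[str, float]]]:
    # For each model element, rank all dataset features by similarity,
    # best match first.
    out: Dict[str, List[Tuple[str, float]]] = {}
    for el in model_elements:
        ev = toy_embed(el)
        out[el] = sorted(
            ((f, cosine(ev, toy_embed(f))) for f in data_features),
            key=lambda pair: pair[1],
            reverse=True,
        )
    return out
```

For an SIR model, `align_data_to_model(["susceptible", "infected", "recovered"], ["infections", "cases", "recoveries"])` would return a ranked list per compartment, with `infected` matching `infections` at the top — the same shape of output the endpoint could expose, whatever scoring function we end up using.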
## Considerations
It is likely that we will need multiple examples of `models` and `datasets` for testing and development. Here is an example model in the format often referred to as an `AMR`: ASKEM Model Representation. Here is an example data card, but note that this data card is not in the canonical dataset format for TDS. We can generate/pull some in the appropriate format, but for now at least this helps give a sense of how DKG groundings roughly appear for data.