Align `datasets` to `models`

brandomr commented 1 year ago

Challenge

How can we automatically align models to datasets? Specifically, how can we most effectively align elements of a model to features within datasets?

Currently, models and datasets are profiled separately by MIT and SKEMA. Both datasets and models end up having (optional) groundings which--for each feature of the data or model--tie it to an element in the TA2 Domain Knowledge Graph (DKG). As far as I know, DKG code lives here.

For example, a model may have a compartment called infected which is grounded to

{"url":"https://bioregistry.io/vsmo:0000268","score":0.78,"prefix":"vsmo","identifier":"0000268","curie":"vsmo:0000268","name":"infected","status":"name"}

Let's say there is a dataset that has the feature infections which is grounded to

[{"url":"https://bioregistry.io/apollosv:00000114","score":0.78,"prefix":"apollosv","identifier":"00000114","curie":"apollosv:00000114","name":"infection","status":"name"},{"url":"https://bioregistry.io/ido:0000586","score":0.78,"prefix":"ido","identifier":"0000586","curie":"ido:0000586","name":"infection","status":"name"},{"url":"https://bioregistry.io/ncit:C128320","score":0.76,"prefix":"ncit","identifier":"C128320","curie":"ncit:C128320","name":"Infection","status":"name"}]

There is no intersection between these groundings, but clearly there is a relationship between the infected compartment in the model and the infections feature in the dataset. This makes it potentially challenging to identify relevant data for model calibration/simulation, since calibration requires matching data to specific model compartments/elements.
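
To make the mismatch concrete, here is a minimal sketch using the groundings quoted above. The helper names (`shared_curies`, `best_name_similarity`) are hypothetical, and the string similarity from Python's standard `difflib` is only a rough stand-in for whatever semantic measure we would actually adopt:

```python
from difflib import SequenceMatcher

# Groundings copied from the example above (trimmed to the relevant fields).
model_grounding = [
    {"curie": "vsmo:0000268", "name": "infected", "score": 0.78},
]
dataset_groundings = [
    {"curie": "apollosv:00000114", "name": "infection", "score": 0.78},
    {"curie": "ido:0000586", "name": "infection", "score": 0.78},
    {"curie": "ncit:C128320", "name": "Infection", "score": 0.76},
]


def shared_curies(a, b):
    """Return the CURIEs that appear in both grounding lists."""
    return {g["curie"] for g in a} & {g["curie"] for g in b}


def best_name_similarity(a, b):
    """Return the highest case-insensitive string similarity between grounding names.

    Assumption: plain string similarity is used here only for illustration;
    it is not a semantic comparison.
    """
    return max(
        SequenceMatcher(None, x["name"].lower(), y["name"].lower()).ratio()
        for x in a
        for y in b
    )


overlap = shared_curies(model_grounding, dataset_groundings)
similarity = best_name_similarity(model_grounding, dataset_groundings)

print("shared CURIEs:", overlap)
print("best name similarity:", round(similarity, 2))
```

Exact CURIE matching finds nothing in common, while even a crude name comparison shows the two are closely related. That gap is what an alignment approach would need to close.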

Potential Solutions

Embed the groundings for both models and datasets and enable users to perform semantic search over both. This would include embedding dataset and model descriptions. When a user is searching for data relevant to their model, they would use free-text search, powered by a semantic backend, to surface the most useful data.
Create an /align_data_to_model endpoint which, for a given model_id, attempts to find relevant data features on a model-element-to-data-feature basis. For example, an SIR model's susceptible, infected, and recovered compartments would be automatically matched to features (potentially from multiple datasets) and ranked based on groundings or whatever other information we can efficiently use.
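
As a rough illustration of the element-to-feature ranking described above, the sketch below matches each model element against candidate dataset features and returns the top matches. Everything here is an assumption made for illustration: the function name `align_data_to_model` simply mirrors the proposed endpoint, the input shapes and dataset names are made up, and `embed` is a character n-gram stand-in rather than a real embedding model.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Stand-in embedding: a bag of character n-grams.

    Assumption: a real implementation would call an actual embedding model;
    this placeholder only keeps the sketch self-contained.
    """
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align_data_to_model(model_elements, dataset_features, top_k=2):
    """Rank dataset features for each model element by similarity.

    model_elements: {element_name: text describing the element or its grounding}
    dataset_features: [(dataset_id, feature_name, text describing the feature)]
    Returns {element_name: [(dataset_id, feature_name, score), ...]}.
    """
    ranked = {}
    for element, element_text in model_elements.items():
        element_vec = embed(element_text)
        scored = [
            (cosine(element_vec, embed(text)), dataset_id, feature)
            for dataset_id, feature, text in dataset_features
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        ranked[element] = [(d, f, round(s, 3)) for s, d, f in scored[:top_k]]
    return ranked


# Hypothetical SIR model elements and dataset features (grounding names as text).
model_elements = {
    "susceptible": "susceptible",
    "infected": "infected",
    "recovered": "recovered",
}
dataset_features = [
    ("dataset_a", "infections", "infection"),
    ("dataset_a", "date", "date"),
    ("dataset_b", "recoveries", "recovery"),
    ("dataset_b", "susceptible_population", "susceptible population"),
]

alignment = align_data_to_model(model_elements, dataset_features)
for element, matches in alignment.items():
    print(element, "->", matches)
```

Swapping `embed` for a real embedding model would leave the ranking logic unchanged, which is also what would make it easy to try different embedding models.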

The first approach will fit best inside TDS and is something we may want to do anyway. Vector/semantic search over content besides papers seems quite useful. We could even support semantic code search which would be potentially very useful.

The second approach will fit best inside this repository since it mirrors some of the existing endpoints (e.g. aligning a model to its paper).

Considerations

It is likely that we will need multiple examples of models and datasets for testing and development. Here is an example model; these are often referred to as AMRs (ASKEM Model Representations).

Here is an example data card but note that this data card is not in the canonical dataset format for TDS. We can generate/pull some in the appropriate format--but for now at least this helps get a sense of how DKG groundings roughly appear for data.

ryanholtschneider2 commented 1 year ago

During the TA1 working group there were comments on:
- Trying different embedding models: it would be nice to be able to do this. This is fairly easy but I can make it even easier.
- The need for benchmarks - we could think of this as a really fine difficult

Some other conversation points:
- Is there even a good grounding and if so, how much does it help? And if so, can we get to the grounding?
- HMI workflow to grounding usage linking to help the grounding team.
- More complicated model testing.

brandomr commented 1 year ago

Implementing this as an endpoint requires generating embeddings over models and datasets, which will first be addressed by this TDS issue, so this work is currently blocked.

DARPA-ASKEM / knowledge-middleware