Enrich example models and datasets before importing into Terarium

liunelson commented 1 week ago

For this task, Julian will need to know what each field in a model AMR means so that he can properly instruct an agent on what to extract and which field to assign it to.

The JSON schema for a PetriNet model AMR is here:

In addition to the schema, here's a description of most fields:

{
  "header": {
    "name": "Name of the model",
    "schema": "url to the JSON schema",
    "description": "Long-form description of the model",
    "schema_name": "petrinet",
    "model_version": "0.1"
  },
  "model": {
    "states": [
      {
        "id": "Shortest, unique symbol or name representing a state variable of the model, compatible with Sympy",
        "name": "Natural language name of the variable",
        "description": "Long-form description",
        "grounding": {
          "identifiers": "Dict, key-value pair = namespace-identifier pair, entry in MIRA domain knowledge graph at http://34.230.33.149:8771/"
        },
        "units": {
          "expression": "unit of measurement of the variable",
          "expression_mathml": "ditto but in MathML"
        }
      }, ...
    ],
    "transitions": [
      {
        "id": "UUID of a transition/process of the model",
        "input": "list of state variable UUIDs that are the inputs of this transition",
        "output": "list of state variable UUIDs that are the outputs of this transition",
        "properties": {
          "name": "Natural language name of this transition",
          "description": "Long-form description"
        }
      }, ...
    ]
  },
  "semantics": {
    "ode": {
      "rates": [
        {
          "target": "UUID of the model transition for which this is the rate function",
          "expression": "Sympy expression of this rate function",
          "expression_mathml": "MathML equivalent of the sympy expression of the rate function"
        }, ...
      ],
      "initials": [
        {
          "target": "UUID of the model variable for which this is the initial condition or value at time = 0",
          "expression": "sympy expression of this initial condition",
          "expression_mathml": "MathML equivalent of the sympy expression of this initial condition"
        }, ...
      ],
      "parameters": [
        {
          "id": "Shortest, unique symbol or name representing a parameter of the model, compatible with Sympy",
          "name": "Natural language name of this parameter",
          "description": "long-form description of this parameter",
          "units": {
            "expression": "Sympy expression of the unit of measurement",
            "expression_mathml": "MathML equivalent of the Sympy expression"
          },
          "value": "floating-point number, expectation value of this parameter",
          "distribution": {
            "type": "identifier of a probability distribution in the probonto ontology, https://github.com/gyorilab/mira/blob/e468059089681c7cd457acc51821b5bd1074df04/mira/dkg/resources/probonto.json",
            "parameters": "Dict, key-value pair = name-value pair, parameters of the probability distribution as in probonto.json"
          }
        }, ...
      ],
      "observables": [
        {
          "id": "Short, unique symbol or name of an observable of the model",
          "name": "Natural language name",
          "states": "List of the `id` of the state variables used in the math expression of this observable",
          "expression": "Sympy expression of this observable",
          "expression_mathml": "MathML equivalent of the Sympy expression"
        }
      ],
      "time": {
        "id": "t",
        "units": {
          "expression": "Sympy expression of the unit of measurement for time in the model",
          "expression_mathml": "MathML equivalent"
        }
      }
    }
  },
  "metadata": "Object that stores TA1/MIT extractions (https://github.com/DARPA-ASKEM/Model-Representations/blob/main/metadata_schema.json) and MIRA annotations (https://github.com/DARPA-ASKEM/experiments/blob/main/thin-thread-examples/mira_v2/biomodels/BIOMD0000000955/model_askenet.json#L597). Use the MIRA schema to ensure that the content is carried through MIRA operations."
}
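
One way to sanity-check an extracted AMR against the field descriptions above is to confirm that the cross-references resolve: each transition's `input`/`output` should name a state `id`, each rate's `target` a transition `id`, and each initial's `target` a state `id`. The following is a minimal sketch rather than the official schema validation; the function name `check_amr_references` and the toy SIR model are hypothetical, and it assumes the layout described above.

```python
def check_amr_references(amr: dict) -> list[str]:
    """Return human-readable problems; an empty list means every reference resolves."""
    problems = []
    model = amr.get("model", {})
    ode = amr.get("semantics", {}).get("ode", {})
    state_ids = {s["id"] for s in model.get("states", [])}
    transition_ids = {t["id"] for t in model.get("transitions", [])}

    # Transitions must only consume and produce known states.
    for t in model.get("transitions", []):
        for key in ("input", "output"):
            for ref in t.get(key, []):
                if ref not in state_ids:
                    problems.append(f"transition {t['id']}: {key} '{ref}' is not a state id")

    # Each rate law targets a transition; each initial condition targets a state.
    for rate in ode.get("rates", []):
        if rate["target"] not in transition_ids:
            problems.append(f"rate target '{rate['target']}' is not a transition id")
    for init in ode.get("initials", []):
        if init["target"] not in state_ids:
            problems.append(f"initial target '{init['target']}' is not a state id")

    # Observables list the state ids used in their expression.
    for obs in ode.get("observables", []):
        for ref in obs.get("states", []):
            if ref not in state_ids:
                problems.append(f"observable {obs['id']}: state '{ref}' is not a state id")
    return problems


# Hypothetical toy SIR model, trimmed to the fields the check reads.
toy_amr = {
    "header": {"name": "Toy SIR", "schema_name": "petrinet", "model_version": "0.1"},
    "model": {
        "states": [{"id": "S"}, {"id": "I"}, {"id": "R"}],
        "transitions": [
            {"id": "inf", "input": ["S", "I"], "output": ["I", "I"]},
            {"id": "rec", "input": ["I"], "output": ["R"]},
        ],
    },
    "semantics": {
        "ode": {
            "rates": [
                {"target": "inf", "expression": "beta*S*I"},
                {"target": "rec", "expression": "gamma*I"},
            ],
            "initials": [
                {"target": "S", "expression": "999"},
                {"target": "I", "expression": "1"},
                {"target": "R", "expression": "0"},
            ],
            "observables": [{"id": "infected", "states": ["I"], "expression": "I"}],
        }
    },
}

print(check_amr_references(toy_amr))  # prints []
```

A check like this only catches dangling references; it does not replace validating the document against the JSON schema itself.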

j2whiting commented 1 week ago

@liunelson We will also need:

[ ] datasets for demo
[ ] description about what needs to be enriched in the dataset, in addition to basic data types & csv format

liunelson commented 1 week ago

@mwdchang got us an example of a dataset card that Terarium expects: https://github.com/DARPA-ASKEM/experiments/blob/nliu-funman/python_sandbox/notebooks/data/monthly_demo_july/dataset-card-example.json

Here are six datasets in decreasing order of priority.

Dataset 1: California Department of Public Health, Cases Deaths Tests

CSV file: https://data.chhs.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state/resource/046cdd2b-31e5-4d34-9ed3-b48cdbc4be7a

Documentation comes in two parts:

"Additional Info" on this page https://data.chhs.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state
Dictionary file https://data.chhs.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state/resource/e6667716-5ec6-499f-aeab-0e085020135a

Dataset 2: California Department of Public Health, COVID-19 Hospital Data

CSV file: https://data.chhs.ca.gov/dataset/covid-19-hospital-data/resource/47af979d-8685-4981-bced-96a6b79d3ed5

Documentation also in two parts:

"Additional Info" on this page https://data.chhs.ca.gov/dataset/covid-19-hospital-data
Dictionary file https://data.chhs.ca.gov/dataset/covid-19-hospital-data/resource/15e2b847-d9c2-4523-bdc1-f22020da079e

Dataset 3: US Department of Health & Human Services, COVID-19 Reported Patient Impact and Hospital Capacity by Facility

CSV file: export or API from here https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u/about_data

Documentation: the text in the same page

Dataset 4: Johns Hopkins Center for Systems Science and Engineering, Unified COVID-19 Dataset (Estimates)

CSV file: COVID-19_Estimates.csv.xz https://github.com/hsbadr/COVID-19_Estimates?tab=readme-ov-file

Documentation: README.md in the repo

Dataset 5: New York Times, Coronavirus (Covid-19) Data in the United States (Archived)

CSV files: us-counties-202*.csv https://github.com/nytimes/covid-19-data

Documentation: README.md in the repo

Dataset 6: Johns Hopkins Center for Systems Science and Engineering, Unified COVID-19 Dataset

Note: This is a really big dataset and probably the most challenging one to process, since it attempts to track many COVID-related time series down to the local level.

CSV file: COVID-19.csv.xz https://github.com/CSSEGISandData/COVID-19_Unified-Dataset?tab=readme-ov-file

Documentation: "Case Types" of README.md in the repo
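
Because of the size of this file, it may be easier to stream it than to load it whole. Below is a minimal sketch using only the Python standard library; the helper name `count_rows_by_column` is hypothetical, and the column name passed to it has to be taken from the README, not from this example.

```python
import csv
import lzma
from collections import Counter


def count_rows_by_column(path: str, column: str) -> Counter:
    """Stream an .xz-compressed CSV and count rows per distinct value of `column`."""
    counts = Counter()
    # lzma.open in text mode decompresses on the fly, so the whole file
    # never has to fit in memory; csv.DictReader yields one row at a time.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts


# Example call (assumption: the file has been downloaded locally and has a
# column with this name; check the README for the real header):
# print(count_rows_by_column("COVID-19.csv.xz", "Type"))
```

The same pattern should work for COVID-19_Estimates.csv.xz in Dataset 4, since it is also an xz-compressed CSV.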

liunelson commented 6 days ago

I found the JSON schema + example for the AMR model metadata section: https://github.com/gyorilab/mira/blob/e468059089681c7cd457acc51821b5bd1074df04/docs/model_metadata_annotation.md?plain=1#L21

j2whiting commented 13 hours ago

@jryu01

We also need to parse out these fields during document -> config extraction:

description: a long-form text description of what the parameter represents
units: the parameter's unit of measurement
distribution: a dictionary that contains the following keys, type and parameters
  type: the distribution type, e.g. cauchy, gaussian, exponential, etc.
  parameters: the parameters that dictate the shape of the distribution. A normal distribution will have parameters "mu" and "sigma" for mean and standard deviation. This field should be a dictionary.
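
To make the expected shape concrete, here is a minimal sketch of one extracted parameter and a check that it carries the fields listed above. The helper name `missing_parameter_fields`, the example values, the plain-string `units`, and the literal `"normal"` type string are all illustrative assumptions; the real `type` identifiers and the final key layout depend on what the extraction and the HMI agree on.

```python
REQUIRED_FIELDS = ("description", "units", "distribution")
REQUIRED_DISTRIBUTION_KEYS = ("type", "parameters")


def missing_parameter_fields(param: dict) -> list[str]:
    """List the fields from the comment above that are absent or malformed."""
    missing = [field for field in REQUIRED_FIELDS if field not in param]
    dist = param.get("distribution")
    if isinstance(dist, dict):
        missing += [
            f"distribution.{key}"
            for key in REQUIRED_DISTRIBUTION_KEYS
            if key not in dist
        ]
        # The comment asks for `parameters` to be a dictionary, e.g. mu/sigma.
        if "parameters" in dist and not isinstance(dist["parameters"], dict):
            missing.append("distribution.parameters (must be a dictionary)")
    elif dist is not None:
        missing.append("distribution (must be a dictionary)")
    return missing


# Hypothetical extraction result for a single parameter.
example_param = {
    "id": "beta",
    "description": "Rate at which susceptible individuals become infected",
    "units": "1/day",  # assumption: a plain string; the AMR itself nests units
    "distribution": {
        "type": "normal",  # assumption: placeholder type name
        "parameters": {"mu": 0.3, "sigma": 0.05},
    },
}

print(missing_parameter_fields(example_param))  # prints []
```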

We will probably need to update this adapter so that it correctly reformats the new keys and values into a format that the HMI expects

https://github.com/DARPA-ASKEM/GoLLM/blob/6728383b01ca99ea6c075df0d0409bc97b08d46b/gollm/utils.py#L86

DARPA-ASKEM / terarium