This package is a collection of datasets for evaluating AI Models in the context of Home Assistant. The overall approach is:
```mermaid
graph LR;
A[Synthetic Data Generation]
B[Dataset]
C[Model Evaluation]
D[Synthetic Home]
F[Human Annotation]
G[Visualize Results]
H[OpenAI]
I[Conversation Agent]
J[Local LM]
K[Conversation Agent]
L[Google]
M[Conversation Agent]
A --> B
B --> D
D --> C
C --> F
F --> G
H --> I
J --> K
L --> M
I --> C
K --> C
M --> C
I --> D
K --> D
M --> D
```
A longer-term goal is to build datasets that can also be used for training.
See the datasets README for details on the available datasets, including home descriptions, area descriptions, device descriptions, and summaries that can be performed on a home.
The device level datasets are defined using the Synthetic Home format including its device registry of synthetic devices.
See the generation README for more details on how synthetic data generation with LLMs works. The data is generated from a small amount of seed example data plus a prompt, then persisted. Synthetic data generation is run with Jupyter notebooks.
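The core of the generation step is assembling a few-shot prompt from the seed examples. The sketch below illustrates that shape only; the function and field names are hypothetical, not the repo's actual API.

```python
# Illustrative sketch of the prompt-assembly step in synthetic data
# generation: task instructions plus a few seed examples, formatted as a
# single few-shot prompt. Names here are hypothetical.

def build_generation_prompt(instructions: str, seed_examples: list[str]) -> str:
    """Combine task instructions with few-shot seed examples into one prompt."""
    parts = [instructions, ""]
    for i, example in enumerate(seed_examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(example)
        parts.append("")
    parts.append("Now generate a new, distinct example in the same format:")
    return "\n".join(parts)


prompt = build_generation_prompt(
    instructions="Generate a synthetic smart home description as YAML.",
    seed_examples=[
        "name: Cozy Cottage\ncountry_code: US\ntype: house",
        "name: City Loft\ncountry_code: DE\ntype: apartment",
    ],
)
```

The resulting prompt is sent to the LLM, and each response is persisted as a new dataset record.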
```mermaid
classDiagram
direction LR
Home <|-- Area
Area <|-- Device
Device <|-- EntityState
class Home{
+String name
+String country_code
+String location
+String type
}
class Area {
+String name
}
class Device {
+String name
+String device_type
+String model
+String mfg
+String sw_version
}
class EntityState {
+String state
}
```
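Reading the arrows in the diagram above as containment (a home has areas, an area has devices, a device has entity states), the data model can be sketched as Python dataclasses. This mirrors the diagram only; the real Synthetic Home format may differ.

```python
# Sketch of the Home/Area/Device/EntityState data model as dataclasses.
# Containment is modeled as nested lists; field names follow the diagram.
from dataclasses import dataclass, field


@dataclass
class EntityState:
    state: str


@dataclass
class Device:
    name: str
    device_type: str
    model: str
    mfg: str
    sw_version: str
    entity_states: list[EntityState] = field(default_factory=list)


@dataclass
class Area:
    name: str
    devices: list[Device] = field(default_factory=list)


@dataclass
class Home:
    name: str
    country_code: str
    location: str
    type: str
    areas: list[Area] = field(default_factory=list)


# Example instance: a one-room home with a single light.
home = Home(
    name="Cozy Cottage",
    country_code="US",
    location="Rural area",
    type="house",
    areas=[
        Area(
            name="Living Room",
            devices=[
                Device(
                    name="Ceiling Light",
                    device_type="light",
                    model="A19",
                    mfg="Example Corp",
                    sw_version="1.2.3",
                    entity_states=[EntityState(state="off")],
                )
            ],
        )
    ],
)
```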
You can use the generated synthetic data in Home Assistant with integrated conversation agents to produce outputs for evaluation.
Model evaluation is currently performed with pytest, Synthetic Home, and any conversation agent (OpenAI, Google, custom components, etc.).
The most common evaluation covers the Home Assistant conversation agent actions used to integrate with the assist pipeline. See the following dataset directories for more information on running an evaluation:
Models are configured in models.yaml.
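At its core, this kind of evaluation pairs a user utterance with an expected action and scores the agent on how often it matches. The stand-alone sketch below shows that shape; the repo's real harness runs under pytest against actual conversation agents, and the task data and agent here are stand-ins.

```python
# Illustrative evaluation loop over conversation-agent outputs: tasks with
# expected actions, an agent callable, and an exact-match accuracy score.
# fake_agent is a toy stand-in, not a real integration.

EVAL_TASKS = [
    {"input": "turn on the kitchen light", "expected": "light.turn_on"},
    {"input": "turn off the kitchen light", "expected": "light.turn_off"},
    {"input": "lock the front door", "expected": "lock.lock"},
]


def fake_agent(utterance: str) -> str:
    """Stand-in conversation agent mapping keywords to service calls."""
    if "lock" in utterance:
        return "lock.lock"
    if "turn on" in utterance:
        return "light.turn_on"
    return "light.turn_off"


def evaluate(agent, tasks) -> float:
    """Return the fraction of tasks where the agent matched the expected action."""
    correct = sum(1 for task in tasks if agent(task["input"]) == task["expected"])
    return correct / len(tasks)


score = evaluate(fake_agent, EVAL_TASKS)
print(f"accuracy: {score:.0%}")  # prints "accuracy: 100%" on this toy set
```

In the real harness, each dataset directory supplies the tasks and the configured model from models.yaml plays the agent role.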
There are additional datasets for human evaluation of summarization tasks; these were the initial use case for this repo. It works something like this: model outputs are collected, scored by a human rater, and the scores are used to determine model quality.
Human rater (me) scores the result quality:
See the script/ directory for more details on preparing the data for human eval procedure using Doccano.
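Doccano can import JSONL records with a "text" field for annotation, so preparing the data amounts to serializing each model output as one JSONL line. The sketch below shows that idea only; the record fields are illustrative, and the repo's actual script may differ.

```python
# Sketch of converting model outputs into Doccano-importable JSONL:
# one record per line, the rated text under "text", and metadata kept
# under "meta" for later analysis. Field contents are illustrative.
import json

model_outputs = [
    {"model": "model-a", "task": "summarize home", "output": "A cozy 2-bedroom house."},
    {"model": "model-b", "task": "summarize home", "output": "House with two bedrooms."},
]

lines = []
for record in model_outputs:
    doc = {
        "text": record["output"],
        "meta": {"model": record["model"], "task": record["task"]},
    }
    lines.append(json.dumps(doc))

jsonl = "\n".join(lines)
```

The resulting file is uploaded to Doccano, where the rater assigns a quality label to each record.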