Note that if a span does not meet certain criteria (e.g., embedding-only spans), it might make sense to prevent it from being added to a dataset.
What other criteria can we think of?
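One way to enforce such criteria could be a simple eligibility predicate. A minimal sketch, illustrative only: the `Span` shape is simplified, the attribute keys (`input.value`, `output.value`, `embedding.embeddings`) follow OpenInference-style conventions, and the criteria themselves are assumptions, not a decided policy.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified span shape for illustration; not the actual schema."""
    name: str
    attributes: dict = field(default_factory=dict)


def is_dataset_eligible(span: Span) -> bool:
    """Hypothetical criteria for whether a span can be added to a dataset."""
    attrs = span.attributes
    # Embedding-only spans carry no input/output text, so they make
    # poor dataset examples and are skipped.
    if "embedding.embeddings" in attrs:
        return False
    # A usable example needs at least an input and an output.
    return "input.value" in attrs and "output.value" in attrs
```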
As a user, I want to be able to correct an eval if I deem it to be wrong.
Thinking trials can be done by just repeating an experiment and adding the right constraint on generations per example. This would make for a better UX and troubleshooting flow.
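For instance, trials could amount to a repeated experiment with a per-example generation count. A minimal sketch with invented names (`run_with_trials`, `n_trials`), not a committed API:

```python
def run_with_trials(examples, task, n_trials=3):
    """Repeat `task` n_trials times per example so output variance shows up.

    `task` is any callable that produces an output from an example's input;
    all repetitions land in the same experiment for side-by-side comparison.
    """
    results = []
    for example in examples:
        for trial in range(n_trials):
            results.append({
                "example": example,
                "trial": trial,
                "output": task(example["input"]),
            })
    return results
```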
Shipped! 🚢
As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are covered under Use-cases below.
Motivation
LLM outputs are non-deterministic, and teams need a proper way to evaluate the system. With datasets, teams can select a “test suite” of data points that they can evaluate changes on. This lets them trust their application as they make modifications.
Use-cases
Datasets will contain data from a variety of sources:
Pre-deployment
Post-deployment
Architecture
Dataset
A dataset maintains a set of records. These records are versioned: whenever a record is added, edited, or deleted, the change is tracked as a new version. Versions must be immutable so that code depending on a specific version always sees the same data.
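A minimal sketch of that invariant, assuming a copy-on-write design where every mutation appends a new immutable version (illustrative, not the actual implementation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """An immutable snapshot: code pinned to a version always sees the same data."""
    version: int
    examples: tuple  # frozen sequence of example dicts


class Dataset:
    """Copy-on-write dataset: every mutation appends a new immutable version."""

    def __init__(self):
        self._versions = [DatasetVersion(version=0, examples=())]

    @property
    def latest(self) -> DatasetVersion:
        return self._versions[-1]

    def add_example(self, example: dict) -> DatasetVersion:
        # Adds/edits/deletes never touch prior versions; they create a new one.
        new = DatasetVersion(
            version=self.latest.version + 1,
            examples=self.latest.examples + (example,),
        )
        self._versions.append(new)
        return new
```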
Dataset Examples
A dataset is a set of examples. These examples contain:
In addition to the above, a dataset record should optionally have:
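As a strawman, an example record might look like the following; since the exact fields were elided above, `input`, `output`, and `metadata` here are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetExample:
    """Hypothetical example shape; the exact fields in the issue were elided."""
    input: dict                                   # what the application is invoked with
    output: dict                                  # the expected (or recorded) result
    metadata: dict = field(default_factory=dict)  # optional, e.g. the source span ID
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```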
Dataset Experiment
A dataset experiment is run using the examples of a dataset. Experiments are tied to a specific dataset version and span a duration of time. During an experiment, one or more components of the LLM application are modified.
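A sketch of how an experiment could pin itself to a dataset version and record its duration, reusing the `Dataset` sketch above; all names here are invented for illustration:

```python
import time
import uuid


def run_experiment(dataset, task, evaluators=()):
    """Run `task` over a pinned dataset version and record the duration.

    Pinning the version means later edits to the dataset cannot silently
    change what this experiment measured.
    """
    version = dataset.latest  # immutable snapshot from the Dataset sketch above
    started_at = time.time()
    runs = []
    for example in version.examples:
        output = task(example["input"])
        scores = {ev.__name__: ev(output, example) for ev in evaluators}
        runs.append({"example": example, "output": output, "scores": scores})
    return {
        "id": str(uuid.uuid4()),
        "dataset_version": version.version,
        "duration_s": time.time() - started_at,
        "runs": runs,
    }
```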
Planning
Infra
Tables
REST API
GraphQL
Experiments SDK
OpenInference
UI
Tests
Bugs
Documentation
Punt