Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

🗺️ Datasets and Experiments #2017

Closed: mikeldking closed this issue 4 months ago

mikeldking commented 10 months ago

As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are:

Motivation

LLM outputs are non-deterministic and teams need a proper way to evaluate the system. With datasets, teams can select a “test suite” of data points that they can evaluate changes on. This allows them to have trust in their application when they make modifications such as:

Use-cases

Datasets will contain data from various data sources:

Pre-deployment

Post-deployment

Architecture

Dataset

A dataset maintains a set of records. These records are versioned: any change to a record (an addition, edit, or deletion) is tracked as a new version. Versions must be immutable so that code depending on a specific version always sees the same data.
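
A minimal sketch of one way to model this immutability, assuming nothing about Phoenix's actual schema (all class and field names below are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Mapping, Tuple

@dataclass(frozen=True)
class Example:
    id: str
    input: Mapping[str, Any]
    output: Mapping[str, Any]

@dataclass(frozen=True)
class DatasetVersion:
    version_id: str
    created_at: datetime
    examples: Tuple[Example, ...]  # immutable snapshot of the records

@dataclass
class Dataset:
    name: str
    versions: list[DatasetVersion] = field(default_factory=list)

    def latest(self) -> DatasetVersion:
        return self.versions[-1]

    def add_example(self, example: Example) -> DatasetVersion:
        # Adding (or editing/deleting) a record never mutates an
        # existing version; it appends a new frozen snapshot, so code
        # pinned to an older version always sees the same data.
        prior = self.versions[-1].examples if self.versions else ()
        version = DatasetVersion(
            version_id=f"v{len(self.versions) + 1}",
            created_at=datetime.now(timezone.utc),
            examples=prior + (example,),
        )
        self.versions.append(version)
        return version
```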

Dataset Examples

A dataset is a set of examples. These examples contain:

In addition to the above, a dataset record should optionally have
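
For illustration, a full example record (required and optional fields together) might serialize roughly as below; the field names are assumptions modeled on the input/output/metadata shape that eventually shipped:

```python
# Hypothetical serialized form of a dataset example.
example = {
    "id": "example-123",                                   # stable identifier
    "input": {"question": "What is Phoenix?"},             # what the app receives
    "output": {"answer": "An AI observability library."},  # expected/reference output
    "metadata": {"source": "span", "tags": ["rag"]},       # optional provenance
}
```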

Dataset Experiment

A dataset experiment is run using the examples of a dataset. Experiments are tied to a specific dataset version and span a duration of time. During an experiment, certain components of the LLM application are modified. These include:
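
Whichever component is varied, a minimal sketch of the flow, using the experiments SDK that eventually shipped (the dataset name, task, and evaluator are illustrative; check the docs for the current run_experiment signature):

```python
import phoenix as px
from phoenix.experiments import run_experiment

def my_llm_app(question: str) -> str:
    # Stand-in for the application component under test,
    # e.g. a new prompt or model.
    return "An AI observability library."

def task(example):
    # Runs the application on a single dataset example.
    return my_llm_app(example.input["question"])

def matches_expected(output, expected):
    # Simple evaluator: compare the task output to the
    # example's expected output.
    return output == expected["answer"]

# Fetching the dataset pins the experiment to the dataset
# version that is current at fetch time.
dataset = px.Client().get_dataset(name="qa-test-suite")

experiment = run_experiment(
    dataset,
    task,
    evaluators=[matches_expected],
    experiment_name="prompt-v2",
)
```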

Planning

Infra

Tables

Rest API

GraphQL

Experiments SDK

OpenInference

UI

Tests

Bugs

Documentation

Punt

mikeldking commented 5 months ago

Note that if a span does not meet certain criteria (e.g., embedding spans), it might make sense to prevent it from being added to a dataset.
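
A sketch of such a gate (the helper below is hypothetical; EMBEDDING is a real OpenInference span kind):

```python
# Hypothetical predicate deciding whether a span may become a dataset
# example; embedding spans lack a meaningful input/output pair.
DISALLOWED_SPAN_KINDS = {"EMBEDDING"}

def can_add_to_dataset(span_attributes: dict) -> bool:
    kind = span_attributes.get("openinference.span.kind")
    return kind not in DISALLOWED_SPAN_KINDS
```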

axiomofjoy commented 5 months ago

> Note that if a span does not meet certain criteria (e.g., embedding spans), it might make sense to prevent it from being added to a dataset.

What other criteria can we think of?

mikeldking commented 5 months ago

As a user, I want to be able to correct an eval if I deem it to be wrong.

mikeldking commented 5 months ago

Thinking that trials can be done by simply repeating an experiment, with the right constraint so that each run produces one generation per example. This would make for a better UX and troubleshooting flow.
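
A hypothetical sketch of that idea, reusing the names from the experiment sketch above; the trial loop is illustrative, not a shipped parameter:

```python
# Approximate N trials by repeating the same experiment against the
# same dataset version; each run yields one generation per example.
N_TRIALS = 3

trials = [
    run_experiment(
        dataset,
        task,
        evaluators=[matches_expected],
        experiment_name=f"prompt-v2-trial-{i}",
    )
    for i in range(N_TRIALS)
]
```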

mikeldking commented 4 months ago

Shipped! 🚢