Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

🗺️ Datasets and Experiments #2017

Closed: mikeldking closed this issue 4 months ago

mikeldking commented 10 months ago

As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are:

Motivation

LLM outputs are non-deterministic and teams need a proper way to evaluate the system. With datasets, teams can select a “test suite” of data points that they can evaluate changes on. This allows them to have trust in their application when they make modifications such as:

Use-cases

Datasets will contain data from various data sources:

Pre-deployment

Post-deployment

Architecture

Dataset

A dataset maintains a set of records. These records are versioned: any change to a record (an addition, edit, or deletion) is tracked as a new version. Versions must be immutable so that code depending on a specific version always sees the same data.
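
A minimal sketch of one way to model this immutability, assuming nothing about Phoenix's actual schema (all class and field names below are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Mapping, Tuple

@dataclass(frozen=True)
class Example:
    id: str
    input: Mapping[str, Any]
    output: Mapping[str, Any]

@dataclass(frozen=True)
class DatasetVersion:
    version_id: str
    created_at: datetime
    examples: Tuple[Example, ...]  # immutable snapshot of the records

@dataclass
class Dataset:
    name: str
    versions: list[DatasetVersion] = field(default_factory=list)

    def latest(self) -> DatasetVersion:
        return self.versions[-1]

    def add_example(self, example: Example) -> DatasetVersion:
        # Adding (or editing/deleting) a record never mutates an
        # existing version; it appends a new frozen snapshot, so code
        # pinned to an older version always sees the same data.
        prior = self.versions[-1].examples if self.versions else ()
        version = DatasetVersion(
            version_id=f"v{len(self.versions) + 1}",
            created_at=datetime.now(timezone.utc),
            examples=prior + (example,),
        )
        self.versions.append(version)
        return version
```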

Dataset Examples

A dataset is a set of examples. These examples contain:

In addition to the above, a dataset record should optionally have
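
For illustration, a full example record (required and optional fields together) might serialize roughly as below; the field names are assumptions modeled on the input/output/metadata shape that eventually shipped:

```python
# Hypothetical serialized form of a dataset example.
example = {
    "id": "example-123",                                   # stable identifier
    "input": {"question": "What is Phoenix?"},             # what the app receives
    "output": {"answer": "An AI observability library."},  # expected/reference output
    "metadata": {"source": "span", "tags": ["rag"]},       # optional provenance
}
```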

Dataset Experiment

A dataset experiment is run using the examples of a dataset. Experiments are tied to a specific dataset version and span a duration of time. During an experiment, certain components of the LLM application are modified. These include:
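
Whichever component is varied, a minimal sketch of the flow, using the experiments SDK that eventually shipped (the dataset name, task, and evaluator are illustrative; check the docs for the current run_experiment signature):

```python
import phoenix as px
from phoenix.experiments import run_experiment

def my_llm_app(question: str) -> str:
    # Stand-in for the application component under test,
    # e.g. a new prompt or model.
    return "An AI observability library."

def task(example):
    # Runs the application on a single dataset example.
    return my_llm_app(example.input["question"])

def matches_expected(output, expected):
    # Simple evaluator: compare the task output to the
    # example's expected output.
    return output == expected["answer"]

# Fetching the dataset pins the experiment to the dataset
# version that is current at fetch time.
dataset = px.Client().get_dataset(name="qa-test-suite")

experiment = run_experiment(
    dataset,
    task,
    evaluators=[matches_expected],
    experiment_name="prompt-v2",
)
```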

Planning

Infra

Tables

Rest API

GraphQL

Experiments SDK

OpenInference

UI

Tests

Bugs

Documentation

Punt

mikeldking commented 5 months ago

Note that if a span does not meet certain criteria (e.g., embedding spans), it might make sense to prevent it from being added to a dataset.
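
A sketch of such a gate (the helper below is hypothetical; EMBEDDING is a real OpenInference span kind):

```python
# Hypothetical predicate deciding whether a span may become a dataset
# example; embedding spans lack a meaningful input/output pair.
DISALLOWED_SPAN_KINDS = {"EMBEDDING"}

def can_add_to_dataset(span_attributes: dict) -> bool:
    kind = span_attributes.get("openinference.span.kind")
    return kind not in DISALLOWED_SPAN_KINDS
```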

axiomofjoy commented 5 months ago

> Note that if a span does not meet certain criteria (e.g., embedding spans), it might make sense to prevent it from being added to a dataset.

What other criteria can we think of?

mikeldking commented 5 months ago

As a user, I want to be able to correct an eval if I deem it to be wrong.

mikeldking commented 5 months ago

Thinking that trials can be done by simply repeating an experiment, with the right constraint so that each run produces one generation per example. This would make for a better UX and troubleshooting flow.
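
A hypothetical sketch of that idea, reusing the names from the experiment sketch above; the trial loop is illustrative, not a shipped parameter:

```python
# Approximate N trials by repeating the same experiment against the
# same dataset version; each run yields one generation per example.
N_TRIALS = 3

trials = [
    run_experiment(
        dataset,
        task,
        evaluators=[matches_expected],
        experiment_name=f"prompt-v2-trial-{i}",
    )
    for i in range(N_TRIALS)
]
```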

mikeldking commented 4 months ago

Shipped! 🚢