braintrustdata / autoevals

AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices.
MIT License

(`autoevals` JS) Better support and documentation for using context-based evaluators in `Eval` run #82

Open mongodben opened 4 months ago

mongodben commented 4 months ago

It could be clearer how to use the evaluators that use "context" in addition to input and output in the Eval run, such as Faithfulness and ContextRelevancy.

Right now, I'm including contexts in the metadata. I only figured this out after a few hours of poking around, since the behavior is undocumented.

Here's an annotated version of my code which worked:

import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
  openAiApiKey,
  model,
};

/**
  Evaluate whether the output is faithful to the model input.
 */
const makeAnswerFaithfulness = function (args: {
  input: string;
  output: string;
  // passing context in metadata
  metadata: { context: string[] };
}) {
  return Faithfulness({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether answer is relevant to the input.
 */
const makeAnswerRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return AnswerRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether context is relevant to the input.
 */
const makeContextRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return ContextRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

const dataset = [
  {
    input: "What is the capital of France",
    tags: ["paris"],
    metadata: {
      // including context in metadata here as well
      context: [
        "The capital of France is Paris.",
        "Berlin is the capital of Germany.",
      ],
    },
    output: "Paris is the capital of France.",
  },
  {
    input: "Who wrote Harry Potter",
    tags: ["harry-potter"],
    metadata: {
      context: [
        "Harry Potter was written by J.K. Rowling.",
        "The Lord of the Rings was written by J.R.R. Tolkien.",
      ],
    },
    output: "J.R.R. Tolkien wrote Harry Potter.",
  },
  {
    input: "What is the largest planet in our solar system",
    tags: ["jupiter"],
    metadata: {
      context: [
        "Jupiter is the largest planet in our solar system.",
        "Saturn has the largest rings in our solar system.",
      ],
    },
    output: "Saturn is the largest planet in our solar system.",
  },
];

function makeGeneratedAnswerReturner(outputs: string[]) {
  // closure over iterator
  let counter = 0;
  return async (_input: string) => {
    // advance the counter so each call returns the next canned output
    counter++;
    return outputs[counter - 1];
  };
}

Eval("mdb-test", {
  experimentName: "rag-metrics",
  metadata: {
    testing: true,
  },
  data: () => {
    return dataset;
  },
  task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
  scores: [makeAnswerFaithfulness, makeContextRelevance],
});

ankrgyl commented 4 months ago

Thanks for filing. Do you have any ideas on how to make it clearer? Currently, in TypeScript, we lean on the type system for this.

mongodben commented 4 months ago

> Do you have any ideas on how to make it clearer?

i think passing the context prop to the evaluators is pretty straightforward, but it could be clearer how to get this from your dataset into the evaluator.

a few ideas:

  1. document a solution similar to what i have done above, so others don't need to figure it out for themselves
  2. include contexts?: string[] as a first-class property for the Data passed to the data constructor function. in typescript something like:
interface Data {
  input: string;
  expected?: string;
  tags?: string[];
  metadata: Record<string, string>;
  // NEW!
  contexts?: string[];
}

the evaluators like Faithfulness and AnswerRelevancy could then read directly from the Data.contexts property.
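
a rough sketch of what that lookup could look like, purely as an illustration. `ProposedData` and `resolveContexts` are made-up names for this example and are not part of braintrust or autoevals; the fallback to `metadata.context` is an assumption that mirrors the workaround above.

```typescript
// Hypothetical row shape mirroring the proposal; NOT the current braintrust API.
interface ProposedData {
  input: string;
  expected?: string;
  tags?: string[];
  metadata?: Record<string, unknown>;
  // proposed first-class field
  contexts?: string[];
}

// Hypothetical helper: prefer the first-class field, then fall back to the
// metadata workaround, then to an empty list.
function resolveContexts(row: ProposedData): string[] {
  if (row.contexts !== undefined) {
    return row.contexts;
  }
  const fromMetadata = row.metadata?.["context"];
  return Array.isArray(fromMetadata)
    ? fromMetadata.filter((c): c is string => typeof c === "string")
    : [];
}

const firstClass = resolveContexts({
  input: "What is the capital of France",
  contexts: ["The capital of France is Paris."],
});
const viaMetadata = resolveContexts({
  input: "Who wrote Harry Potter",
  metadata: { context: ["Harry Potter was written by J.K. Rowling."] },
});
```

with something like this, rows that already carry context in metadata would keep working while new rows could use the dedicated field.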

ankrgyl commented 4 months ago

Ah thanks, yes I think the first is more likely the path -- different evaluators have different conventions/names for additional arguments.
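
If the first option is documented, the metadata pattern from the original post could be shown as one small adapter rather than a hand-written wrapper per evaluator. The following is a minimal sketch under stated assumptions: `withMetadataContext`, `ScorerArgs`, `ContextArgs`, and the stand-in scorer `countContexts` are hypothetical names for illustration and are not part of autoevals or braintrust. It also assumes each row stores its context under `metadata.context`, as in the code above.

```typescript
// Arguments a scorer receives from the eval run (assumed shape, based on the
// wrappers in the original post).
type ScorerArgs = {
  input: string;
  output: string;
  metadata?: { context?: string[] };
};

// Arguments a context-based evaluator expects.
type ContextArgs = { input: string; output: string; context: string[] };

// Hypothetical adapter: lifts `metadata.context` into the `context` argument
// so a context-based scorer can be listed directly in `scores`.
function withMetadataContext<R>(scorer: (args: ContextArgs) => R) {
  return (args: ScorerArgs): R =>
    scorer({
      input: args.input,
      output: args.output,
      // Fall back to an empty list when a row carries no context.
      context: args.metadata?.context ?? [],
    });
}

// Stand-in scorer for illustration only; a real run would pass an evaluator
// such as Faithfulness or ContextRelevancy instead.
const countContexts = (args: ContextArgs) => ({
  name: "ContextCount",
  score: args.context.length > 0 ? 1 : 0,
});

const scoreFromMetadata = withMetadataContext(countContexts);
const result = scoreFromMetadata({
  input: "What is the capital of France",
  output: "Paris is the capital of France.",
  metadata: { context: ["The capital of France is Paris."] },
});
```

The adapter only handles the `context` argument. Since different evaluators use different names for their extra arguments, each convention would likely need its own small adapter.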