braintrustdata / autoevals

AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices.
MIT License

(`autoevals` JS): Better support for evaluating based on pre-generated answer #84

Closed: mongodben closed this issue 2 months ago

mongodben commented 2 months ago

Currently, it's less than straightforward to run evals if the answer is pre-generated or based on case-specific data beyond the input.

This is because the Eval's task() function only accepts the input string as an argument.
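
To illustrate the constraint, a task today looks roughly like this (my paraphrase of the shape used in the snippet below, not the SDK's exact type declaration):

// task() is invoked once per data item and only receives the input string,
// so pre-generated outputs or per-case context can't be passed in directly.
const task = async (input: string): Promise<string> => {
  return `some generated answer for: ${input}`;
};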

I think it's important to be able to evaluate against pre-generated outputs so that we can decouple the evaluation stage (in Braintrust) from the dataset generation stage, which doesn't necessarily require Braintrust.

Here's my current implementation, which relies on creating a closure over the task() function to iterate through pre-generated responses:

import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
  openAiApiKey,
  model,
};
/**
  Evaluate whether the output is faithful to the provided context.
 */
const makeAnswerFaithfulness = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return Faithfulness({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether the answer is relevant to the input.
 */
const makeAnswerRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return AnswerRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether the context is relevant to the input.
 */
const makeContextRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return ContextRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

const dataset = [
  {
    input: "What is the capital of France",
    tags: ["paris"],
    metadata: {
      context: [
        "The capital of France is Paris.",
        "Berlin is the capital of Germany.",
      ],
    },
    output: "Paris is the capital of France.",
  },
  {
    input: "Who wrote Harry Potter",
    tags: ["harry-potter"],
    metadata: {
      context: [
        "Harry Potter was written by J.K. Rowling.",
        "The Lord of the Rings was written by J.R.R. Tolkien.",
      ],
    },
    output: "J.R.R. Tolkien wrote Harry Potter.",
  },
  {
    input: "What is the largest planet in our solar system",
    tags: ["jupiter"],
    metadata: {
      context: [
        "Jupiter is the largest planet in our solar system.",
        "Saturn has the largest rings in our solar system.",
      ],
    },
    output: "Saturn is the largest planet in our solar system.",
  },
];

// The relevant code for this issue. Note the closure.
function makeGeneratedAnswerReturner(outputs: string[]) {
  // closure over iterator
  let counter = 0;
  return async (_input: string) => {
    counter++;
    return outputs[counter - 1];
  };
}

Eval("mdb-test", {
  experimentName: "rag-metrics",
  metadata: {
    testing: true,
  },

  data: () => {
    return dataset;
  },
  task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
  scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});

While this seems to work fine, it would be clearer, and less reliant on closures (which some folks might be less familiar with), if you could pass additional data to the task() function.

I think a straightforward way to do this would be to allow passing the full contents of the Data object being evaluated to the task() function.

This would give the task function a signature like:

interface Data {
  input: string;
  expected?: string;
  tags?: string[];
  metadata: Record<string, unknown>;
}
type TaskFunc = (data: Data) => string | Promise<string>;

Then I could include any pre-generated answers or other case-specific data that I want to use in the Data.metadata object. For example, this could look like:

const dataset = [
  {
    input: "What is the capital of France",
    tags: ["paris"],
    metadata: {
      context: [
        "The capital of France is Paris.",
        "Berlin is the capital of Germany.",
      ],
      output: "Paris is the capital of France.",
    },
  },
];

Eval("mdb-test", {
  experimentName: "rag-metrics",
  data: () => {
    return dataset;
  },
  // Now the task() func takes the whole data object
  task(data) {
    return data.metadata.output;
  },
  scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});
ankrgyl commented 2 months ago

I would suggest using the logging SDK (see here: https://www.braintrust.dev/docs/guides/evals/write#logging-sdk) if you have pre-generated outputs.
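
For reference, a rough sketch of that approach with the dataset above, assuming the init / log / summarize pattern described in the linked guide, and reusing dataset and evaluatorLlmConf from the earlier snippet:

import { init } from "braintrust";
import { Faithfulness } from "autoevals";

async function main() {
  // init() opens (or creates) an experiment in the given project; the
  // experiment name here mirrors the experimentName used with Eval() above.
  const experiment = init("mdb-test", { experiment: "rag-metrics" });

  // `dataset` and `evaluatorLlmConf` as defined in the earlier snippet.
  for (const { input, output, metadata } of dataset) {
    // Score the pre-generated output directly; no task() is involved.
    const faithfulness = await Faithfulness({
      input,
      output,
      context: metadata.context,
      ...evaluatorLlmConf,
    });

    experiment.log({
      input,
      output,
      metadata,
      // Repeat for AnswerRelevancy / ContextRelevancy as needed.
      scores: { faithfulness: faithfulness.score ?? 0 },
    });
  }

  console.log(await experiment.summarize());
}

main();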

Also closing this out since it doesn't specifically have to do with autoevals. If you have further questions or feedback about this, feel free to ping us on Discord or in the SDK repo (https://github.com/braintrustdata/braintrust-sdk).

mongodben commented 2 months ago

> I would suggest using the logging SDK (see here: https://www.braintrust.dev/docs/guides/evals/write#logging-sdk) if you have pre-generated outputs.

This seems like exactly what I need, thank you!