FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

Evaluation: write function to run evaluation code during promptfoo evaluation #81

Closed: dearden closed this issue 4 months ago

dearden commented 4 months ago

Overview

We have a script which lets us run promptfoo to compare prompts. This is good. However, we still need to write the actual tests that are run against the prompt output.

We want to automatically run our evaluation code on each promptfoo evaluation output.

This will involve adapting code found in src/evaluation.py.

For the test we'll need a Python script that matches a specific format. From the promptfoo docs:

This file will be called with an output string and an AssertContext object (see above). It expects that either a bool (pass/fail), float (score), or GradingResult will be returned.

Requirements

  1. Write a Python script with a get_assert method, with the inputs/outputs specified in the documentation (see notes); a rough sketch of such a script follows this list.
  2. Write Python tests to verify that the script successfully runs evaluation on known input.
  3. Evaluation should produce an F1 score, precision, and recall.
  4. Set a sensible threshold that determines pass/fail for the test.
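
As a starting point, the assertion script might look something like the sketch below. This is only a minimal illustration, not the final implementation: it assumes the model output and the expected labels can both be parsed as comma-separated lists, that the expected labels live in context['vars']['expected'], and that 0.5 is a placeholder threshold. The real version should reuse the metric code in src/evaluation.py rather than recomputing metrics inline.

# custom_assert.py -- rough sketch only; names and threshold are assumptions.
from typing import Any, Dict, Union

THRESHOLD = 0.5  # placeholder pass/fail threshold


def _parse_labels(text: str) -> set:
    """Split a comma-separated string into a set of normalised labels."""
    return {label.strip().lower() for label in text.split(",") if label.strip()}


def get_assert(output: str, context) -> Union[bool, float, Dict[str, Any]]:
    predicted = _parse_labels(output)
    expected = _parse_labels(context["vars"]["expected"])

    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # GradingResult dict: pass/fail against the threshold, overall score, and a reason string.
    return {
        "pass": f1 >= THRESHOLD,
        "score": f1,
        "reason": f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}",
    }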

Notes and additional information

Example of a script

from typing import Any, Dict, Union

def get_assert(output: str, context) -> Union[bool, float, Dict[str, Any]]:
    print('Prompt:', context['prompt'])
    print('Vars', context['vars']['topic'])

    # This return is an example GradingResult dict
    return {
        'pass': True,
        'score': 0.6,
        'reason': 'Looks good to me',
    }
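
For requirement 2, a minimal pytest-style test against a known input could look like the following. The module name custom_assert, the label values, and the perfect-match expectation are illustrative assumptions; the real tests would use inputs drawn from the project's evaluation data.

# test_custom_assert.py -- sketch of a test for the assertion script.
# Assumes the sketch above is saved as custom_assert.py.
from custom_assert import get_assert


def test_get_assert_on_known_input():
    context = {"vars": {"expected": "misleading, unsupported"}}
    result = get_assert("misleading, unsupported", context)

    # A perfect match should give precision = recall = f1 = 1.0 and pass.
    assert result["pass"] is True
    assert result["score"] == 1.0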