UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://inspect.ai-safety-institute.org.uk/
MIT License

Question about running code on eval results #828

Open AarushSah opened 1 week ago

AarushSah commented 1 week ago

Hi! Is there an easy way to run code on the output of a Task from within a Task declaration? Currently, I'm doing something along the lines of this:

from inspect_ai import eval

results = eval(my_eval(), model="groq/llama3-8b-8192")  # returns a list of EvalLog objects
process_eval(results)

but I'd like to be able to define some code to run on the output of the eval within the function that defines the task, so that process_eval runs even when I call my_eval from the CLI. Is there any native way to do that?

Thanks in advance!

jjallaire commented 1 week ago

There isn't currently a hook for something like this. If you could further delineate the specific use case, we can consider what might work well. Note that we are likely to soon provide some ability to control task execution (have Task include a run() function you can override, or something comparable with functional hooks), which would probably fit the bill. We've also discussed introducing a results filter that lets you make arbitrary changes to the log file before it's returned/written.
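
In the meantime, one way to approximate this without a native hook is to post-process the log files that the CLI writes, using inspect_ai's log-reading API. A minimal sketch, assuming logs are written to ./logs and that process_eval (the user's own function) accepts an EvalLog:

# Sketch only: post-process logs already written by the CLI.
# Assumes ./logs as the log directory; process_eval is the user's own function.
from inspect_ai.log import list_eval_logs, read_eval_log

def process_logs(log_dir: str = "./logs") -> None:
    for log_info in list_eval_logs(log_dir):
        log = read_eval_log(log_info)  # full EvalLog, including samples
        process_eval(log)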

AarushSah commented 1 week ago

The goal is to provide more detailed speed metrics from the timestamps and token counts available in the EvalLog. The run feature would be awesome, and being able to write the new metrics to the log file would be great as well.
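
For illustration, the kind of derived metric meant here is throughput computed from timestamps and token counts; a tiny sketch (the function and field names are just examples, not part of inspect_ai):

# Illustrative only: throughput from two event timestamps and a token count.
from datetime import datetime

def tokens_per_second(start: datetime, end: datetime, completion_tokens: int) -> float:
    elapsed = (end - start).total_seconds()
    return completion_tokens / elapsed if elapsed > 0 else 0.0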

jjallaire commented 1 week ago

I wonder if you could just do some of this in a solver or scorer? Exactly which data structures are you wanting to access and compute on?
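
For reference, token counts (though not provider-specific timing fields) can be surfaced from a scorer via the model output on TaskState. A rough sketch, assuming the standard usage fields are populated:

# Sketch: recording token usage from a scorer via Score metadata.
# Assumes state.output.usage is populated; provider-specific timings
# (e.g. Groq's prompt_time) are not available here.
from inspect_ai.scorer import Score, Target, mean, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[mean()])
def token_usage():
    async def score(state: TaskState, target: Target) -> Score:
        usage = state.output.usage
        return Score(
            value=usage.total_tokens if usage else 0,
            metadata={
                "input_tokens": usage.input_tokens if usage else None,
                "output_tokens": usage.output_tokens if usage else None,
            },
        )
    return score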

AarushSah commented 1 week ago

Here's the data I'm looking to access (see the sketch after this list):

  1. EvalLog.results:

    • total_samples: Total number of samples in the evaluation
    • completed_samples: Number of successfully completed samples
  2. EvalLog.samples: Individual sample data, where each sample has:

    • events: List of events for the sample, containing:
      • event: Type of event (e.g., 'sample_init', 'step', 'model')
      • timestamp: When the event occurred
      • For 'model' events specifically, call.response['usage'] (Groq-specific) contains:
        • prompt_time: Time taken for prompt processing
        • completion_time: Time taken for completion generation
        • total_time: Total processing time
        • prompt_tokens: Number of input tokens
        • completion_tokens: Number of output tokens
        • total_tokens: Total tokens processed
  3. EvalLog.eval:

    • model: Name of the model being evaluated
  4. EvalLog.plan.config:

    • seed: Seed used for the run
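
Putting the list above together, a rough sketch of computing over a completed EvalLog (this assumes samples are present in the log and that model events carry Groq's provider-specific fields in call.response['usage']):

# Sketch: pulling the fields listed above out of an EvalLog.
# Assumes full samples in the log and Groq-style usage fields in
# the raw model call response; adjust for other providers.
from inspect_ai.log import EvalLog

def speed_summary(log: EvalLog) -> dict:
    summary = {
        "model": log.eval.model,
        "seed": log.plan.config.seed,
        "total_samples": log.results.total_samples if log.results else None,
        "completed_samples": log.results.completed_samples if log.results else None,
        "total_time": 0.0,
        "total_tokens": 0,
    }
    for sample in log.samples or []:
        for event in sample.events:
            if event.event == "model" and event.call is not None:
                usage = event.call.response.get("usage", {})
                summary["total_time"] += usage.get("total_time", 0.0)
                summary["total_tokens"] += usage.get("total_tokens", 0)
    return summary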

AarushSah commented 1 week ago

Hey @jjallaire! Just following up: is it possible to access all of the above from a scorer or a solver? All of this data is needed for what I'm computing.