UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://inspect.ai-safety-institute.org.uk/
MIT License
587 stars, 105 forks

plot_results(): Are there any frameworks that allow summarising and visualising inspect logs? #704

Open sohaibimran7 opened 4 days ago

sohaibimran7 commented 4 days ago

Many evaluation tools have companion frameworks for summarising and visualising results; an example is Zeno for lm-eval-harness. I understand that results summarisation and visualisation needs can be quite diverse, and one tool may not work for everyone. Still, I think that if Inspect AI logs could be easily summarised and visualised, researchers could iterate faster. I wrote a very quick and dirty class for visualising a list of EvalLogInfos for my own experiments, and was wondering what other people use and whether there is interest in results summarisation and visualisation support for Inspect.

jjallaire-aisi commented 4 days ago

This is definitely something we are interested in supporting more deeply! We are soon going to make it possible to run a set of analysis code on top of an eval-set and then display the output in the viewer. At the same time, we will hopefully discover some useful common idioms and tools that we can provide. Would love to hear from people on this thread about what the general shape of the requirements is!

sohaibimran7 commented 4 days ago

I personally would value the following in a visualisation framework:

  1. Ability to categorise logs by {log_dir, run_id, task, dataset, scorer and model}
  2. Ability to categorise more finely based on substrings of {model, task, log_dir}
  3. Ability to filter, sort and rename logs based on the categorisations
  4. Ability to map each category to a plotting element {x axis, y axis, x offset, y offset, colour, horizontal and vertical faceting in a multi-plot figure}
  5. Ability to plot any figure I like (bar charts, box plots, violins etc.)
  6. Extensibility
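
To make the list above a bit more concrete, here is a minimal sketch of items 1 to 4 using only the Python standard library. Everything in it is an assumption for illustration: `LogRecord`, `label_by_substring` and `plot_table` are hypothetical names rather than part of inspect_ai, the records are made-up data, and in practice the fields would have to be filled in from real eval logs. The nested dict it produces is library-agnostic, so it could be handed to whatever plotting code is preferred (item 5).

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from statistics import mean


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical flat summary of one eval log (not an inspect_ai type)."""
    log_dir: str
    run_id: str
    task: str
    dataset: str
    scorer: str
    model: str
    score: float
    label: str = ""  # finer category derived from a substring match


def label_by_substring(records, field, labels, default="other"):
    """Item 2: set `label` from the first substring found in `field`."""
    out = []
    for r in records:
        value = getattr(r, field)
        label = next((name for sub, name in labels.items() if sub in value), default)
        out.append(replace(r, label=label))
    return out


def plot_table(records, x, colour, facet):
    """Items 1 and 4: group by three fields, each mapped to a plotting role.

    Returns {facet value: {x value: {colour value: mean score}}}.
    """
    groups = defaultdict(list)
    for r in records:
        key = (getattr(r, facet), getattr(r, x), getattr(r, colour))
        groups[key].append(r.score)
    table = {}
    for (f, xv, c), scores in sorted(groups.items()):  # sorted for stable order
        table.setdefault(f, {}).setdefault(xv, {})[c] = mean(scores)
    return table


# Made-up records purely for illustration.
records = [
    LogRecord("logs/a", "r1", "task_a", "data_a", "match", "provider/big-v1", 0.9),
    LogRecord("logs/a", "r1", "task_a", "data_a", "match", "provider/big-v1-mini", 0.8),
    LogRecord("logs/b", "r2", "task_b", "data_b", "match", "provider/big-v1", 0.75),
    LogRecord("logs/b", "r3", "task_b", "data_b", "match", "provider/big-v1", 0.25),
]

labelled = label_by_substring(records, "model", {"mini": "small"}, default="large")
kept = [r for r in labelled if r.scorer == "match"]  # item 3: filter
table = plot_table(kept, x="label", colour="model", facet="task")
```

Keeping the categorisation step separate from the plotting step is one way to get the extensibility in item 6: new categories or plot types can be added without touching the other half.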