Better collection of metrics

Xylon2 / interrobench

A benchmark for LLMs


Better collection of metrics #1

Open Xylon2 opened 14 hours ago

Xylon2 commented 14 hours ago

Although a summary is printed at the end of each run, I think it would be good to collect more detailed metrics and store them in some sort of database.

The goal would be that I can query it to find out things like:

which question do the LLMs get wrong most often?
which questions does this LLM get wrong that others normally get right?
what is the failure rate for this question for this LLM?
which questions cause the most errors?

Xylon2 commented 14 hours ago

I don't normally use databases this way, but I imagine this could be done with a couple of tables in a SQL database. Maybe like:

Table runs:

id
model identifier
benchmark version
config (json dump)
datetime start
datetime end
final score

Table attempts:

id
run_id
question
result
attempt index
time taken
tool calls
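
A minimal sketch of how these two tables might look, assuming SQLite via Python's standard sqlite3 module. The column names, the types, the 'pass'/'fail'/'error' values for result, and storing tool calls as a count are all assumptions made for illustration; none of them come from the project itself. The helper failure_rate is hypothetical and answers the "failure rate for this question for this LLM" query.

```python
import sqlite3

# Assumed schema: names and types are guesses based on the two lists above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id                INTEGER PRIMARY KEY,
    model_identifier  TEXT NOT NULL,
    benchmark_version TEXT NOT NULL,
    config            TEXT,   -- JSON dump of the run configuration
    datetime_start    TEXT,   -- ISO 8601 timestamp
    datetime_end      TEXT,   -- ISO 8601 timestamp
    final_score       REAL
);
CREATE TABLE IF NOT EXISTS attempts (
    id            INTEGER PRIMARY KEY,
    run_id        INTEGER NOT NULL REFERENCES runs(id),
    question      TEXT NOT NULL,
    result        TEXT NOT NULL,   -- assumed values: 'pass', 'fail', 'error'
    attempt_index INTEGER NOT NULL,
    time_taken    REAL,            -- assumed unit: seconds
    tool_calls    INTEGER          -- assumed to be a count of tool calls
);
"""


def open_db(path=":memory:"):
    """Open (or create) the metrics database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def failure_rate(conn, model, question):
    """Fraction of attempts by `model` on `question` that did not pass.

    Returns None when there are no matching attempts.
    """
    row = conn.execute(
        """
        SELECT AVG(a.result != 'pass')
        FROM attempts a
        JOIN runs r ON r.id = a.run_id
        WHERE r.model_identifier = ? AND a.question = ?
        """,
        (model, question),
    ).fetchone()
    return row[0]
```

Keeping one row per attempt rather than one row per question means the per-run summary can always be recomputed, while the per-question queries stay simple aggregates.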

Xylon2 commented 10 hours ago

I'm hoping this might help to find any failures that are idiosyncratic to certain LLMs.
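
One way such idiosyncratic failures could be surfaced, sketched in Python with the standard sqlite3 module. It assumes tables named runs (with id and model_identifier) and attempts (with run_id, question and result) already exist, and that a result of 'pass' marks a correct answer; those names and values are assumptions. The function idiosyncratic_failures and the min_gap threshold are hypothetical, chosen only to illustrate comparing one model's failure rate with the average of the other models on the same question.

```python
import sqlite3

# Per-model, per-question failure rates, then a self-join to compare one
# model against the average of every other model on the same question.
QUERY = """
WITH rates AS (
    SELECT r.model_identifier      AS model,
           a.question              AS question,
           AVG(a.result != 'pass') AS fail_rate
    FROM attempts a
    JOIN runs r ON r.id = a.run_id
    GROUP BY r.model_identifier, a.question
)
SELECT mine.question,
       mine.fail_rate,
       AVG(others.fail_rate) AS others_fail_rate
FROM rates mine
JOIN rates others
  ON others.question = mine.question
 AND others.model != mine.model
WHERE mine.model = ?
GROUP BY mine.question, mine.fail_rate
HAVING mine.fail_rate - AVG(others.fail_rate) >= ?
ORDER BY mine.fail_rate - AVG(others.fail_rate) DESC
"""


def idiosyncratic_failures(conn, model, min_gap=0.5):
    """Questions where `model` fails at least `min_gap` more often than the others.

    Returns (question, fail_rate, others_fail_rate) tuples, largest gap first.
    """
    return conn.execute(QUERY, (model, min_gap)).fetchall()
```

Questions that every model fails have a gap of zero and are filtered out, so the result only lists failures specific to the chosen model.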