Xylon2 / interrobench

A benchmark for LLMs

Better collection of metrics #1

Open Xylon2 opened 14 hours ago

Xylon2 commented 14 hours ago

Although a summary is printed at the end of each run, I think it would be good to collect more detailed metrics and store them in some sort of database.

The goal would be that I can query it to find out things like:

Xylon2 commented 14 hours ago

I don't normally use databases this way, but I imagine this could be done with a couple of tables in a SQL database. Maybe like:

Table runs:

Table attempts:
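
As a rough illustration of the two-table idea, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The column names (`model`, `started_at`, `problem`, `passed`) and the helper functions are hypothetical placeholders, not the columns proposed in this issue, and SQLite itself is only an assumption about the storage backend.

```python
import sqlite3

# Hypothetical schema: the column names below are illustrative assumptions,
# not the columns proposed in this issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id         INTEGER PRIMARY KEY,
    model      TEXT NOT NULL,
    started_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS attempts (
    id      INTEGER PRIMARY KEY,
    run_id  INTEGER NOT NULL REFERENCES runs(id),
    problem TEXT NOT NULL,
    passed  INTEGER NOT NULL CHECK (passed IN (0, 1))
);
"""


def open_db(path=":memory:"):
    """Open (or create) the metrics database and ensure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_run(conn, model, attempts):
    """Insert one run and its attempts; attempts is a list of (problem, passed)."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO runs (model) VALUES (?)", (model,))
        run_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO attempts (run_id, problem, passed) VALUES (?, ?, ?)",
            [(run_id, problem, int(passed)) for problem, passed in attempts],
        )
    return run_id


def failures_by_model(conn):
    """Return (model, problem, failure_count) rows, most failures first."""
    return conn.execute(
        """
        SELECT r.model, a.problem, COUNT(*) AS failures
        FROM attempts a JOIN runs r ON r.id = a.run_id
        WHERE a.passed = 0
        GROUP BY r.model, a.problem
        ORDER BY failures DESC, r.model, a.problem
        """
    ).fetchall()
```

Keeping one row per attempt, linked to its run by `run_id`, means that questions such as "which problems does a given model fail most often" reduce to a single `GROUP BY` query rather than parsing the end-of-run summaries.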

Xylon2 commented 10 hours ago

I'm hoping this might help to find any failures that are specific to certain LLMs.