Open Xylon2 opened 14 hours ago
Although a summary is printed at the end of each run, I think it would be good to collect more detailed metrics and store them in some sort of database.
The goal would be that I can query it to find out things like:
I'm hoping this might help to find any failures that are idiosyncratic to certain LLMs.
I don't normally use databases this way, but I imagine this could be done with a couple of tables in a SQL database. Maybe like:
Table runs:
Table attempts:
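To make the idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module, with one `runs` table and one `attempts` table. The column names (`model`, `started_at`, `task`, `succeeded`, `error`) and the helper functions are assumptions made up for illustration; the real schema would depend on which metrics the tool can actually report.

```python
import sqlite3

# Hypothetical schema: one row per run, one row per attempt within a run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id         INTEGER PRIMARY KEY,
    model      TEXT NOT NULL,       -- which LLM was used (assumed column)
    started_at TEXT NOT NULL        -- ISO 8601 timestamp (assumed column)
);
CREATE TABLE IF NOT EXISTS attempts (
    id        INTEGER PRIMARY KEY,
    run_id    INTEGER NOT NULL REFERENCES runs(id),
    task      TEXT NOT NULL,        -- what was attempted (assumed column)
    succeeded INTEGER NOT NULL,     -- 1 = success, 0 = failure
    error     TEXT                  -- short failure label, NULL on success
);
"""


def open_db(path=":memory:"):
    """Open (or create) the metrics database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_run(conn, model, started_at):
    """Insert a run and return its id."""
    cur = conn.execute(
        "INSERT INTO runs (model, started_at) VALUES (?, ?)",
        (model, started_at),
    )
    return cur.lastrowid


def record_attempt(conn, run_id, task, succeeded, error=None):
    """Insert one attempt belonging to a run."""
    conn.execute(
        "INSERT INTO attempts (run_id, task, succeeded, error) VALUES (?, ?, ?, ?)",
        (run_id, task, int(succeeded), error),
    )


def failures_by_model(conn):
    """Count failed attempts grouped by model and error label."""
    return conn.execute(
        """
        SELECT r.model, a.error, COUNT(*) AS n
        FROM attempts AS a
        JOIN runs AS r ON r.id = a.run_id
        WHERE a.succeeded = 0
        GROUP BY r.model, a.error
        ORDER BY n DESC, r.model, a.error
        """
    ).fetchall()
```

A query like `failures_by_model(conn)` is the kind of thing that could surface failures specific to one LLM: if a given error label shows up almost exclusively under one model, that is worth a closer look.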