carlini / yet-another-applied-llm-benchmark

A benchmark to evaluate language models on questions I've previously asked them to solve.
GNU General Public License v3.0

o1 eval results #22

Open · stalkermustang opened this issue 2 months ago

stalkermustang commented 2 months ago

Hey, really curious how the new OAI models (either mini or preview) perform here. Looking forward to checking the updated leaderboard 🙌🙌

carlini commented 2 months ago

Preliminarily: 61%. Found a bug in two of the test cases because of it, so will fix those and then upload results this weekend.

carlini commented 2 months ago

(One caveat I don't know how to correct for: I can't control the temperature of o1. I think 4o and 3.5 Sonnet do slightly better at a lower temperature than the one I run at, so I'm not sure how best to handle this. It'll require some thought.)

stalkermustang commented 2 months ago

IIRC you can't set system prompts or temperature for the o1 series. There's no workaround as far as I know. The idea behind locking the temperature is that the model needs to be "creative" during the thought process: it will generate new ideas and try new things if it struggles to make progress on its chosen path.
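
For context, a minimal sketch of what that constraint looks like in a harness, assuming the official `openai` Python client. The `complete` helper, the model-name check, and the choice to fold the system prompt into the user turn are illustrative assumptions, not the benchmark's actual LLM wrapper:

```python
# Sketch: drop the parameters the o1-series endpoints reject, pass them for other models.
from openai import OpenAI

client = OpenAI()

def complete(model: str, system: str, user: str, temperature: float = 0.7) -> str:
    if model.startswith("o1"):
        # o1 models accept neither a system message nor a temperature override,
        # so fold the system prompt into the user turn and omit temperature.
        messages = [{"role": "user", "content": f"{system}\n\n{user}"}]
        resp = client.chat.completions.create(model=model, messages=messages)
    else:
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ]
        resp = client.chat.completions.create(
            model=model, messages=messages, temperature=temperature
        )
    return resp.choices[0].message.content
```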

carlini commented 2 months ago

Yeah. I'm trying to decide whether to show the evaluation grid at a higher temperature so you can see more diversity in the outputs, but then report the "best accuracy" numbers at temperature=0 for the other models too, or something like that.