Open stalkermustang opened 2 months ago
Preliminarily: 61%. Found a bug in two of the test cases because of it, so will fix those and then upload results this weekend.
(Caveat though I don't know how to correct for: I can't control the temperature of O1. I think that 4o/3.5 sonnet do slightly better on lower temperature than I run at, so I don't know how to best handle this. It'll require some thought.)
IIRC you can't set system prompts or temperature for o1 series. There's no workaround as far as I know. The idea behind limiting the temperature is that the model needs to be "creative" during the thought process. It will generate new ideas and try new things if it struggles to progress on the chosen path.
Yeah. I'm trying to decide if I want to have the evaluation grid shown at a higher temperature so you can see more diversity in outputs, but then report the "best accuracy" as temperature=0 for other models too now or something like that.
Hey, really curious how new OAI models (either mini or preview) perform here. Looking forward to checking the updated LB 🙌🙌