carlini / yet-another-applied-llm-benchmark

A benchmark to evaluate language models on questions I've previously asked them to solve.

Noisy code extraction #18

Closed: 1wheel closed this issue 3 months ago

1wheel commented 3 months ago

`disconnectedchildren` isn't in the model-generated output:

https://nicholas.carlini.com/writing/2024/evaluation_examples/make_tree_from_text.py.TestMakeTreeEasy_claude-3-5-sonnet-20240620.html#tab1

Not sure of the best way to fix this in general. We could rerun extraction when code tasks fail, or do something slightly fancier, like checking that all the whitespace-trimmed lines of the generated code are present in the extracted code and vice versa.
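
Roughly something like this, where the helpers are just illustrative and not anything in this repo:

```python
def trimmed_lines(code: str) -> set[str]:
    """Non-empty lines of `code` with surrounding whitespace stripped."""
    return {line.strip() for line in code.splitlines() if line.strip()}


def extraction_looks_complete(generated_code: str, extracted_code: str) -> bool:
    """True when every trimmed line of the generated code appears in the
    extracted code and vice versa (set comparison, so duplicate lines
    aren't counted; a multiset would be stricter)."""
    return trimmed_lines(generated_code) == trimmed_lines(extracted_code)
```

Then the harness could rerun extraction (or flag the sample) whenever `extraction_looks_complete` returns False.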

carlini commented 3 months ago

Yeah, Claude 3.5 seems to fail in a similar way here:

https://nicholas.carlini.com/writing/2024/evaluation_examples/make_sqlite_table.py.TestSqlMakeTable_claude-3-5-sonnet-20240620.html#tab1

I'm actually more or less okay with this "failure" mode. The prompting is explicit enough about what the model is supposed to do, and if the model doesn't follow it, then the model is wrong.

I do agree, though, that it under-reports utility. Maybe there could be a "self-correction" mode that tries to let the model fix dumb mistakes it made.
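
Sketching what I mean (purely hypothetical, not part of the benchmark today; `query_model` stands in for however the harness actually calls the API):

```python
def self_correct(question: str, answer: str, failure: str, query_model) -> str:
    """Ask the model to repair its own previous (failing) answer."""
    followup = (
        question
        + "\n\nYour previous answer was:\n" + answer
        + "\n\nIt failed with:\n" + failure
        + "\n\nPlease reply with a corrected solution."
    )
    return query_model(followup)
```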

1wheel commented 3 months ago

Ah, I missed that models extract their own code. Seems fair then.