1wheel closed this 3 months ago
Yeah. Claude 3.5 seems to fail in a similar way here.
I'm actually more or less okay with this "failure" mode. The prompting is explicit enough on what the model is supposed to do, and if it doesn't, then the model is wrong.
I do agree, though, that it under-reports utility. Maybe there could be a "self-correction" mode that lets the model try to fix dumb mistakes it made.
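To make the "self-correction" idea concrete, here is a rough sketch of what such a mode might look like. Nothing here is taken from the benchmark's code: `generate` and `check` are hypothetical callables standing in for "query the model" and "run the test, returning an error message or `None` on success", and the retry prompt wording is made up for illustration.

```python
from typing import Callable, Optional


def solve_with_self_correction(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: prompt -> model answer
    check: Callable[[str], Optional[str]],   # hypothetical: answer -> error text, or None if it passes
    max_attempts: int = 2,
) -> tuple[bool, int]:
    """Return (passed, attempts_used), retrying with the failure fed back to the model."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        answer = generate(current_prompt)
        error = check(answer)
        if error is None:
            return True, attempt
        # Show the model its previous answer and the failure, then ask again.
        current_prompt = (
            f"{prompt}\n\nYour previous answer:\n{answer}\n\n"
            f"It failed with:\n{error}\n\nPlease fix it."
        )
    return False, max_attempts
```

Reporting the attempt count alongside pass/fail would keep the strict first-try score available while also showing how often one round of feedback is enough.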
Ah, I missed that models extract their own code. Seems fair then.
disconnectedchildren isn't in the model-generated output: https://nicholas.carlini.com/writing/2024/evaluation_examples/make_tree_from_text.py.TestMakeTreeEasy_claude-3-5-sonnet-20240620.html#tab1
Not sure of the best way to fix this generally. One option is to rerun extraction when code tasks fail. Something slightly fancier would be to check that all the whitespace-trimmed lines of the generated code are present in the extracted code, and vice versa.
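A minimal sketch of that whitespace-trimmed line check, assuming the original response wraps its code in triple-backtick fences. The function names (`fenced_code`, `trimmed_lines`, `extraction_mismatch`) are made up for illustration and are not part of the benchmark:

```python
import re


def fenced_code(response: str) -> str:
    """Concatenate the bodies of all triple-backtick fenced blocks in a response."""
    blocks = re.findall(r"```[^\n]*\n(.*?)```", response, re.DOTALL)
    return "\n".join(blocks)


def trimmed_lines(text: str) -> set[str]:
    """Return the set of non-empty lines with surrounding whitespace removed."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def extraction_mismatch(generated_code: str, extracted_code: str) -> tuple[set[str], set[str]]:
    """Return (lines missing from the extraction, lines the extraction added)."""
    generated = trimmed_lines(generated_code)
    extracted = trimmed_lines(extracted_code)
    return generated - extracted, extracted - generated
```

If either returned set is non-empty, the extraction dropped or invented a line, and the harness could rerun extraction before scoring. Comparing sets ignores line order and duplicates, so this is only a cheap sanity check, not proof that the two versions are identical.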