Open winston-zillow opened 3 days ago
Thank you for raising these questions. Your understanding of the "process" as the plan an LLM uses to arrive at a solution is correct.
Yes, if by normalization you mean parsing the JSON format from the model's response. We detect the JSON section in the response and parse it into a well-formed JSON object for evaluation.
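A minimal sketch of what that extraction step might look like (hypothetical — the repo's actual parser isn't shown in this thread): find a ```json fenced block in the response, falling back to the outermost brace span, then parse it.

```python
import json
import re

def extract_json(response: str) -> dict:
    """Pull the JSON section out of a model response and parse it.

    Hypothetical sketch: prefer a ```json fenced block; otherwise fall
    back to the first '{' through the last '}' in the raw text.
    """
    m = re.search(r"```json\s*(.*?)```", response, re.DOTALL)
    if m:
        candidate = m.group(1)
    else:
        candidate = response[response.find("{"): response.rfind("}") + 1]
    return json.loads(candidate)
```

A response that fails `json.loads` here would be the kind of "mechanical error" discussed below.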
I get that the response JSON is extracted and parsed, but the answer fields are string values. Do you further normalize those string values?
Samples with such mechanical errors are marked with a score of 0.
While avoiding mechanical errors is a good thing, wouldn't this bias the results toward models that follow instructions or formatting better, rather than those that reason better? (For example, I heard that later versions of gpt-4* were specifically improved for generating JSON.) Since the goal is to evaluate reasoning capability, it may be better to fix at least some of the errors (like the JSError) by hand, as long as the fix guidelines and fix rates are also reported.
character-level similarity
How is this done exactly? Also, can you provide code snippets for the exact similarity measurement? Do you use edit distance?
This matters for the reported scores: if a process answer of length 100 has one extra character inserted at the 6th position, a left-to-right positional match would report only about 5% accuracy, whereas an edit-distance measure would report about a 99% match.
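The two measurements above can be sketched concretely (my own illustration, not the repo's code): a naive left-to-right positional match versus a Levenshtein edit-distance similarity.

```python
def positional_match(a: str, b: str) -> float:
    """Fraction of aligned positions where the characters agree,
    over the longer string's length (naive left-to-right match)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: insert / delete / substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """1 minus normalized edit distance."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# 100 distinct characters, with one extra character inserted at position 6:
ref = "".join(chr(0x4E00 + i) for i in range(100))
hyp = ref[:5] + "X" + ref[5:]
print(positional_match(ref, hyp))  # ~0.05: only the first 5 positions align
print(edit_similarity(ref, hyp))   # ~0.99: a single insertion out of 101
```

This is exactly the gap described above: one inserted character shifts every later position, so the positional score collapses while the edit-distance score barely moves.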
In the paper, it said,
My understanding is that the process is the plan the LLM comes up with to arrive at the solution.