Hypatiaalegra / LogicGame-Data

Dev and Test Data of LogicGame benchmark
Apache License 2.0

Clarification on evaluation protocol #1

Open winston-zillow opened 3 days ago

winston-zillow commented 3 days ago

The paper says:

Process Accuracy (P-Acc): This metric assesses the correctness of the process, measuring the percentage match based on character-level similarity between the provided process and the expected process

My understanding is that the process is the plan the LLM comes up with to arrive at the solution.

  1. Are all tasks in the dataset guaranteed to have only one correct process?
  2. Are the outputs normalized before evaluating character-level similarity? Can you provide the code snippets used?
  3. I noticed that you also reported the IFError and JSError. In the case of such mechanical errors, do you fix the outputs and proceed to evaluate the AP-Acc, or do you just mark those samples with a score of 0?
Hypatiaalegra commented 2 days ago

Thank you for raising these questions. Your understanding of the "process" as the plan an LLM uses to arrive at a solution is correct.

  1. Yes, we specifically designed the tasks in the dataset and the output constraints to ensure that each task can only be solved correctly by following one exact process.
  2. Yes, if by normalization you mean the parsing of the JSON format in the model's response. We detect the JSON section in the response and parse it into the correct JSON format for evaluation (a simplified sketch of this kind of extraction follows below this list).
  3. No, we do not fix the outputs in cases of mechanical errors like IFError or JSError. Samples with such mechanical errors are marked with 0 scores, because these errors can manifest in various forms and often have no straightforward fix; repairing them would require extensive human intervention that is not feasible for us.
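For reference, the extraction in point 2 is conceptually similar to the minimal sketch below. This is a simplified illustration rather than the exact evaluation script (the function name and fallback behavior here are assumptions):

```python
import json

def extract_json(response: str):
    """Pull a JSON object out of a model response (simplified illustration).

    Takes the span from the first '{' to the last '}' and tries to parse it.
    Returns None if nothing parses, which this sketch treats as a format error.
    """
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
```
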
winston-zillow commented 2 days ago

Yes, if by normalization you mean the parsing of the JSON format in the model's response. We detect the JSON section in the response and parse it into the correct JSON format for evaluation.

I get that the response JSON is extracted and parsed. But the answer fields are string values. Do you further normalize the string values?
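To make the question concrete, by "normalize" I mean something like the following applied to both the predicted and reference strings before comparison (a hypothetical example of my own, not taken from your code):

```python
import re

def normalize_answer(value: str) -> str:
    """Hypothetical normalization: trim, collapse internal whitespace, lowercase."""
    return re.sub(r"\s+", " ", value).strip().lower()

# e.g. normalize_answer("  Move  A -> B ") == "move a -> b"
```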

These samples with mechanical errors are marked with 0 scores.

While avoiding mechanical errors is a good thing, wouldn't this bias the benchmark toward models that follow instructions or format output better, rather than models that reason better? (For example, I heard that later versions of gpt-4* were specifically improved for generating JSON.) Since the goal is to evaluate reasoning capability, it may be better to fix at least some of the errors (like the JSError) by hand, as long as the fix guidelines and fix rates are also reported.

winston-zillow commented 2 days ago

character-level similarity

How is this done exactly? Also, can you provide code snippets for the exact similarity measurement? Do you use edit distance?

If a process answer of length 100 has an extra character inserted at the 6th position, then a left-to-right match would report only 5% accuracy, whereas an edit-distance-based measure would report a ~99% match.
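For concreteness, here is a minimal sketch (my own illustration, not your evaluation code) contrasting the two behaviors on exactly this case:

```python
def positional_accuracy(pred: str, ref: str) -> float:
    """Fraction of reference positions whose character matches the prediction left-to-right."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(ref), 1)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute, unit cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    """1 minus the edit distance normalized by the longer string."""
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref), 1)

ref = "".join(chr(ord("a") + i % 26) for i in range(100))  # length-100 reference process
pred = ref[:5] + "#" + ref[5:]                             # one extra character at the 6th position
print(f"left-to-right match: {positional_accuracy(pred, ref):.0%}")  # 5%
print(f"edit similarity:     {edit_similarity(pred, ref):.0%}")      # 99%
```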