Dev and whole data of the LogicGame benchmark. Paper: arXiv.
We introduce LogicGame, a benchmark designed to evaluate the logic-rule understanding, execution, and planning capabilities of Large Language Models (LLMs). LogicGame features diverse games with predefined regulations, specifically created to assess logical reasoning independently of mere knowledge recall. The benchmark tests models across various difficulty levels, aiming for a comprehensive evaluation of rule-based reasoning and multi-step execution and planning.
This project is a benchmark data release that includes four .jsonl files: `en_dev`, `zh_dev`, `en_all`, and `zh_all`. These are the English and Chinese versions of the development (dev) and complete (whole) sets, respectively. Each language version corresponds entry-for-entry with its counterpart. The dev sets contain 10 entries each, whereas the whole sets contain 304 entries each.
The `zh_all` and `en_all` files serve as the input data for our Codabench submission: use the contexts provided in these files as prompts to obtain model responses for evaluation. The dev sets are intended primarily for detailed demonstration purposes.
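As a minimal sketch of how the release files can be consumed (assuming each line is a standalone JSON object, per the JSON Lines format; the field names below, such as `context`, are illustrative and may differ from the actual schema):

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy stand-in for one of the release files; the real en_dev/zh_dev
# files ship with this project, and their field names may differ.
toy = [{"id": 0, "context": "Game rules and initial state go here."}]
with open("toy_dev.jsonl", "w", encoding="utf-8") as f:
    for record in toy:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

records = load_jsonl("toy_dev.jsonl")
for record in records:
    prompt = record["context"]  # use the context as the model prompt
    # response = your_model(prompt)  # collect responses for evaluation
print(len(records))
```

The same loop, pointed at the `en_all` or `zh_all` file, would produce the prompts whose responses the Codabench evaluation expects.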
The definitions of the metrics in both tables can be found in the paper. The best performance in each column is marked in bold.
| Model | AP-Acc% | A-Acc% | P-Acc% | IFError% | JSError% |
|---|---|---|---|---|---|
| o1-preview | **54.93** | **67.11** | **66.85** | **0.00** | **0.00** |
| o1-mini | 51.97 | 63.49 | 64.97 | **0.00** | **0.00** |
| claude-3-5-sonnet | 30.26 | 39.47 | 43.20 | **0.00** | **0.00** |
| gpt-4o | 26.97 | 35.86 | 39.25 | 0.33 | **0.00** |
| gpt-4-turbo-0409 | 25.66 | 32.24 | 38.18 | 0.99 | 0.66 |
| glm-4-plus | 21.71 | 28.29 | 32.76 | 9.21 | 0.33 |
| qwen2-72b | 20.39 | 27.96 | 32.61 | 2.63 | 0.99 |
| llama-3-70b | 12.50 | 19.41 | 21.62 | 13.16 | 0.33 |
| claude-3-haiku | 9.54 | 14.80 | 16.82 | 2.63 | 0.33 |
| glm-4-9b | 7.57 | 12.83 | 11.27 | 20.39 | 0.99 |
| internlm2-5-7b | 4.61 | 7.24 | 9.81 | 11.18 | 3.29 |
| llama-3-8b | 3.62 | 5.26 | 9.31 | 35.53 | **0.00** |
| mistral-7b | 2.96 | 3.95 | 6.63 | 26.32 | 6.25 |
| qwen2-7b | 2.63 | 4.61 | 7.32 | 3.29 | 2.96 |
| Model | AP-Acc% | A-Acc% | P-Acc% | IFError% | JSError% |
|---|---|---|---|---|---|
| o1-preview | **53.29** | **65.46** | **64.82** | 0.33 | **0.00** |
| o1-mini | 49.67 | 61.18 | 63.25 | 0.66 | 0.33 |
| claude-3-5-sonnet | 29.28 | 37.17 | 43.48 | 0.33 | **0.00** |
| gpt-4o | 28.29 | 41.12 | 42.43 | 0.66 | 0.66 |
| gpt-4-turbo-0409 | 21.05 | 28.95 | 33.83 | 0.66 | 0.99 |
| glm-4-plus | 17.76 | 24.34 | 28.36 | 6.91 | 0.66 |
| qwen2-72b | 8.88 | 13.82 | 18.56 | 24.67 | 0.66 |
| glm-4-9b | 7.89 | 9.87 | 13.05 | 17.11 | 1.64 |
| internlm2-5-7b | 6.25 | 7.89 | 13.06 | 13.16 | 1.32 |
| claude-3-haiku | 4.93 | 8.55 | 12.60 | **0.00** | 1.64 |
| llama-3-70b | 4.61 | 8.55 | 11.44 | 55.59 | 0.33 |
| mistral-7b | 4.28 | 5.26 | 6.95 | 17.43 | 8.88 |
| qwen2-7b | 1.64 | 3.95 | 5.56 | 1.64 | 8.22 |
| llama-3-8b | 0.00 | 1.64 | 2.85 | 68.42 | 0.33 |