QwenLM / Qwen2.5-Math

A series of math-specific large language models of our Qwen2 series.
https://qwenlm.github.io/blog/qwen2-math/

Abnormal TIR experiment results #26

Open wangzhihao-coder opened 1 month ago

wangzhihao-coder commented 1 month ago

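I ran experiments on the 1.5B and 7B Qwen2.5-Math models following the TIR prompt, and the resulting metrics are worse than with CoT. I suspect my implementation is missing some steps. Could you describe the implementation in more detail? I implemented TIR based on the prompt below: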

# TIR
messages = [
    {"role": "system", "content": "Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt}
]

rangmiao commented 1 month ago

I ran into the same problem.

hav4ik commented 1 month ago

Hi, same for me, for both 1.5B and 7B. My numbers so far:

| Model | MATH CoT | MATH CoT (maj@8) | MATH TIR |
|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 0.7646 | N/A | 0.7534 |
| Qwen2.5-7B-Instruct | 0.8408 | 0.8804 | 0.8088 |

The CoT numbers are a bit higher than reported because I used MARIO_EVAL to compare results with ground truth. I'm using vLLM for inference, with top_p=1.0 and temperature=0.
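
For readers unfamiliar with the maj@8 column: it usually means sampling eight solutions per problem and scoring the most common final answer. The snippet below is a minimal sketch of that voting step only. The function name `majority_vote` and the sample answers are made up for illustration, and it assumes the final answers have already been extracted from the `\boxed{}` output as strings.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer that was sampled first.
    """
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None
    counts = Counter(valid)
    # most_common keeps first-seen order among answers with equal counts
    return counts.most_common(1)[0][0]


# Eight hypothetical final answers extracted from eight samples of one problem
samples = ["5", "5", "25", "5", None, "5", "25", "5"]
print(majority_vote(samples))  # 5
```

Voting is only informative when the samples differ, so it is normally paired with a nonzero sampling temperature. Answers are compared here as raw strings; a real evaluation would likely normalize equivalent forms first.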


Failure cases

I checked a few failure cases and they looked strange to me. Take problem 50 in the MATH test set, for example. The model called a function, sympy.complete_square, which clearly does not exist:

<|im_start|>system
Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
Find the radius of the circle with equation $x^2 + 8x + y^2 - 6y = 0$.<|im_end|>
<|im_start|>assistant
To find the radius of the circle given by the equation \(x^2 + 8x + y^2 - 6y = 0\), we need to rewrite the equation in the standard form of a circle's equation \((x - h)^2 + (y - k)^2 = r^2\). This process is called completing the square.

Let's break it down step-by-step:

1. Start with the given equation:
   \[
   x^2 + 8x + y^2 - 6y = 0
   \]

2. Complete the square for the \(x\) terms. The coefficient of \(x\) is 8, so we take half of 8 (which is 4) and square it (which is 16). Add and subtract 16 inside the equation:
   \[
   x^2 + 8x + 16 - 16 + y^2 - 6y = 0
   \]
   This can be rewritten as:
   \[
   (x + 4)^2 - 16 + y^2 - 6y = 0
   \]

3. Next, complete the square for the \(y\) terms. The coefficient of \(y\) is -6, so we take half of -6 (which is -3) and square it (which is 9). Add and subtract 9 inside the equation:
   \[
   (x + 4)^2 - 16 + y^2 - 6y + 9 - 9 = 0
   \]
   This can be rewritten as:
   \[
   (x + 4)^2 + (y - 3)^2 - 25 = 0
   \]

4. Finally, move the constant term to the right side of the equation:
   \[
   (x + 4)^2 + (y - 3)^2 = 25
   \]

Now the equation is in the standard form \((x - h)^2 + (y - k)^2 = r^2\), where \((h, k)\) is the center of the circle and \(r\) is the radius. From the equation \((x + 4)^2 + (y - 3)^2 = 25\), we can see that the center of the circle is \((-4, 3)\) and the radius \(r\) is \(\sqrt{25} = 5\).

So, the radius of the circle is \(5\).

Let's confirm this with Python code.
```python
import sympy as sp

# Define the variables
x, y = sp.symbols('x y')

# Define the equation of the circle
equation = x**2 + 8*x + y**2 - 6*y

# Complete the square for x and y terms
x_term = sp.complete_square(x**2 + 8*x, full=True)[0]
y_term = sp.complete_square(y**2 - 6*y, full=True)[0]

# Rewrite the equation
completed_square_eq = x_term + y_term - sp.complete_square(x**2 + 8*x, full=True)[1] - sp.complete_square(y**2 - 6*y, full=True)[1]

# Simplify the equation
completed_square_eq = sp.simplify(completed_square_eq)

# Extract the radius
radius_squared = completed_square_eq.rhs
radius = sp.sqrt(radius_squared)

print(radius)
```

Does the Qwen team have an in-house version of SymPy containing functions that do not exist in the main repo?
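
A quick way to triage failure cases like this one is to check whether a function the model calls exists at all, and to recompute the expected answer independently. The sketch below uses only the standard library. The helper names `module_has_attr` and `circle_radius_squared` are hypothetical, and `math` stands in for SymPy so the snippet runs without extra installs.

```python
import importlib
from fractions import Fraction


def module_has_attr(module_name: str, attr: str) -> bool:
    """Return True if the module can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def circle_radius_squared(d: int, e: int, f: int) -> Fraction:
    """Radius squared of x^2 + d*x + y^2 + e*y + f = 0, by completing the square.

    (x + d/2)^2 + (y + e/2)^2 = (d/2)^2 + (e/2)^2 - f
    """
    return Fraction(d, 2) ** 2 + Fraction(e, 2) ** 2 - f


# `math` is a stand-in for any installed library; pass "sympy" if it is available.
print(module_has_attr("math", "sqrt"))             # True
print(module_has_attr("math", "complete_square"))  # False

# Problem 50: x^2 + 8x + y^2 - 6y = 0, so d = 8, e = -6, f = 0
print(circle_radius_squared(8, -6, 0))  # 25, i.e. radius 5
```

The result agrees with the natural-language part of the model's answer above (radius 5), so in this example only the generated program is faulty, not the reasoning.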

hengck23 commented 1 month ago

The Python code could be hallucinated. That would explain the function that cannot be found.

Maybe check this: https://github.com/QwenLM/Qwen-Agent/blob/a9ef165971d37ac37ec1df9e565ca46a38b0afab/examples/tir_math.py

# We use the following two systems to distinguish between COT mode and TIR mode
TIR_SYSTEM = """Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."""
COT_SYSTEM = """Please reason step by step, and put your final answer within \\boxed{}."""
            'The dependencies for Python Executor support are not installed. '
            'Please install the required dependencies by running: pip install "qwen-agent[python_executor]"') from e
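
The excerpt above suggests that TIR mode relies on a Python executor and not on the system prompt alone. The following is a rough sketch of what such a generate-execute loop could look like. It is not the Qwen-Agent implementation. `generate` is a hypothetical callable wrapping whatever inference backend is in use, and feeding results back in an `output` block is an assumption about the expected format.

```python
import re
import subprocess
import sys
from typing import Callable, Optional

FENCE = "`" * 3  # built indirectly so this example nests cleanly in markdown
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def extract_last_code(text: str) -> Optional[str]:
    """Return the body of the last python code block in `text`, if any."""
    blocks = CODE_BLOCK.findall(text)
    return blocks[-1] if blocks else None


def run_code(code: str, timeout: float = 10.0) -> str:
    """Run code in a fresh interpreter; return stdout, or the last stderr line.

    Note: a subprocess is not a sandbox. Isolate untrusted model code properly.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError"
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return lines[-1] if lines else "Error"
    return proc.stdout.strip()


def tir_solve(generate: Callable[[str], str], prompt: str, max_rounds: int = 4) -> str:
    """Alternate generation and execution until no new code block appears.

    `generate` (hypothetical) takes the text so far and returns the model's
    continuation, ideally stopping right after a closing code fence.
    """
    text = prompt
    for _ in range(max_rounds):
        continuation = generate(text)
        text += continuation
        code = extract_last_code(continuation)
        if code is None:
            break  # no program to run, so treat this as the final answer
        # Feed the execution result back so the model can continue from it
        text += f"\n{FENCE}output\n{run_code(code)}\n{FENCE}\n"
    return text
```

With a loop like this, a hallucinated call such as `sp.complete_square` would come back to the model as an `AttributeError` line, giving it a chance to correct itself. If the code is never executed, the model has to guess the program's output, which may account for part of the gap between TIR and CoT reported in this thread.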

XinXU-USTC commented 2 weeks ago

The same problem here. For 7B-instruct, I got 77% on GSM8K with TIR and 95.6% with CoT.