MARIO-Math-Reasoning / Super_MARIO


Mathematical reasoning is itself an asymmetric two-player game #4

Closed: hxypqr closed this issue 6 months ago

hxypqr commented 6 months ago

It seems the paper does not explain how the ranking used for pruning in the beam search is implemented, nor where the reward signal comes from. In a Go-style framework, the implementation of the reward signal is fairly critical. Even if only a minimax algorithm is used, one should, from the adversary's perspective, pick the worst case under the prover's current policy. Of course this selection could be done with a heuristic function, but the paper does not show how it is implemented. Also, shouldn't a playout plug the values in and run the whole proof through to the end? Please advise.

lovecambi commented 6 months ago

We didn't formulate our work in the framework of game theory, but rather as a search problem. Our training data consist of the ground truths for math problems. The reward is defined as whether the final answer proposed by the terminal node is correct or not. During inference, the ground truth is not available, so we simply employ the value model to estimate the value of every node, regardless of whether it is a terminal node.
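
Here is a minimal sketch, not the repository's actual code, of the two signals described above: at training time the reward is the binary correctness of the terminal answer against the ground truth, and at inference time a learned value model scores every node instead. The `extract_answer` and `value_model` arguments are hypothetical placeholders supplied by the caller.

```python
def training_reward(terminal_text: str, ground_truth: str, extract_answer) -> float:
    """Reward = 1 if the terminal node's final answer matches the ground truth, else 0."""
    return 1.0 if extract_answer(terminal_text) == ground_truth else 0.0


def inference_value(node_text: str, value_model) -> float:
    """No ground truth at inference time: a value model estimates every node's value."""
    return value_model.score(node_text)  # assumed scoring interface
```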

As for step beam search, you can simply think of it as a greedy algorithm.
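
A rough sketch of step-level beam search read as a greedy procedure (my paraphrase, not the released implementation): at each step, every partial solution is expanded with a few candidate next steps, the candidates are scored by the value model, and only the top-B are kept. `generate_next_steps`, `value_model`, and `is_terminal` are hypothetical callables supplied by the caller.

```python
def step_beam_search(question, generate_next_steps, value_model, is_terminal,
                     beam_size=3, expansions_per_beam=4, max_steps=10):
    beams = [question]  # each beam is a partial solution string
    for _ in range(max_steps):
        candidates = []
        for partial in beams:
            # Sample a few candidate next reasoning steps for this partial solution.
            for step in generate_next_steps(partial, n=expansions_per_beam):
                candidates.append(partial + "\n" + step)
        # Greedy pruning: keep only the highest-valued partial solutions.
        candidates.sort(key=value_model.score, reverse=True)
        beams = candidates[:beam_size]
        if all(is_terminal(b) for b in beams):
            break
    return beams[0]  # highest-valued complete solution
```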

sparsh35 commented 6 months ago

Any ETA on when the code and the evaluation benchmark will be released? I tested the model weights, but they perform worse than the DeepSeek Math RL model, maybe because I didn't have the prompt format you used.

lovecambi commented 6 months ago

> Any ETA on when the code and the evaluation benchmark will be released? I tested the model weights, but they perform worse than the DeepSeek Math RL model, maybe because I didn't have the prompt format you used.

The code for greedy decoding and step-level beam search has been released. We are working on cleaning up the MCTS code; it will be released soon.

The DeepSeek Math RL model is trained on a dataset of 760K annotations, while our approach does not require annotations. In general, the performance of our model without value estimation, e.g., greedy decoding (~53.5%), is worse than the DeepSeek Math RL model. However, with value estimation, e.g., step-level beam search, our approach can easily reach ~62%.