Thank you for your attention.
As you mentioned, and as we also noted in our README, we have identified limitations in our current reward function, which is a significant constraint on the model's capabilities. From a test@k perspective, this has a considerable impact on the final performance.
In addition, we are currently training our reward model. We believe that as the precision of the reward improves, the model's performance will improve as well.
Thank you for your detailed response and for acknowledging the limitations of the current reward function. It's great to know that you are already working on training a reward model to address this issue.
Given the impact of the reward function on the test@k performance, I believe that incorporating task-specific knowledge or logical rules into the reward evaluation could provide a significant boost. For example:

- Introducing global consistency checks to ensure reasoning paths align with task goals.
- Integrating intermediate validation steps for complex tasks, such as math or multi-step reasoning problems, to evaluate sub-path correctness (see the sketch below).

If possible, could you share more about the approach you are using to train the reward model? Are you focusing on supervised learning with labeled paths, or exploring reinforcement learning techniques? I'd love to hear your thoughts on how to balance model flexibility with precision in reward evaluation.
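To make the second point concrete, here is a minimal sketch of what such a composite reward could look like. It is purely illustrative: `token_confidence`, `check_sub_path`, and the weight `alpha` are hypothetical names, not anything from the Marco-o1 codebase.

```python
import math
from typing import Callable, List

def token_confidence(token_logprobs: List[float]) -> float:
    """Average per-token probability of a reasoning path (the local signal)."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def composite_reward(
    token_logprobs: List[float],
    sub_paths: List[str],
    check_sub_path: Callable[[str], bool],
    alpha: float = 0.5,
) -> float:
    """Blend local confidence with a task-specific validity check.

    check_sub_path stands in for any domain rule, e.g. re-evaluating an
    arithmetic step or verifying a constraint stated in the problem.
    """
    confidence = token_confidence(token_logprobs)
    if sub_paths:
        validity = sum(check_sub_path(p) for p in sub_paths) / len(sub_paths)
    else:
        validity = 1.0  # nothing to validate yet
    return alpha * confidence + (1.0 - alpha) * validity
```

The weighting between local confidence and rule-based validity would of course need tuning per task; the point is only that the reward can consume both signals.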
Looking forward to your insights and progress updates!
Initially, we still plan to use ORM + MCTS, since this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, the tree-search data can serve as unsupervised labels for the PRM, so once we have collected enough of it we can train our PRM. Our ultimate goal is MCTS + PRM + RL. I hope this answers your question.
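As a rough illustration of what "tree-search data as unsupervised labels" could mean in practice, here is a minimal sketch that turns outcome-labeled rollouts into Monte-Carlo step labels for a PRM. The data layout and the estimator are assumptions for illustration only, not our actual pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each rollout: (list of reasoning steps, outcome reward from the ORM, 1.0 or 0.0).
Rollout = Tuple[List[str], float]

def prm_labels_from_rollouts(rollouts: List[Rollout]) -> Dict[Tuple[str, ...], float]:
    """Monte-Carlo step labels: for every step prefix seen during tree search,
    estimate the probability that continuing from it leads to a correct answer."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    counts: Dict[Tuple[str, ...], int] = defaultdict(int)
    for steps, outcome in rollouts:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            totals[prefix] += outcome
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}

# Example: two rollouts share a first step; the shared prefix gets label 0.5.
rollouts = [
    (["step A", "step B"], 1.0),
    (["step A", "step C"], 0.0),
]
print(prm_labels_from_rollouts(rollouts))
```

Prefixes that appear in both successful and failed rollouts naturally receive fractional labels, which a PRM trained with a regression or soft cross-entropy objective can consume directly.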
Thank you for the clarification! Your plan of starting with ORM + MCTS and using tree search results as unsupervised labels for PRM training sounds solid. Excited to see how this develops!
The current reward function in Marco-o1's MCTS implementation relies solely on token-level confidence scores derived from the model's output probabilities. While this method provides a straightforward way to evaluate reasoning paths, it has notable limitations:
- **Local Optimality:** Token-level probabilities may lead to paths that seem promising locally but fail to achieve global correctness.
- **Model Bias:** The model's inherent biases might result in overconfidence in certain common patterns, misguiding the search process.
- **Context Insensitivity:** The reward function does not evaluate the logical consistency of the tokens in the broader context of the reasoning path.
- **Lack of Task-Specificity:** The reward function is generic and does not incorporate domain-specific knowledge or logical rules pertinent to the task.
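For reference, below is a simplified sketch of a token-confidence reward of this kind. The per-token softmax over a small top-k candidate set follows the description in the Marco-o1 report, but the exact formula in the repository may differ, so treat this as an approximation rather than the actual implementation.

```python
import math
from typing import List

def path_confidence_reward(topk_logprobs: List[List[float]]) -> float:
    """Confidence-style reward for one rollout.

    topk_logprobs[i] holds the log-probabilities of the chosen token and its
    top-k alternatives at step i (chosen token first). Each step's confidence
    is the softmax weight of the chosen token among those candidates; the
    path reward is the mean confidence over all generated tokens.
    """
    if not topk_logprobs:
        return 0.0
    confidences = []
    for logprobs in topk_logprobs:
        weights = [math.exp(lp) for lp in logprobs]
        confidences.append(weights[0] / sum(weights))
    return sum(confidences) / len(confidences)
```

A reward like this only reflects how sure the model is about its own tokens, which is exactly why the limitations listed above (local optimality, bias, context insensitivity) show up in the search.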