1. A longer chat history can improve the LLM's reasoning ability by providing more context for refinement and critique.
2. The Pairwise Preference Reward Model (PPRM) can provide a more robust reward signal through its contrastive learning objective.
3. A learning-based or consistency-based summarizer can further enhance reasoning ability.
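To make point 2 concrete, here is a minimal sketch of a pairwise preference reward model trained with a contrastive, Bradley-Terry-style objective. This is not the paper's implementation; the linear scorer, synthetic features, and training loop are all illustrative assumptions. The core idea is only that the model learns a scalar score such that the preferred solution in each pair scores higher than the rejected one.

```python
import math
import random

def score(w, x):
    # Scalar "reward" for a solution represented by feature vector x.
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pprm(pairs, dim, lr=0.5, epochs=200):
    """Train on (better, worse) feature pairs with the contrastive
    objective -log sigmoid(score(better) - score(worse))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = score(w, better) - score(w, worse)
            g = -1.0 / (1.0 + math.exp(margin))  # d/dmargin of the loss
            for i in range(dim):
                w[i] -= lr * g * (better[i] - worse[i])
    return w

# Synthetic demo: "better" solutions have a larger first feature.
random.seed(0)
pairs = [([random.random() + 1.0, random.random()],
          [random.random(), random.random()]) for _ in range(50)]
w = train_pprm(pairs, dim=2)
wins = sum(score(w, b) > score(w, c) for b, c in pairs)
print(wins)  # most of the 50 pairs ranked correctly
```

The same pairwise loss appears in standard reward-model training for RLHF; the point here is only that comparing two full solutions tends to give a steadier signal than scoring each one in isolation.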
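For point 3, a consistency-based summarizer can be sketched as self-consistency voting: sample several candidate answers and keep the most frequent one. The `sample_answer` stub below is a hypothetical stand-in for an LLM call, not an API from the paper.

```python
from collections import Counter

def consistency_summarize(sample_answer, question, n_samples=8):
    """Sample n candidate answers and return the majority answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    best, _count = Counter(answers).most_common(1)[0]
    return best

# Deterministic stub standing in for the model: answers "42" three
# times out of every four calls, "41" otherwise.
def sample_answer(question):
    sample_answer.calls += 1
    return "42" if sample_answer.calls % 4 else "41"
sample_answer.calls = 0

result = consistency_summarize(sample_answer, "6 * 7 = ?")
print(result)  # "42" — the majority answer wins
```

Voting only needs answers to be comparable for equality, which is why it works well for math problems with a single final answer.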
Further reading
@article{zhang2024llama,
title={LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning},
author={Zhang, Di and Wu, Jianbo and Lei, Jingdi and Che, Tong and Li, Jiatong and Xie, Tong and Huang, Xiaoshui and Zhang, Shufei and Pavone, Marco and Li, Yuqiang and others},
journal={arXiv preprint arXiv:2410.02884},
year={2024}
}
@article{zhang2024accessing,
title={Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B},
author={Zhang, Di and Li, Jiatong and Huang, Xiaoshui and Zhou, Dongzhan and Li, Yuqiang and Ouyang, Wanli},
journal={arXiv preprint arXiv:2406.07394},
year={2024}
}
@article{qi2024mutualreasoningmakessmaller,
title={Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers},
author={Qi, Zhenting and Ma, Mingyuan and Xu, Jiahang and Zhang, Li Lyna and Yang, Fan and Yang, Mao},
journal={arXiv preprint arXiv:2408.06195},
year={2024}
}