Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking

Official code for "Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking". Also check out our [Project Page].

Training & Inference

FoR formulates multi-step reasoning tasks as a flow and trains the policy in three steps:

Design a reward $R(s_n)$ for the terminal states of each task.
Collect trajectories with the local search technique.
Train the LLM policy $P_{F}$ with the trajectory balance loss.
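
To make the last step concrete, here is a minimal sketch of a trajectory balance objective for a single trajectory. It is an illustration under stated assumptions, not the code in this repository: the function name and the parameters `log_z`, `step_log_probs`, `reward`, and `log_pb_sum` are hypothetical, and the backward-policy term defaults to zero, which assumes every state has exactly one parent. The loss is the squared residual $\left(\log Z + \sum_t \log P_F(s_t \mid s_{t-1}) - \log R(s_n) - \sum_t \log P_B(s_{t-1} \mid s_t)\right)^2$.

```python
import math


def trajectory_balance_loss(log_z, step_log_probs, reward, log_pb_sum=0.0):
    """Squared trajectory-balance residual for one trajectory (illustrative).

    log_z          -- current estimate of the log partition function, log Z
    step_log_probs -- log P_F(s_t | s_{t-1}) for each step of the trajectory
    reward         -- R(s_n) > 0, the reward of the terminal state
    log_pb_sum     -- sum of backward log-probabilities; 0.0 is an assumption
                      that holds when each state has a unique parent
    """
    if reward <= 0:
        raise ValueError("reward must be strictly positive to take its log")
    # Forward flow (log Z plus forward log-probs) minus backward flow
    # (log reward plus backward log-probs).
    residual = log_z + sum(step_log_probs) - math.log(reward) - log_pb_sum
    return residual ** 2


# A two-step trajectory whose flow is perfectly balanced:
# Z * P_F(tau) = 2 * 0.25 = 0.5 = R(s_n), so the loss is (numerically) zero.
balanced = trajectory_balance_loss(
    log_z=math.log(2.0),
    step_log_probs=[math.log(0.5), math.log(0.5)],
    reward=0.5,
)
```

In actual training, `log_z` would typically be a learnable parameter and `step_log_probs` would come from the LLM's token log-likelihoods, with the loss averaged over a batch of collected trajectories.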

Code

1) Clone this repository

git clone https://github.com/Yu-Fangxu/FoR.git

2) Prepare the environment

We recommend conda for setting up a reproducible experiment environment. We include environment.yaml for creating a working environment:

bash install.sh

3) Choose one of the three tasks to run

cd BlocksWorld   # or: cd Game24, or: cd prontoqa

See the more detailed instructions for each task in its own directory.

Citation

@article{yu2024flow,
  title={Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking},
  author={Yu, Fangxu and Jiang, Lai and Kang, Haoqiang and Hao, Shibo and Qin, Lianhui},
  journal={arXiv preprint arXiv:2406.05673},
  year={2024}
}
