Open Fu-Dayuan opened 3 days ago
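Could you provide an algorithm flow? (I'd also suggest adding one to the paper.)
Also, https://github.com/THUDM/WebRL/issues/1 says the SFT data has been released, but the README doesn't mention it. (By the way, will the PPO data be provided?)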
Thank you for your interest in WebRL!
- We provide SFT data in the llama-factory directory.
- Our training process starts with SFT to obtain the initial model, followed by iterative improvement phases. For each phase (e.g., phase t): (1) Use GPT-4o to generate new tasks and filter them, keeping 500 tasks, then perform rollouts on the filtered tasks. (2) Select suitable historical experiences from the successful trajectories of phases 1 through t-1, based on perplexity (ppl). (3) Use our designed algorithm (not PPO) to train the policy. Please refer to the paper for further details.
- The interaction code will be released later this week.
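For readers who want the phases above in code form, here is a minimal sketch of the loop, not the actual WebRL implementation. Every name in it (`run_phases`, `generate_tasks`, `rollout`, `perplexity`, `train`, `ppl_range`) is a hypothetical placeholder, and the perplexity window used to pick historical experiences is an assumption; the thread only says selection is "based on ppl".

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[str, str]]   # sequence of (state, action) pairs
Rollout = Tuple[Trajectory, bool]    # trajectory plus a success flag


def run_phases(
    policy,
    num_phases: int,
    generate_tasks: Callable[[object, int], List[str]],
    rollout: Callable[[object, str], Rollout],
    perplexity: Callable[[object, Trajectory], float],
    train: Callable[[object, List[Rollout], List[Trajectory]], object],
    ppl_range: Tuple[float, float],
    tasks_per_phase: int = 500,
):
    """Alternate sampling and training, starting from an SFT-initialized policy."""
    history: List[Trajectory] = []  # successful trajectories from earlier phases
    low, high = ppl_range           # assumed selection window, not from the thread
    for _ in range(num_phases):
        # (1) Generate and filter new tasks, then roll out the current policy.
        tasks = generate_tasks(policy, tasks_per_phase)
        new_rollouts = [rollout(policy, task) for task in tasks]
        # (2) Pick historical successes whose perplexity falls in the window.
        replay = [tr for tr in history if low <= perplexity(policy, tr) <= high]
        # (3) Update the policy with the custom (non-PPO) algorithm.
        policy = train(policy, new_rollouts, replay)
        # Successful trajectories become history for later phases.
        history.extend(tr for tr, success in new_rollouts if success)
    return policy, history
```

In this sketch the sampling step and the training step are passed in as callables, which mirrors the two-script setup: one script would implement `generate_tasks` and `rollout`, the other `train`.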
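Sorry, I typed PPO by mistake.
So it's really two scripts (one for sampling, one for training) run alternately, but only the training script has been released so far, right? (Roughly how much training data do 500 tasks produce?)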
Yes, sampling and training need to be done alternately. 500 tasks produce roughly 3,000 to 5,000 training samples (state-action pairs).
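As a quick sanity check on those numbers, the average number of state-action pairs per task can be worked out directly. The variable names below are only illustrative.

```python
tasks = 500
low_pairs, high_pairs = 3000, 5000  # range quoted in the reply above

# Average state-action pairs contributed by each task.
per_task = (low_pairs / tasks, high_pairs / tasks)
print(per_task)  # (6.0, 10.0)
```

So each task yields on the order of 6 to 10 state-action pairs on average.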
One more small question: why does the code take only the first 400 here? This hyperparameter doesn't seem to be mentioned anywhere.
Apologies for the confusion. The parameter was originally included to truncate the data size for debugging purposes. We will update the code and remove the truncation operation. Thanks for catching that!
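I'd like to ask whether the workflow is:
(I didn't see any code for the per-phase sampling, and the paper has no algorithm flow either, so there seems to be a gap between the code and the paper.)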