Question about the workflow - GitHub Issues

Fu-Dayuan commented 3 days ago

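I'd like to ask whether the workflow is as follows:

Run SFT and call the resulting model M0.
Sample a large amount of data with M0 and call it dataset a (sampling happens only this once, with no further sampling afterwards) (roughly 400 samples?).
From a, select the data with suitable ppl under M0 and train with PPO; call the result M1 (one run of the multi-node script).
From a, select the data with suitable ppl under M1 and train with PPO; call the result M2 (a second run of the multi-node script).
And so on, for ten runs in total?

(I couldn't find any per-phase sampling code in the repository, and the paper has no algorithm outline either, so there seems to be a gap between the code and the paper.)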

Fu-Dayuan commented 3 days ago

Could you provide an outline of the algorithm? (I'd also suggest adding one to the paper.)

Fu-Dayuan commented 3 days ago

Also, https://github.com/THUDM/WebRL/issues/1 says the SFT data has been released, but the README doesn't mention it. (By the way, will the PPO data be released as well?)

QZH-777 commented 3 days ago

Thank you for your interest in WebRL!

We provide SFT data in the llama-factory directory.
Our training process involves SFT training to obtain the initial model, followed by iterative phases of improvement. For each phase (e.g., phase t): (1) Use GPT-4o to generate new tasks and filter them, keeping 500 tasks, and then perform rollouts on these filtered tasks. (2) Select suitable historical experiences from the successful trajectories of phases 1 through t-1, based on their perplexity (ppl). (3) Use our designed algorithm (not PPO) to train the policy. Please refer to the paper for further details.
The interaction code will be released later this week.
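
The three steps above can be sketched roughly as the loop below. This is only an illustration under stated assumptions, not the WebRL implementation: `Trajectory`, `select_replay`, `run_phases`, the callback parameters, and the `low`/`high` perplexity bounds are all hypothetical names and placeholder values, and the actual training algorithm and filtering thresholds are described in the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str
    steps: list      # list of (state, action) pairs
    success: bool
    phase: int       # phase in which the rollout was collected


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the given tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_replay(history, score_fn, low, high):
    """Keep successful past trajectories whose score lies in [low, high]."""
    return [t for t in history if t.success and low <= score_fn(t) <= high]


def run_phases(num_phases, generate_tasks, rollout, score_fn, train, policy,
               low=1.05, high=2.0):
    """Alternate sampling and training; low/high are placeholder bounds."""
    history = []  # rollouts from phases 1 .. t-1
    for phase in range(1, num_phases + 1):
        # (1) Generate and filter new tasks, then roll out the current policy.
        tasks = generate_tasks(phase)
        fresh = [rollout(policy, task, phase) for task in tasks]
        # (2) Pick historical experiences by perplexity under the current policy.
        replay = select_replay(history, lambda t: score_fn(policy, t), low, high)
        # (3) Train with whatever algorithm the caller supplies.
        policy = train(policy, fresh + replay)
        history.extend(fresh)
    return policy, history
```

Because `history` is extended only after training, the replay selection in phase t sees rollouts from phases 1 through t-1, matching step (2). How fresh rollouts are labeled or filtered before training is left to the caller here.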

Fu-Dayuan commented 3 days ago

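Sorry, I typed PPO by mistake.

So in practice there are two scripts (one for sampling, one for training) that are run alternately, but only the training script has been released so far, is that right? (Also, roughly how much training data do 500 tasks produce?)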

QZH-777 commented 3 days ago

Yes, sampling and training need to be done alternately. 500 tasks produce roughly 3,000 to 5,000 training samples (state-action pairs).
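
As a rough sanity check on those figures, the average number of state-action pairs per task can be worked out directly. The helper name `pairs_per_task` is hypothetical, and the calculation assumes the pairs are spread over all 500 tasks, which the thread does not state.

```python
def pairs_per_task(total_pairs: int, num_tasks: int) -> float:
    """Average number of state-action pairs contributed by one task."""
    return total_pairs / num_tasks


low = pairs_per_task(3000, 500)
high = pairs_per_task(5000, 500)
print(f"about {low:.0f} to {high:.0f} state-action pairs per task on average")
```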

Fu-Dayuan commented 3 days ago

One more small question: why are only the first 400 taken here? This hyperparameter doesn't seem to appear anywhere else.

QZH-777 commented 3 days ago

Apologies for the confusion. The parameter was originally included to truncate the data size for debugging purposes. We will update the code and remove the truncation operation. Thanks for catching that!

THUDM / WebRL

Question about the workflow #4