-
I'd like to ask whether the workflow is the following (a sketch of my understanding follows the list):
0. SFT, call the result M0
1. First use M0 to sample a large batch, call it dataset a (sample only this once, no further sampling afterwards) (roughly 400 samples?)
2. From a, select the samples whose ppl under M0 is suitable, run PPO, and call the trained model M1 (one run of the multi-node script)
3. From a, select the samples whose ppl under M1 is suitable, run PPO, and call the trained model M2 (second run of the multi-node script)
4. And so on for ten runs?
(I didn't see per-phase sampling in the code…
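If I've understood the steps above correctly, the whole loop would look roughly like the sketch below. This is only my reading of the process; `sft`, `sample`, `filter_by_ppl`, and `ppo_train` are hypothetical placeholders, not functions from this repo.

```python
model = sft(base_model)                       # step 0: SFT -> M0
dataset_a = sample(model, n=400)              # step 1: sample once with M0, never again

for phase in range(10):                       # steps 2-4: ten PPO phases
    subset = filter_by_ppl(dataset_a, model)  # keep samples whose ppl under the current model is "suitable"
    model = ppo_train(model, subset)          # one multi-node PPO run -> M1, M2, ...
```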
-
It would help to provide more detailed process documentation for SFT, PPO, and inference, including model configuration, data configuration, etc., for running through the basic pipeline end to end.
-
I see that the agent loaded for PPO here is trained on-policy, but training directly like that doesn't involve an experience pool. Shouldn't PPO have an experience pool for its N-step updates, i.e. the off-policy part? Where is that reflected here?
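For what it's worth, in most PPO implementations the "experience pool" is just the rollout buffer collected by the current policy, which is reused for a few epochs of minibatch updates and then discarded; the clipped importance ratio corrects for the policy drifting within those epochs. Below is a minimal sketch of that structure, not this repo's actual code; `buffer.get` and `policy.log_prob` are hypothetical helpers.

```python
import torch

def ppo_update(policy, optimizer, buffer, epochs=4, batch_size=64, clip_eps=0.2):
    for _ in range(epochs):
        # reuse the same freshly collected rollouts for several minibatch passes
        for idx in torch.randperm(len(buffer)).split(batch_size):
            states, actions, old_log_probs, advantages = buffer.get(idx)
            new_log_probs = policy.log_prob(states, actions)
            ratio = (new_log_probs - old_log_probs).exp()   # pi_new / pi_old
            surr1 = ratio * advantages
            surr2 = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
            loss = -torch.min(surr1, surr2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    buffer.clear()  # rollouts are thrown away after the update, so PPO stays on-policy
```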
-
1) counter
2) for index in BatchSampler(SubsetRandomSampler(range(self.buffer_capacity)), self.batch_size, True):
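For reference, a small self-contained demo of what that sampler nesting produces: `SubsetRandomSampler` shuffles the buffer indices and `BatchSampler` groups them into lists of length `batch_size`, which are then used to slice minibatches out of the buffer (the capacity and batch size here are made-up values):

```python
from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler

buffer_capacity, batch_size = 8, 3
# drop_last=True discards the final incomplete batch of indices
for index in BatchSampler(SubsetRandomSampler(range(buffer_capacity)), batch_size, True):
    print(index)  # e.g. [5, 0, 7] -- a shuffled list of buffer positions
```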
-
I was running the example script: `examples/scripts/train_ppo_llama.sh`.
Basically, it's PPO on llama3-8b with 8*H100, flash_attn, zero3, gradient_checkpointing, and adam_offload, but it OOMs after some…
-
Hi,
Big fan of this project! I'm trying to train an RL agent on a bunch of large environments at once, and I'm seeing an issue where some linkages are static/immobile when they shouldn't be. Here a…
-
For online training, we may have to ditch the complexities of PPO and use a more basic form of temporal difference learning that does not rely on advantage estimation.
We also need to decide which …
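As a concrete reference point for "a more basic form of temporal difference learning", a TD(0) value update bootstraps from the next state's value and needs no advantage estimate at all. A minimal tabular sketch (the value table and usage below are illustrative, not code from this project):

```python
def td0_update(value, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: bootstrap from V(next_state) instead of an advantage estimate."""
    target = reward + (0.0 if done else gamma * value[next_state])
    value[state] += alpha * (target - value[state])

# illustrative usage with a tabular value function over 5 states
value = {s: 0.0 for s in range(5)}
td0_update(value, state=0, reward=1.0, next_state=1, done=False)
```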
-
Hello, thank you for the code you provided; it has been a great help to me. But I've run into a problem when using PPO. I'm a beginner, and during training I found that after hooking the continuous-action PPO algorithm up to my custom environment, the reward of every episode is exactly the same. The actions output by the network differ, but only by a very small amount. I don't know what went wrong.
-
Hi,
Amazing work here. But the software has moved on, and I would like to make it work again. So far I have:
* Fixed the code to work with the new pyelastica API
* Updated from stable_baselines to…
-
**Describe the bug**
Issue 1: When training the Atom-7Bchat model with PPO, setting `--lora_target_modules ALL \` raises an error, but specifying the module names explicitly does not, e.g. `--lora_target_modules o_proj,up_proj,down_proj,v_proj,k_proj,gate_proj,q_proj \`
![baocuo](https://github…