Open mickel-liu opened 3 months ago
Yes, we haven't fully developed and tested this feature yet. Contributions are welcome.
I'm happy to look into it, but how have you guys been saving models?
Hi @mickel-liu, have you figured this out? I have no choice but to use train_ppo_ray.py for PPO instead of train_ppo.py, because it doesn't OOM during model loading in my configuration. I am looking into ways to save checkpoints during/after training, and was hoping you had delved into this feature as well.
Hi, I did look into the code and found that the checkpoint-saving feature is not yet implemented. But checkpoint saving wasn't actually what I was looking for: I wanted the actual model checkpoints, not the intermediate training states that this repo refers to as checkpoints. So I ended up changing the code on my fork, and now it saves model checkpoints after a preset number of iterations. Here's the code in my fork: https://github.com/mickelliu/OpenRLHF/blob/a7f21aa26ac027fcf30ca1c588e01cf07c67cb6f/openrlhf/trainer/ppo_trainer.py#L428-L442
Regardless of whether the ckpt feature is officially implemented, train_ppo_ray.py will save a model checkpoint at the end of training.
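For anyone reading along, here is a minimal sketch of the general idea of saving a model every fixed number of iterations, plus once at the end. It is not the code from the fork linked above and not OpenRLHF's API: the names should_save, train_with_periodic_saves, save_steps, and save_fn are all hypothetical, and the only assumption taken from this thread is that -1 disables periodic saving.

```python
import os
from typing import Callable, List


def should_save(global_step: int, save_steps: int) -> bool:
    """Return True when a periodic save is due.

    Assumption: a non-positive save_steps (e.g. -1) disables periodic saving.
    """
    return save_steps > 0 and global_step > 0 and global_step % save_steps == 0


def train_with_periodic_saves(
    num_steps: int,
    save_steps: int,
    save_path: str,
    save_fn: Callable[[str], None],
) -> List[str]:
    """Run a placeholder training loop and call save_fn at each save point.

    save_fn is a hypothetical callback that writes the model to the given
    directory; the real trainer would call its own model-saving routine here.
    Returns the list of directories that save_fn was called with.
    """
    saved_dirs = []
    for global_step in range(1, num_steps + 1):
        # ... one training update would run here ...
        if should_save(global_step, save_steps):
            ckpt_dir = os.path.join(save_path, f"step_{global_step}")
            save_fn(ckpt_dir)
            saved_dirs.append(ckpt_dir)

    # Always save once more at the end of training.
    final_dir = os.path.join(save_path, "final")
    save_fn(final_dir)
    saved_dirs.append(final_dir)
    return saved_dirs
```

With num_steps=10 and save_steps=4, save_fn would be called for step_4, step_8, and final. With save_steps=-1, only the final save happens.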
Thanks for the quick reply and for sharing your code! I'm glad to know that saving the trained model is that simple. Although the checkpointing feature would be a great addition, this fix seems to solve my issue.
When I set save_step to anything other than -1, the program raises an exception: https://github.com/OpenLLMAI/OpenRLHF/blob/3c918755faa31ee810f3624a82ba5f7879e4f8d3/openrlhf/trainer/ppo_trainer.py#L378-L385
These three args are indeed not included in train_ppo_ray.py, and I don't see arg.save_path being used. I did see this issue was mentioned in #133; wondering if there's any update.