CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
MIT License

resume_from_checkpoint doesn't work #577

Closed by andrewsiah 1 year ago

andrewsiah commented 1 year ago

🐛 Describe the bug

We're trying to do iterative PPO and want to use the resume_from_checkpoint feature from #482 by @maxreciprocate.

But when I tried to load from the ckpt/best_checkpoint directory, I got a "no pytorch_model.bin" error. When I checked, the best_checkpoint directory indeed doesn't contain a pytorch_model.bin file (image below).

Screenshot 2023-10-25 at 7 39 42 PM

The hf_model subdirectory does have it, so I set resume_from_checkpoint=ckpt/best_checkpoint/hf_model. But then I get the error shown in #482 (image below).

Screenshot 2023-10-25 at 7 42 22 PM

https://wandb.ai/andrew-siah/trlx/runs/i9o2eb0l/logs?workspace=user-andrew-siah


Am I doing something wrong?

I can also verify that a previous run of trlx without using the resume_from_checkpoint feature works fine. So the issue is isolated to resume_from_checkpoint.

https://wandb.ai/andrew-siah/trlx/runs/a31x005e/logs?workspace=user-andrew-siah

Thanks.

Which trlX version are you using?

0.7.0

Additional system and package information

Ubuntu, Python 3.11.4

andrewsiah commented 1 year ago

My fault: the previous runs didn't finish, which is why pytorch_model.bin wasn't there. Fixed now. Thank you.
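
For anyone who runs into the same thing, here is a minimal sketch of a check you could run before setting resume_from_checkpoint. It only looks for the file on disk and does not call trlx at all. The helper name find_weight_files is hypothetical, and the assumption that a finished run leaves pytorch_model.bin either in the checkpoint directory or one level below it (as with hf_model here) is based only on what I saw in this issue.

```python
from pathlib import Path


def find_weight_files(checkpoint_dir, filename="pytorch_model.bin"):
    """List copies of `filename` in checkpoint_dir itself or one subdirectory down.

    Hypothetical helper: it inspects the filesystem only and knows nothing
    about how trlx actually lays out or loads its checkpoints.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        # A missing directory usually means the run never reached a save step.
        return []
    # Look in the directory itself first, then in its immediate subdirectories.
    return sorted(root.glob(filename)) + sorted(root.glob(f"*/{filename}"))


if __name__ == "__main__":
    # Path taken from this issue; adjust it to your own checkpoint location.
    hits = find_weight_files("ckpt/best_checkpoint")
    if hits:
        for path in hits:
            print(f"found weights: {path}")
    else:
        print("no pytorch_model.bin found; the run that wrote this checkpoint may not have finished")
```

An empty result would have pointed me at the unfinished run straight away, instead of at resume_from_checkpoint itself.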