OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Moving model between GPU and CPU #315

Closed kfertakis closed 3 weeks ago

kfertakis commented 3 weeks ago

Hi,

I'm running PPO without Ray on a single device and I'm experimenting with moving some models out of the GPU. Calling `torch.nn.Module.cpu()` on either the actor or the critic model initialised with DeepSpeed does not work, because these models are wrapped together with an optimiser that holds references to the model parameters, which breaks PyTorch's device-moving functionality. Is there any way to move the model without having to destroy and re-instantiate it?
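For context, a minimal sketch of the failing pattern (single GPU, ZeRO stage 2; the model, batch size, and learning rate are placeholders):

```python
import torch
import torch.nn as nn
import deepspeed

model = nn.Linear(1024, 1024).cuda()
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Moving the wrapped module copies the parameter data to host memory,
# but the optimizer (and ZeRO's flattened fp32 buffers) still reference
# the original GPU tensors, so the memory is not released and a later
# optimizer.step() would act on stale references.
engine.module.cpu()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # still far from zero
```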

Thank you,

hijkzzz commented 3 weeks ago

This is a bit complicated; I'm not sure DeepSpeed supports this. I recommend using DeepSpeed's offload feature instead.
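For reference, optimizer offload is a config change; a sketch using the key names from the DeepSpeed ZeRO documentation (batch size and learning rate are placeholders):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Keep the optimizer state (Adam moments, fp32 master weights)
        # in host memory; the parameter update then runs on the CPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```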

kfertakis commented 3 weeks ago

Thanks. DeepSpeed's offload feature only offloads the optimiser to the CPU, if I'm not mistaken. What I'm trying to achieve is to move both the model (with its parameters) and the optimiser from the GPU to the CPU and release the previously allocated GPU memory. Is there a proposed way to approach this? Thanks.
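For a plain PyTorch setup (no DeepSpeed engine wrapping the optimiser), such a move is straightforward; a hypothetical sketch (`move_to_cpu` is not an OpenRLHF or DeepSpeed API):

```python
import torch

def move_to_cpu(model, optimizer):
    # Works for vanilla torch.optim optimizers, whose state tensors are
    # stored per parameter. DeepSpeed's partitioned optimizers flatten
    # parameters into shared buffers, so this pattern does not carry over.
    model.cpu()
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.cpu()
    torch.cuda.empty_cache()  # return the freed blocks to the driver
```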

hijkzzz commented 3 weeks ago

> Thanks. DeepSpeed's offload feature only offloads the optimiser to the CPU, if I'm not mistaken. What I'm trying to achieve is to move both the model (with its parameters) and the optimiser from the GPU to the CPU and release the previously allocated GPU memory. Is there a proposed way to approach this? Thanks.

DeepSpeed supports model weight offloading as well: with ZeRO stage 3 you can set `offload_param` in addition to `offload_optimizer`.
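For completeness, parameter offload requires ZeRO stage 3; a sketch of the relevant config keys (values are placeholders):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Park the partitioned model weights in host memory and fetch
        # them to the GPU only when a layer needs them.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```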