OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Training a Qwen-32B RM with adam_offload & zero3 leads to Runtime Error #343

Open victorShawFan opened 6 days ago

victorShawFan commented 6 days ago

I tried to train a Qwen-32B RM. It would not start with zero3, so I tried adam_offload, which fails with this error:

image

Training script:

image

victorShawFan commented 6 days ago

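I have since found the author's issue elsewhere: https://github.com/microsoft/DeepSpeed/issues/5469 and a related issue: https://github.com/microsoft/DeepSpeed/issues/5538. My current DeepSpeed version is exactly v0.14.2, so I am now trying to downgrade to 0.14.0.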
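For anyone checking whether they are on the same DeepSpeed version, here is a minimal sketch. It assumes that only v0.14.2 (the version mentioned above) is worth flagging; whether that version is really the cause is not confirmed in this thread. The names `SUSPECT_VERSIONS`, `is_suspect` and `installed_deepspeed_version` are hypothetical helpers, not part of DeepSpeed or OpenRLHF.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: only the version named in this thread is flagged.
SUSPECT_VERSIONS = {"0.14.2"}


def is_suspect(installed: str) -> bool:
    """Return True if the version string matches a flagged release."""
    # Drop a local build suffix such as "+cu118" before comparing.
    base = installed.strip().lstrip("v").split("+", 1)[0]
    return base in SUSPECT_VERSIONS


def installed_deepspeed_version():
    """Return the installed deepspeed version, or None if it is not installed."""
    try:
        return version("deepspeed")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_deepspeed_version()
    if found is None:
        print("deepspeed is not installed in this environment")
    elif is_suspect(found):
        print(f"deepspeed {found} matches the version reported in this issue")
    else:
        print(f"deepspeed {found} is not the version reported in this issue")
```

The downgrade being attempted here would be `pip install deepspeed==0.14.0`.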

catqaq commented 5 days ago

We have not tested this setup; RM training usually isn't that resource-intensive. Feel free to keep us posted on your progress~