OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

zero3 training error #312

Closed · karthik-nexusflow closed 3 weeks ago

karthik-nexusflow commented 1 month ago

Using Llama 3 70B across 3 A100 nodes, training fails with:

File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 204, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 723, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (512, 8192), 'ds_shape': (512, 8192)
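
For context: `reset_step()` raises this when parameters that ZeRO-3 gathered during the forward pass are still marked AVAILABLE at the end of the step, and whether that happens tends to depend on the exact DeepSpeed build. A minimal sketch for recording the versions in play before experimenting (nothing here is OpenRLHF-specific):

```python
# Minimal sketch: log the library versions relevant to this failure so
# the error can be correlated with a specific DeepSpeed release.
import deepspeed
import torch
import transformers

print("deepspeed   :", deepspeed.__version__)
print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda)
print("transformers:", transformers.__version__)
```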

hijkzzz commented 1 month ago

Try other DeepSpeed versions.
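
In practice that means pinning a different DeepSpeed release and re-running. A hedged sketch; the `0.13.2` pin below is an illustrative guess, not a confirmed fix, so bisect across releases near the one you have installed:

```python
# Hedged sketch: reinstall a pinned DeepSpeed release, then re-launch
# training. The exact version is an assumption -- bisect nearby releases
# until the "still have inflight params" assertion stops firing.
import subprocess
import sys

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "deepspeed==0.13.2"]
)
```

After each reinstall, make sure all three nodes end up on the same DeepSpeed version before relaunching.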