Closed by karthik-nexusflow 3 weeks ago
Using Llama 3 70B across 3 A100 nodes:
File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 204, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 723, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (512, 8192), 'ds_shape': (512, 8192)
Try other DeepSpeed versions.
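Before switching versions, it may help to record which ones are currently installed, so the failing combination can be noted in the report. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and checking `torch` alongside `deepspeed` is an assumption (the traceback only shows DeepSpeed):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # "deepspeed" comes from the traceback; "torch" is an assumed companion
    # package worth recording when comparing environments.
    for name in ("deepspeed", "torch"):
        print(f"{name}: {installed_version(name)}")
```

Running this on each node before and after a change gives a quick way to confirm that every node is on the same version.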