MooreThreads / Moore-AnimateAnyone

Character Animation (AnimateAnyone, Face Reenactment)
Apache License 2.0

error in saving model with deepspeed #100

Open renrenzsbbb opened 6 months ago

renrenzsbbb commented 6 months ago

Thanks for your great work. I can train the model successfully with your code on A100 GPUs without any error. However, when training on V100 GPUs with DeepSpeed, I get "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data" while saving a checkpoint to disk. My DeepSpeed (accelerate) config is:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
use_cpu: false
```

Can you give me some suggestions? Thanks.

wangxr1999 commented 6 months ago

same problem

johndpope commented 6 months ago

From pasting your comment into ChatGPT-4:

It's great to hear that you were able to successfully train your model using an A100 GPU. However, encountering an error with NCCL operations while training on a V100 GPU with DeepSpeed is indeed a concern that needs to be addressed. The error message you're encountering, "Some NCCL operations have failed or timed out," typically indicates issues related to GPU communication in distributed training setups. Here are some suggestions to troubleshoot and potentially resolve this issue:

  1. NCCL Debugging: Enable NCCL debugging to get more detailed logs. This can be done by setting the environment variable NCCL_DEBUG=INFO or NCCL_DEBUG=WARN. This will provide more insights into what might be going wrong with the NCCL operations (see the sketch at the end of this comment).

  2. Check GPU Compatibility: Ensure that the V100 GPUs are compatible with the version of NCCL and CUDA you are using. Sometimes, specific versions of these libraries have compatibility issues with certain GPU architectures.

  3. DeepSpeed Configuration: Review your DeepSpeed configuration. You are using ZeRO Stage 2, which is quite memory-efficient but can be demanding in terms of inter-GPU communication. You might want to experiment with ZeRO Stage 1 or even disabling ZeRO to see if the problem persists.

  4. Reduce Model Size or Batch Size: Temporarily reduce the size of your model or the batch size to check if it's a memory issue. Over-subscription of GPU memory can sometimes lead to obscure errors.

  5. Update Libraries: Ensure that DeepSpeed, PyTorch, CUDA, and NCCL are all updated to their latest versions. Sometimes, bugs in these libraries can cause such issues, and updates often contain fixes.

  6. Network Communication: If you are training across multiple machines or using a multi-GPU setup, there could be network-related issues. Check the network bandwidth and stability.

  7. File System Load: The error occurring at the checkpoint saving phase could indicate a file system bottleneck. Ensure that the file system can handle the I/O load, especially if you're saving large checkpoints.

  8. Hardware Check: Run some diagnostics on your V100 GPUs to ensure they are functioning correctly. Hardware issues, though rare, can lead to such errors.

  9. Experiment with gradient_accumulation_steps: Sometimes, changing the gradient_accumulation_steps can impact the way memory is handled. Try increasing it to see if it affects the stability of your training.

  10. Consult DeepSpeed Documentation/Community: Check the DeepSpeed documentation for any similar issues or considerations. Also, reaching out to the DeepSpeed community (like GitHub issues or forums) with your specific error can be helpful.

Remember, distributed training, especially with advanced optimizations like ZeRO, can be quite complex, and issues might stem from a combination of factors including software configuration, network setup, and hardware capabilities.
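For point 1 and the timeout behaviour specifically, here is a minimal sketch of how the debug flag and a longer NCCL timeout could be wired into an accelerate-based script. The two-hour timeout and the variable names are illustrative assumptions, not settings from this repo:

```python
import os
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Point 1: ask NCCL for verbose logs before the process group is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Extend the collective timeout (the default is roughly 30 minutes) so a
# slow checkpoint save does not immediately abort training. The 2-hour
# value here is an arbitrary example.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(
    mixed_precision="fp16",
    kwargs_handlers=[ipg_kwargs],
)
```

Raising the timeout only buys time for slow saves; if some ranks never participate in the save at all, the hang will still occur eventually.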

Kebii commented 4 months ago

I'm running into the same problem. Have you solved it?

zhengrchan commented 4 months ago

With DeepSpeed, the model has to be saved from all processes, not only the main process. Simply removing the accelerator.is_main_process check around the saving code should work. Otherwise, the other processes wait for the save to finish and hit the NCCL timeout error after 30 minutes.
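A minimal sketch of what the saving code might look like after that change, assuming an accelerate-based training loop like the one in this repo; the function and variable names (save_checkpoint, save_dir, global_step) are illustrative, not the repo's exact code:

```python
import os

from accelerate import Accelerator


def save_checkpoint(accelerator: Accelerator, save_dir: str, global_step: int) -> None:
    # Under DeepSpeed ZeRO, accelerator.save_state() performs collective
    # operations to gather the sharded optimizer/parameter state, so EVERY
    # rank has to call it. Wrapping it in `if accelerator.is_main_process:`
    # leaves the other ranks stuck at the collective until NCCL times out
    # (the 30-minute hang described above).
    ckpt_path = os.path.join(save_dir, f"checkpoint-{global_step}")
    accelerator.save_state(ckpt_path)
    # Optional barrier so all ranks resume training together after the save.
    accelerator.wait_for_everyone()
```

The key point is keeping the save call collective across all ranks; which rank actually writes which files is handled inside save_state.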

duanjiding commented 4 months ago

Hello, may I ask how much memory your V100s have?