facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Training freezes when using multiple GPUs #29

Open NathanYanJing opened 1 year ago

NathanYanJing commented 1 year ago

Super cool and amazing work!

I am writing to ask for your assistance with an issue I am encountering while training a model using A6000 GPUs. I am using the following command to run my code:

torchrun --nnodes=1 --nproc_per_node=4  train.py --model DiT-XL/2 --data-path training_data --global-batch-size 76 --num-workers 1

The problem I am experiencing is that training appears to freeze for a long period after creating the experiment directory. On occasion, it also throws the following error:

Traceback (most recent call last):
  File "/DiT/DiT/train.py", line 269, in <module>
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
    main(args)
  File "/DiT/DiT/train.py", line 149, in main
    model = DDP(model.to(device), device_ids=[rank])
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 131 params, while rank 1 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/DiT/DiT/train.py", line 269, in <module>
    main(args)
  File "/DiT/DiT/train.py", line 149, in main
    model = DDP(model.to(device), device_ids=[rank])
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [2]: params[0] in this process with sizes [1152, 4, 1, 1] appears not to match sizes of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225631 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225633 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225634 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 225632) of binary: /miniconda3/envs/DiT/bin/python
Traceback (most recent call last):
  File "/miniconda3/envs/DiT/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I have not experienced this problem when training with 1, 2, or 3 GPUs on a single node.

I apologize for my lack of experience in this area, but could you please provide any insights or guidance to help me resolve this issue? Thank you for your assistance.

NathanYanJing commented 1 year ago

Problem solved for now! In case others encounter a similar issue: if you use a single node with multiple GPUs, one hacky workaround is to replace DDP with DataParallel, i.e. from torch.nn import DataParallel as DDP. Alternatively, you can try torch.multiprocessing.set_start_method('spawn', force=True), but you may then need to rewrite the lambda function in the data pipeline to avoid the pickling issue. A sketch of the second option follows.
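Here is a minimal sketch of that spawn-based route, assuming the dataset transform wraps a lambda roughly the way train.py's center-crop step does; under the spawn start method, everything handed to DataLoader workers must be picklable, so the lambda becomes a module-level callable. The helper name center_crop and the image size below are placeholders, not the repo's code:

import torch.multiprocessing as mp
from functools import partial
from torchvision import transforms
from torchvision.transforms import functional as TF

def center_crop(pil_image, image_size):
    # Stand-in for the repo's center-crop helper; a module-level function (or a
    # functools.partial of one) pickles fine, unlike a bare lambda.
    return TF.center_crop(pil_image, image_size)

mp.set_start_method('spawn', force=True)  # call early, before building the DataLoader

image_size = 256  # placeholder; match your --image-size
transform = transforms.Compose([
    transforms.Lambda(partial(center_crop, image_size=image_size)),  # picklable under 'spawn'
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
])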

wpeebles commented 1 year ago

Hi @NathanYanJing. Your torchrun command runs fine for me without any modifications to the code (also using a single-node, multi-GPU training setup). I haven't run across the error you're getting before. Depending on how you're launching the script, you might want to be a little careful with the DDP --> DataParallel change, since that could change the behavior of parts of train.py that rely on distributed ops (in general, I'm not sure DataParallel plays nicely with torch.distributed).
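To make that caution concrete, here is a rough sketch (assumed code, not copied from train.py) of the difference between the two: the DDP path launched by torchrun runs one process per GPU and joins them in a process group, while DataParallel is single-process and never touches torch.distributed.

# Rough sketch, meant to be launched with torchrun so the rank/world-size
# environment variables are set; not the repo's exact code.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                      # what torchrun + train.py expect
rank = dist.get_rank() % torch.cuda.device_count()   # local device for this process
model = DDP(torch.nn.Linear(8, 8).to(rank), device_ids=[rank])

# DataParallel, by contrast, is single-process and ignores the process group:
#     model = torch.nn.DataParallel(torch.nn.Linear(8, 8)).cuda()
# Under torchrun, each launched process would then try to drive every GPU itself,
# and distributed ops elsewhere in the script no longer line up with how the
# work was actually split.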

NathanYanJing commented 1 year ago

Hi @wpeebles, thanks for your reply! Yes, I agree that using torch.distributed is the better choice.

Yes, it seems the problem has somehow come back -- it now hangs at the DataLoader. I am guessing this is probably an NCCL / NVIDIA version issue. Would you mind sharing your NCCL and CUDA versions?
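For reference, a quick way to report those versions from inside the training environment (a small ad-hoc snippet, not part of the repo):

import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())

Launching with NCCL_DEBUG=INFO set in the environment (e.g. NCCL_DEBUG=INFO torchrun ...) usually shows which collective the hang happens in, and temporarily dropping to --num-workers 0 helps separate a DataLoader hang from an NCCL one.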