NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: PyTorch DDP Fails for CorrDiff training before the training starts #523

Closed pgarg7 closed 1 month ago

pgarg7 commented 1 month ago

Version

Modulus 24.04

On which installation method(s) does this occur?

Docker

Describe the issue

During multi-GPU CorrDiff training, PyTorch DDP fails at startup with an error about converting a value to int resulting in an overflow. The network and its weights are fine: I checked each rank's network parameters in the log file. As pointed out on the Slack channel, the issue is most likely due to not specifying the wandb group when initializing on rank 0.
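
For reference, a minimal sketch of how per-rank parameters can be fingerprinted for such a check; the helper below is illustrative, not from the repo:

import hashlib
import torch

def log_param_fingerprint(net: torch.nn.Module, rank: int) -> None:
    # Hash a flattened CPU copy of all parameters so ranks can be diffed in the logs
    flat = torch.cat([p.detach().float().cpu().reshape(-1) for p in net.parameters()])
    digest = hashlib.md5(flat.numpy().tobytes()).hexdigest()
    print(f"[rank {rank}] n_params={flat.numel()} md5={digest}")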

Minimum reproducible example

# Note: dist here is the Modulus DistributedManager for this process
from torch.nn.parallel import DistributedDataParallel

if dist.world_size > 1:
    # Wrapping the network in DDP is where the error below is raised
    ddp = DistributedDataParallel(
        net,
        device_ids=[dist.local_rank],
        broadcast_buffers=True,
        output_device=dist.device,
        find_unused_parameters=dist.find_unused_parameters,
    )
else:
    ddp = net
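
For context, a self-contained sketch of the setup this snippet assumes; the stand-in network and the launch line are illustrative, not taken from the CorrDiff example:

import torch
from torch.nn.parallel import DistributedDataParallel
from modulus.distributed import DistributedManager

# Reads rank/world-size info from the launcher and initializes torch.distributed
DistributedManager.initialize()
dist = DistributedManager()

# Stand-in model; CorrDiff constructs its diffusion network here instead
net = torch.nn.Linear(16, 16).to(dist.device)

if dist.world_size > 1:
    # DDP's constructor immediately runs a collective to verify parameter
    # shapes across ranks, which is where the overflow error surfaces
    net = DistributedDataParallel(net, device_ids=[dist.local_rank])

This runs once per rank, e.g. via torchrun --nproc_per_node=2 repro.py (repro.py being a placeholder filename).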

Relevant log output

Traceback (most recent call last):
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/train_goes.py", line 346, in main
    training_loop_goes.training_loop(
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/training/training_loop_goes.py", line 184, in training_loop
    ddp = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/train_goes.py", line 346, in main
    training_loop_goes.training_loop(
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/training/training_loop_goes.py", line 184, in training_loop
    ddp = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow

Environment details

enroot import docker://nvcr.io/nvidia/modulus/modulus:24.04

mnabian commented 1 month ago

Confirmed that specifying the group argument when initializing wandb fixes this issue.
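
For reference, a minimal sketch of the confirmed workaround: initialize wandb only on rank 0 and pass an explicit group. The project and group names below are placeholders, not taken from the CorrDiff config:

import wandb
from modulus.distributed import DistributedManager

dist = DistributedManager()

# Only rank 0 talks to wandb; passing group explicitly avoids the failure above
if dist.rank == 0:
    wandb.init(
        project="corrdiff",    # placeholder project name
        group="ddp-training",  # explicit group, per the fix above
        name="corrdiff-train",
    )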