NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: PyTorch DDP Fails for CorrDiff training before the training starts #523

Closed pgarg7 closed 1 month ago

pgarg7 commented 1 month ago

Version

Modulus 24.04

On which installation method(s) does this occur?

Docker

Describe the issue

During multi-GPU CorrDiff training, PyTorch DDP fails at startup with an error about converting a value to int resulting in an overflow. The network and its weights are fine: I checked each rank's network parameters in the log file. As pointed out on the Slack channel, the issue is most likely due to not specifying the wandb group when initializing on rank 0.
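
For reference, a minimal sketch of how per-rank parameters can be fingerprinted for such a check; the helper below is illustrative, not from the repo:

import hashlib
import torch

def log_param_fingerprint(net: torch.nn.Module, rank: int) -> None:
    # Hash a flattened CPU copy of all parameters so ranks can be diffed in the logs
    flat = torch.cat([p.detach().float().cpu().reshape(-1) for p in net.parameters()])
    digest = hashlib.md5(flat.numpy().tobytes()).hexdigest()
    print(f"[rank {rank}] n_params={flat.numel()} md5={digest}")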

Minimum reproducible example

# Note: dist here is the Modulus DistributedManager for this process
from torch.nn.parallel import DistributedDataParallel

if dist.world_size > 1:
    # Wrapping the network in DDP is where the error below is raised
    ddp = DistributedDataParallel(
        net,
        device_ids=[dist.local_rank],
        broadcast_buffers=True,
        output_device=dist.device,
        find_unused_parameters=dist.find_unused_parameters,
    )
else:
    ddp = net
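
For context, a self-contained sketch of the setup this snippet assumes; the stand-in network and the launch line are illustrative, not taken from the CorrDiff example:

import torch
from torch.nn.parallel import DistributedDataParallel
from modulus.distributed import DistributedManager

# Reads rank/world-size info from the launcher and initializes torch.distributed
DistributedManager.initialize()
dist = DistributedManager()

# Stand-in model; CorrDiff constructs its diffusion network here instead
net = torch.nn.Linear(16, 16).to(dist.device)

if dist.world_size > 1:
    # DDP's constructor immediately runs a collective to verify parameter
    # shapes across ranks, which is where the overflow error surfaces
    net = DistributedDataParallel(net, device_ids=[dist.local_rank])

This runs once per rank, e.g. via torchrun --nproc_per_node=2 repro.py (repro.py being a placeholder filename).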

Relevant log output

Traceback (most recent call last):
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/train_goes.py", line 346, in main
    training_loop_goes.training_loop(
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/training/training_loop_goes.py", line 184, in training_loop
    ddp = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/train_goes.py", line 346, in main
    training_loop_goes.training_loop(
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/training/training_loop_goes.py", line 184, in training_loop
    ddp = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow

Environment details

enroot import docker://nvcr.io/nvidia/modulus/modulus:24.04

mnabian commented 1 month ago

Confirmed that specifying the group argument when initializing wandb fixes this issue.
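
For reference, a minimal sketch of the confirmed workaround: initialize wandb only on rank 0 and pass an explicit group. The project and group names below are placeholders, not taken from the CorrDiff config:

import wandb
from modulus.distributed import DistributedManager

dist = DistributedManager()

# Only rank 0 talks to wandb; passing group explicitly avoids the failure above
if dist.rank == 0:
    wandb.init(
        project="corrdiff",    # placeholder project name
        group="ddp-training",  # explicit group, per the fix above
        name="corrdiff-train",
    )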