During multi-GPU training of CorrDiff, PyTorch DDP fails while constructing DistributedDataParallel with an error that a value cannot be converted to type int without overflow. The network and its weights appear to be fine: I checked each rank's network parameters in the log file. As pointed out on the Slack channel, the most likely cause is that the wandb group is not specified when wandb is initialized on rank 0.
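For reference, the suspected fix could look roughly like the sketch below. It only illustrates passing an explicit group to wandb.init on rank 0; whether this resolves the overflow is not confirmed here. The helper names, the project and run names, and the fallback to the WANDB_RUN_GROUP environment variable are assumptions made for illustration and are not taken from the CorrDiff code.

```python
import os


def wandb_init_kwargs(project, run_name, rank, group=None):
    """Build keyword arguments for wandb.init with an explicit group.

    Hypothetical helper: if no group is given, fall back to the
    WANDB_RUN_GROUP environment variable, then to the run name, so the
    group is never left unspecified.
    """
    group = group or os.environ.get("WANDB_RUN_GROUP") or run_name
    return {
        "project": project,
        "group": group,
        "name": f"{run_name}-rank{rank}",
    }


def init_wandb_on_rank_zero(project, run_name, rank, group=None):
    """Initialize wandb on rank 0 only; other ranks return None."""
    if rank != 0:
        return None
    import wandb  # imported lazily so non-zero ranks do not need it

    return wandb.init(**wandb_init_kwargs(project, run_name, rank, group))
```

In the training script this would be called as something like init_wandb_on_rank_zero("corrdiff", "goes", dist.rank, group="my-experiment") before the DDP wrapper is built, where the string arguments are placeholders.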
Minimum reproducible example
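from torch.nn.parallel import DistributedDataParallel

# dist: Modulus DistributedManager instance (assumed); net: the model being trained
if dist.world_size > 1:
    # Wrap the model so gradients are synchronized across ranks
    ddp = DistributedDataParallel(
        net,
        device_ids=[dist.local_rank],
        broadcast_buffers=True,
        output_device=dist.device,
        find_unused_parameters=dist.find_unused_parameters,
    )
else:
    # Single-process run: use the bare model
    ddp = net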
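Since the failure occurs inside DDP's cross-rank parameter verification, it can help to log a compact fingerprint of the parameter names and shapes on every rank just before the wrapper is built. The following is a minimal sketch of such a check; param_signature is a hypothetical helper, not part of Modulus or PyTorch, and it compares names and shapes only, not values.

```python
import hashlib


def param_signature(named_shapes):
    """Return (parameter count, short digest) for (name, shape) pairs.

    Identical names and shapes in the same order give the same digest,
    so mismatched ranks stand out when the log lines are compared.
    """
    digest = hashlib.sha256()
    count = 0
    for name, shape in named_shapes:
        digest.update(f"{name}:{tuple(shape)};".encode())
        count += 1
    return count, digest.hexdigest()[:16]
```

With a PyTorch module this could be used as param_signature((n, p.shape) for n, p in net.named_parameters()), printing the result together with the rank; every rank should report the same pair.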
Relevant log output
Traceback (most recent call last):
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/train_goes.py", line 346, in main
    training_loop_goes.training_loop(
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/training/training_loop_goes.py", line 184, in training_loop
    ddp = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/train_goes.py", line 346, in main
    training_loop_goes.training_loop(
  File "/lustre/fs4/portfolios/coreai/users/piyushg/modulus/examples/generative/corrdiff/training/training_loop_goes.py", line 184, in training_loop
    ddp = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
Version
Modulus 24.04
On which installation method(s) does this occur?
Docker
Environment details