Open jkhansell opened 11 months ago
@jkhansell I'm not sure if the issue is stemming from the conda issues at the start of the log or something else. Can you try running a very simple script with the same distributed config as your job above to see if this minimal example works?
from modulus.sym.distributed import DistributedManager
DistributedManager.initialize()
manager = DistributedManager()
print(f"rank: {manager.rank} of {manager.world_size}, "
f"initialization method: {manager._initialization_method}")
I am also facing the same issue when using the sequential solver.
@leolalson Thanks. Can you run the small script above and share the log here as well to help debug this issue?
Below is the error with 2 GPUs.
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gpu07]:12355 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:12355 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gpu07]:12355 (errno: 97 - Address family not supported by protocol).
/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Error executing job with overrides: []
Error executing job with overrides: []
Traceback (most recent call last):
File "/glb/data/ptxd_dash/inlvi6/ffd_modulus/multigpu/taylor_green3/taylor_green.py", line 162, in run
slv.solve()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
self._train_loop(sigterm_handler)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 543, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
constraint.forward()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
File "/glb/data/ptxd_dash/inlvi6/ffd_modulus/multigpu/taylor_green3/taylor_green.py", line 162, in run
slv.solve()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
self._train_loop(sigterm_handler)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 543, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
constraint.forward()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
srun: error: gpu07: tasks 0-1: Exited with exit code 1
@akshaysubr Any update on this issue?
@leolalson Looks like the issue is that certain parameters of the model are not involved in the loss computation. For example, the surface_pde example uses the Poisson equation, and because of that some linear parts of the model don't actually affect the loss. You can try avoiding this issue by setting find_unused_parameters: True in the config, as in the surface_pde example: https://github.com/NVIDIA/modulus-sym/blob/f59eba4d852a65cc80f703da754a87e51ba44d9d/examples/surface_pde/sphere/conf/config.yaml#L27
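For anyone wondering what that config key does: as far as I understand, it corresponds to the find_unused_parameters option of PyTorch's DistributedDataParallel wrapper, which Modulus Sym applies to the model for multi-GPU runs. Below is a minimal, purely illustrative sketch of that PyTorch-level option (not the Modulus Sym wrapping code itself; it uses a throwaway single-process gloo group and a placeholder model just so it runs standalone). In Modulus Sym you only need the Hydra config entry.
# Illustrative sketch only: single-process gloo group so the snippet runs standalone.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)  # placeholder model, not the Modulus architecture
ddp_model = DDP(
    model,
    # With find_unused_parameters=True, DDP walks the autograd graph each iteration
    # and marks parameters that received no gradient as ready, instead of raising the
    # "Expected to have finished reduction" error seen in the log above.
    find_unused_parameters=True,
)

dist.destroy_process_group()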
Version
1.2.0
On which installation method(s) does this occur?
Pip
Describe the issue
While executing the taylor_green.py example with the SLURM srun command, the solver breaks. I'm running taylor_green.py in parallel on 4 NVIDIA V100 16 GB GPUs.
Minimum reproducible example
Relevant log output
Environment details
Other/Misc.
No response