Open jkhansell opened 11 months ago
@jkhansell I'm not sure if the issue is stemming from the conda issues at the start of the log or something else. Can you try running a very simple script with the same distributed config as your job above to see if this minimal example works?
from modulus.sym.distributed import DistributedManager
DistributedManager.initialize()
manager = DistributedManager()
print(f"rank: {manager.rank} of {manager.world_size}, "
f"initialization method: {manager._initialization_method}")
I am also facing the same issue when using the sequential solver.
@leolalson Thanks. Can you run the small script above and share the log here as well to help debug this issue?
Below is the error with 2 GPUs.
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gpu07]:12355 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:12355 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gpu07]:12355 (errno: 97 - Address family not supported by protocol).
/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Error executing job with overrides: []
Error executing job with overrides: []
Traceback (most recent call last):
File "/glb/data/ptxd_dash/inlvi6/ffd_modulus/multigpu/taylor_green3/taylor_green.py", line 162, in run
slv.solve()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
self._train_loop(sigterm_handler)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 543, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
constraint.forward()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
File "/glb/data/ptxd_dash/inlvi6/ffd_modulus/multigpu/taylor_green3/taylor_green.py", line 162, in run
slv.solve()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
self._train_loop(sigterm_handler)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 543, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
constraint.forward()
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
srun: error: gpu07: tasks 0-1: Exited with exit code 1
@akshaysubr Any update on this issue?
@leolalson Looks like the issue is that certain parameters of the model are not involved in the loss computation. For example, the surface_pde example uses the Poisson equation, and because of that some linear parts of the model don't actually affect the loss. You can try avoiding this issue by setting find_unused_parameters: True in the config, as in the surface_pde example: https://github.com/NVIDIA/modulus-sym/blob/f59eba4d852a65cc80f703da754a87e51ba44d9d/examples/surface_pde/sphere/conf/config.yaml#L27
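For anyone wondering what that config key does: as far as I understand, it corresponds to the find_unused_parameters option of PyTorch's DistributedDataParallel wrapper, which Modulus Sym applies to the model for multi-GPU runs. Below is a minimal, purely illustrative sketch of that PyTorch-level option (not the Modulus Sym wrapping code itself; it uses a throwaway single-process gloo group and a placeholder model just so it runs standalone). In Modulus Sym you only need the Hydra config entry.
# Illustrative sketch only: single-process gloo group so the snippet runs standalone.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)  # placeholder model, not the Modulus architecture
ddp_model = DDP(
    model,
    # With find_unused_parameters=True, DDP walks the autograd graph each iteration
    # and marks parameters that received no gradient as ready, instead of raising the
    # "Expected to have finished reduction" error seen in the log above.
    find_unused_parameters=True,
)

dist.destroy_process_group()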
Version
1.2.0
On which installation method(s) does this occur?
Pip
Describe the issue
While executing the taylor_green.py example with the SLURM srun command, the solver breaks. I'm running taylor_green.py in parallel on 4 NVIDIA V100 16 GB GPUs.
Minimum reproducible example
Relevant log output
Environment details
Other/Misc.
No response