NVIDIA / modulus-sym

Framework providing Pythonic APIs, algorithms, and utilities to be used with Modulus core to physics-inform model training, as well as higher-level abstractions for domain experts
https://developer.nvidia.com/modulus
Apache License 2.0

šŸ›[BUG]: SequentialSolver breaking when executed in parallel #81

Open jkhansell opened 11 months ago

jkhansell commented 11 months ago

Version

1.2.0

On which installation method(s) does this occur?

Pip

Describe the issue

While executing the taylor_green.py example in parallel via the SLURM command srun, the solver breaks. I'm running the example on 4 NVIDIA V100 16 GB GPUs.

Minimum reproducible example

srun --ntasks-per-node 4 python3 taylor_green.py

Relevant log output

rm: cannot remove './outputs': No such file or directory

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [jwc09n000i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:7010 (errno: 97 - Address family not supported by protocol).
(the client socket warning above is repeated once per rank)
/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
(the UserWarning above is emitted by each of the four ranks)
Error executing job with overrides: []
Traceback (most recent call last):
  File "/p/project/rugshas/villalobos1/PINN/modulus-sym/examples/taylor_green/taylor_green.py", line 166, in <module>
    run()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/hydra/utils.py", line 104, in func_decorated
    _run_hydra(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/p/project/rugshas/villalobos1/PINN/modulus-sym/examples/taylor_green/taylor_green.py", line 162, in run
    slv.solve()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
    self._train_loop(sigterm_handler)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/trainer.py", line 535, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/trainer.py", line 716, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/trainer.py", line 68, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
    return self.domain.compute_losses(step)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
    constraint.forward()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
    self._output_vars = self.model(self._input_vars)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
(ranks 1, 2, and 3 fail with the same RuntimeError, the same traceback, and the same parameter indices; their output is interleaved with the trace above)
srun: error: jwc09n000: tasks 0-3: Exited with exit code 1

Environment details

I'm running on the JUWELS supercomputer at FZJ, with the 1.2.0 Modulus pip installation.

Other/Misc.

No response

akshaysubr commented 8 months ago

@jkhansell I'm not sure whether the issue stems from the conda problems at the start of the log or from something else. Can you try running a very simple script with the same distributed config as your job above to see if this minimal example works?

from modulus.sym.distributed import DistributedManager

DistributedManager.initialize()
manager = DistributedManager()
print(f"rank: {manager.rank} of {manager.world_size}, "
      f"initialization method: {manager._initialization_method}")
leolalson commented 7 months ago

I am also facing the same issue when using the sequential solver.

akshaysubr commented 7 months ago

@leolalson Thanks. Can you run the small script above and share the log here as well to help debug this issue?

leolalson commented 7 months ago

Below is the error with 2 GPUs.

[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gpu07]:12355 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:12355 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gpu07]:12355 (errno: 97 - Address family not supported by protocol).
/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
(the UserWarning above is emitted by both ranks)
Error executing job with overrides: []
Error executing job with overrides: []
Traceback (most recent call last):
  File "/glb/data/ptxd_dash/inlvi6/ffd_modulus/multigpu/taylor_green3/taylor_green.py", line 162, in run
    slv.solve()
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
    self._train_loop(sigterm_handler)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 543, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
    return self.domain.compute_losses(step)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
    constraint.forward()
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
    self._output_vars = self.model(self._input_vars)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/glb/data/ptxd_dash/inlvi6/apps/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(rank 0 fails with the same RuntimeError, traceback, and parameter indices)
srun: error: gpu07: tasks 0-1: Exited with exit code 1

leolalson commented 6 months ago

@akshaysubr Any update on this issue?

akshaysubr commented 5 months ago

@leolalson It looks like the issue is that certain parameters of the model are not involved in the loss computation. For example, the surface_pde example uses the Poisson equation, and because of that some linear parts of the model don't actually impact the loss function. You can try to avoid this issue by setting find_unused_parameters: True in the config, as in the surface_pde example: https://github.com/NVIDIA/modulus-sym/blob/f59eba4d852a65cc80f703da754a87e51ba44d9d/examples/surface_pde/sphere/conf/config.yaml#L27
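
For reference, a minimal sketch of that change for the taylor_green example, assuming the key sits at the top level of the example's conf/config.yaml in the same way as in the linked surface_pde config (everything else in the file stays as shipped with the example):

# conf/config.yaml (only the added key is shown)
find_unused_parameters: True

This is the same flag the RuntimeError above suggests passing to torch.nn.parallel.DistributedDataParallel; enabling it lets DDP skip the parameters that receive no gradient on a given rank, at the cost of some extra per-iteration overhead.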