Bug description
I am launching a script that trains a model using Fabric. Training works fine with gradient checkpointing but without DDP, and also with DDP but without gradient checkpointing. However, when both DDP and gradient checkpointing are enabled (the latter activated through Hugging Face's gradient_checkpointing_enable() function), I get the following error:
[rank0]: File "/home/.../v2/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: RuntimeError: expect_autograd_hooks_ INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/reducer.cpp":1591, please report a bug to PyTorch.
When I launch with strategy=DDPStrategy(process_group_backend="nccl", find_unused_parameters=True, static_graph=False), I get this error instead:
[rank0]: Parameter at index 560 with name reader.decoder.transformer.h.11.mlp.c_proj.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
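For context, the setup is roughly equivalent to the minimal sketch below. This is not the actual script: gpt2 stands in for the real model (the parameter names in the error come from a GPT-2-style decoder), and the dummy batch and single training step are placeholders.

```python
# Minimal sketch of the failing combination: Fabric + DDP + HF gradient checkpointing.
# "gpt2", the batch shape, and the loop are assumptions, not the original code.
import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import DDPStrategy
from transformers import AutoModelForCausalLM

fabric = Fabric(
    accelerator="cuda",
    devices=2,
    # Same failure mode with the second configuration from above:
    # DDPStrategy(process_group_backend="nccl", find_unused_parameters=True, static_graph=False)
    strategy=DDPStrategy(process_group_backend="nccl"),
)
fabric.launch()

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()  # Hugging Face gradient checkpointing

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = fabric.setup(model, optimizer)

# A single dummy step is enough: the failure happens in the backward pass.
input_ids = torch.randint(0, model.config.vocab_size, (2, 128), device=fabric.device)
out = model(input_ids=input_ids, labels=input_ids)
fabric.backward(out.loss)  # raises the RuntimeError / "marked as ready twice" errors above
optimizer.step()
optimizer.zero_grad()
```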
Thanks in advance for your help.
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
See the tracebacks quoted in the bug description above.
Environment
Current environment
```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```