Open · Eric-Lin-CVTE opened this issue 4 months ago
Hey @Eric-Lin-CVTE
Thank you for sharing the code. Could you help make the code more minimal so that the issue can be reproduced? Right now the code imports from a lib package.
A good way to narrow down where the issue is, is to disable individual features and check whether they are involved. For example, disable validation to see whether the slowdown comes from validation, or disable checkpointing, and so on.
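For instance, with a Lightning 2.x Trainer those features can be switched off directly via flags. This is only a minimal sketch; `MyLightningModule` and `MyDataModule` are placeholders standing in for your own classes:

```python
import lightning as L

# Placeholders for your own module and datamodule; substitute your real classes.
model = MyLightningModule()
datamodule = MyDataModule()

# Turn individual features off one at a time to see which one (if any)
# is responsible for the growing epoch time.
trainer = L.Trainer(
    max_epochs=5,
    limit_val_batches=0,         # disable validation entirely
    enable_checkpointing=False,  # disable checkpoint saving
    logger=False,                # disable logging as well, if needed
)
trainer.fit(model, datamodule=datamodule)
```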
I encountered a similar issue. After investigating, I found that the time spent in the model's forward pass and optimizer step grows rapidly as the number of epochs increases. Further analysis showed that PyTorch Lightning's automatic zero-grad was not taking effect: even without gradient accumulation, parameter gradients were still present before each forward pass. Manually implementing the zeroing, forward, backward, and step in training_step resolved the issue for me.
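In case it helps anyone else, here is a minimal sketch of that workaround using Lightning's manual optimization. The linear model and MSE loss are placeholders, not the original code:

```python
import torch
import torch.nn as nn
import lightning as L


class ManualOptimModule(L.LightningModule):
    """Minimal example of manually zeroing grads, forward, backward, and step."""

    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # take control of the optimization loop
        self.model = nn.Linear(32, 1)        # placeholder model

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()                      # explicitly clear grads before the forward pass

        x, y = batch
        loss = nn.functional.mse_loss(self.model(x), y)

        self.manual_backward(loss)           # backward handled through Lightning
        opt.step()
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```

With `automatic_optimization = False`, Lightning no longer calls zero_grad/backward/step for you, so any stale gradients left by the automatic path cannot accumulate across steps.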
Bug description
When I run the following code, the training time per epoch increases from one epoch to the next. For example, the first epoch takes 3:39 min, the second one takes 4:21 min, the third one takes 5:46 min, and so on. I don't know why. My code is below. The version of lightning I used is 2.3.1.
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
Environment
Current environment
```
#- PyTorch Lightning Version (e.g., 1.5.0):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
```
More info
No response
cc @borda