kirilllzaitsev opened this issue 10 months ago
@kirilllzaitsev Can you be more specific with what you mean by "stuck"?
MNIST is very easy to optimize for this classifier. After the first epoch we already get a > 90% accuracy on the validation set:
Epoch 1: 100%|██████████| 860/860 [00:08<00:00, 97.71it/s, v_num=54, train_loss=0.242, val_loss=0.288, val_acc=0.913]
At epoch 8 I get 96%:
Epoch 8: 100%|██████████| 860/860 [00:09<00:00, 95.09it/s, v_num=54, train_loss=0.025, val_loss=0.122, val_acc=0.964]
I don't see any issue with this code; can you point it out, please?
torch 2.1.1, lightning 2.1.3, torchmetrics 0.7.3
This is what I mean by "stuck": Please note that this happens only when using the Trainer, while the standard training loop is fine.
Replacing all `lightning.pytorch` imports with `pytorch_lightning` worked. The same code that produced the plots above now gives the following, with the imports being the only change:
However, this didn't fix the other use case I'm working on (which I mentioned in the question) with the same problem.
@kirilllzaitsev Did you find out anything else? I don't know how switching the imports would have fixed anything here. Are the two packages (lightning and pytorch-lightning) the same version?
Both packages are at 2.1.3. Reinstalling `lightning` does not help.
Bug description
I'm referring to the official MNIST example from the 1.5.0 docs, which, when gathered and tweaked for 2.1.3 (also with RichProgressBar), goes as follows:
What version are you seeing the problem on?
v2.1
How to reproduce the bug
Both `train_loss` and `val_loss` are stuck at their initial values, and remain unaffected by learning-rate changes, an overfitting setup (a single training sample), etc.

And this is the vanilla replacement for the Trainer that makes the model work:
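A sketch of what such a plain-PyTorch replacement loop typically looks like (hypothetical; the function name and loader/model variables are assumptions, not the reporter's exact code):

```python
import torch
import torch.nn.functional as F


def train_plain(model, train_loader, epochs=8, lr=1e-3, device="cpu"):
    """Minimal hand-written training loop standing in for the Trainer."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()  # backprop + parameter update each batch
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: train_loss={total / len(train_loader):.4f}")
```

With the same model and data, a loop like this is the baseline against which the stuck Trainer run is being compared.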
producing:
with a minor extension to the module:
I experienced the issue in a completely different setup, which is what made me go and try the MNIST one. I wonder what the cause could be?