Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Version incompatibility when upgrading from v1.3.x to v2.4 #20308

Open sunhan3787 opened 1 month ago

sunhan3787 commented 1 month ago

Bug description

lightning_fabric.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: ['train_loss'...]. Condition can be set using monitor key in lr scheduler dict

What version are you seeing the problem on?

v2.4

How to reproduce the bug

https://github.com/jiaor17/DiffCSP
I'm trying to run this work with pl v2.4.0. I fixed the ordinary migration problems after reading your Lightning documentation.
Is modifying my config file the only way to fix this problem?

Error messages and logs


```
File "/data/coding/DiffCSP/diffcsp/run.py", line 181, in main
    run(cfg)
File "/data/coding/DiffCSP/diffcsp/run.py", line 168, in run
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt)
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in run
    self.on_advance_end()
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 386, in on_advance_end
    self.epoch_loop.update_lr_schedulers("epoch", update_plateau_schedulers=not self.restarting)
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 349, in update_lr_schedulers
    self._update_learning_rates(interval=interval, update_plateau_schedulers=update_plateau_schedulers)
File "/data/miniconda/envs/torch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 384, in _update_learning_rates
    raise MisconfigurationException(
lightning_fabric.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: ['train_loss', 'train_loss_step', 'lattice_loss', 'lattice_loss_step', 'coord_loss', 'coord_loss_step', 'train_loss_epoch', 'lattice_loss_epoch', 'coord_loss_epoch']. Condition can be set using monitor key in lr scheduler dict
```

Environment

Current environment

```
#- PyTorch Lightning Version (e.g., 2.4.0): 2.4.0
#- PyTorch Version (e.g., 2.4): 2.3.0
#- Python version (e.g., 3.12): 3.10
#- OS (e.g., Linux): Ubuntu 22.04.4 LTS
#- CUDA/cuDNN version: CUDA 12.1
#- GPU models and configuration: RTX 3060
#- How you installed Lightning (`conda`, `pip`, source): pip
```

More info

I'm sure the code is OK. If I want to fix this problem in the simplest way, is changing the config file the best approach?

sunhan3787 commented 1 month ago

I changed the config from val_loss to train_loss and it works, but I'm not sure how this will affect my training results.
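For context, a plain-PyTorch sketch (not the DiffCSP code) of what that swap changes: ReduceLROnPlateau only reacts to the number passed to `step()`, and in Lightning that number is whatever logged metric the `monitor` key names. Monitoring train_loss therefore drops the learning rate when the training loss plateaus, even if validation performance is still improving, so it is a behavioral change rather than a cosmetic one.

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# patience=0: reduce the lr on the first step where the metric fails to improve.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="min", factor=0.1, patience=0
)

# The scheduler sees only the monitored value, nothing about the model.
for metric in [1.0, 1.0, 1.0]:  # a plateauing loss
    sched.step(metric)

# The lr has been reduced purely because the monitored metric stalled.
print(opt.param_groups[0]["lr"])
```

So the swap "works" in the sense that the exception goes away, but the lr schedule is now driven by training-set behavior; logging val_loss (or pointing `monitor` at an epoch-level metric that does exist) keeps the original intent.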