Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.38k stars 3.38k forks

trainer.tune() fails when Trainer.__init__(auto_lr_find=True, auto_scale_batch_size=True) #5374

Closed vineetk1 closed 1 year ago

vineetk1 commented 3 years ago

🐛 Bug

trainer.tune() works just fine when either Trainer.__init__(auto_lr_find=False, auto_scale_batch_size=True) or Trainer.__init__(auto_lr_find=True, auto_scale_batch_size=False) is used. However, trainer.tune() fails when Trainer.__init__(auto_lr_find=True, auto_scale_batch_size=True):

LR finder stopped early due to diverging loss.

INFO   lr_finder.py:186:lr_find(): LR finder stopped early due to diverging loss.
/home/vin/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: You're resuming from a checkpoint that ended mid-epoch. This can cause unreliable results if further training is done, consider using an end of epoch checkpoint. 
  warnings.warn(*args, **kwargs)
Failed to compute suggesting for `lr`. There might not be enough points.
Traceback (most recent call last):
  File "/home/vin/.local/lib/python3.8/site-packages/pytorch_lightning/tuner/lr_finder.py", line 353, in suggestion
    min_grad = np.gradient(loss).argmin()
  File "<__array_function__ internals>", line 5, in gradient
  File "/home/vin/.local/lib/python3.8/site-packages/numpy/lib/function_base.py", line 1052, in gradient
    raise ValueError(
ValueError: Shape of array too small to calculate a numerical gradient, at least (edge_order + 1) elements are required.
ERROR  lr_finder.py:357:suggestion(): Failed to compute suggesting for `lr`. There might not be enough points.
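
For context on the traceback: the LR finder appears to have stopped so early that it recorded fewer loss points than np.gradient needs (at least edge_order + 1, i.e. two with the default edge_order=1). A minimal illustration of that numpy error, independent of Lightning:

    import numpy as np

    # np.gradient with the default edge_order=1 needs at least 2 samples; a
    # single recorded loss value reproduces the ValueError from the traceback.
    np.gradient(np.array([1.0]))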

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post here

In your own environment, trainer.tune() should fail when Trainer.__init__(auto_lr_find=True, auto_scale_batch_size=True). However, if you want to reproduce the bug from my code, go to GitHub and fork https://github.com/vineetk1/conversational-transaction-bot, then run the following on the command line: python3 ctbMain.py input_param_files/distilgpt2_params
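
Since the template asks for a BoringModel, here is a minimal, self-contained sketch of what should trigger the failure (assuming PyTorch Lightning ~1.1; the dataset shape, hyperparameter names, and default values below are illustrative, not taken from the original report):

    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl

    class RandomDataset(Dataset):
        # 64 random 32-dimensional vectors, in the spirit of the standard BoringModel
        def __init__(self, size=32, length=64):
            self.data = torch.randn(length, size)
        def __len__(self):
            return len(self.data)
        def __getitem__(self, idx):
            return self.data[idx]

    class BoringModel(pl.LightningModule):
        def __init__(self, learning_rate=1e-3, batch_size=2):
            super().__init__()
            self.save_hyperparameters()
            self.layer = torch.nn.Linear(32, 2)
        def forward(self, x):
            return self.layer(x)
        def training_step(self, batch, batch_idx):
            output = self(batch)
            return torch.nn.functional.mse_loss(output, torch.zeros_like(output))
        def train_dataloader(self):
            return DataLoader(RandomDataset(), batch_size=self.hparams.batch_size)
        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=self.hparams.learning_rate)

    model = BoringModel()
    # Enabling both tuners at once is what reportedly breaks; either one alone works.
    trainer = pl.Trainer(max_epochs=1, auto_lr_find=True, auto_scale_batch_size=True)
    trainer.tune(model)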

Expected behavior

trainer.tune() should find both the batch size and the initial learning rate.
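
Continuing the BoringModel sketch above, the expectation would be that a single tune() call writes both suggestions back onto the model's hparams (attribute names are illustrative):

    trainer = pl.Trainer(auto_lr_find=True, auto_scale_batch_size=True)
    trainer.tune(model)
    # Expected: both values updated in place by one tune() call
    print(model.hparams.batch_size)       # scaled batch size
    print(model.hparams.learning_rate)    # suggested initial learning rate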

Environment

Note: Bugs with code are solved faster! The Colab Notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

Additional context

github-actions[bot] commented 3 years ago

Hi! Thanks for your contribution! Great first issue!

colllin commented 3 years ago

I ran into this too, and refactored my code to initialize two separate trainers and models and tune() the batch_size and learning_rate separately. Then, out of an abundance of caution, I re-initialize the trainer and model once more before calling fit().

In case it helps as a clue, when I was debugging this, it seemed that the auto_scale_batch_size functionality does not properly replace the original model weights after tuning. I believe the LR finder does this, and therefore might have example code.
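
A rough sketch of that save-and-restore pattern (a hypothetical helper, not part of the Lightning API; it only assumes trainer.save_checkpoint() and a plain state_dict reload):

    import os
    import uuid
    import torch

    def tune_with_restore(trainer, model, datamodule=None):
        # Hypothetical helper: snapshot the weights, run the tuner, then restore
        # the original weights so fit() starts from an untouched model.
        ckpt_path = os.path.join(trainer.default_root_dir, f"__tune_restore_{uuid.uuid4()}.ckpt")
        trainer.save_checkpoint(ckpt_path)
        try:
            trainer.tune(model, datamodule=datamodule)
        finally:
            checkpoint = torch.load(ckpt_path, map_location="cpu")
            model.load_state_dict(checkpoint["state_dict"])
            os.remove(ckpt_path)
        return model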

Working code that tune()s separately:

        # 1) Tune only the batch size; the suggestion lands on the datamodule.
        dm = BoringDataModule()
        model = BoringModel(...)
        trainer = pl.Trainer(..., auto_scale_batch_size=True)
        trainer.tune(model, datamodule=dm)
        print('Suggested batch size:', dm.batch_size)

        # 2) Tune only the learning rate, on a freshly initialized model and trainer.
        model = BoringModel(...)
        trainer = pl.Trainer(..., auto_lr_find=True)
        trainer.tune(model, datamodule=dm)
        print('Suggested learning rate:', model.hparams.learning_rate)

        # 3) Re-initialize once more and train with the suggested values.
        model = BoringModel(...)
        trainer = pl.Trainer(...)
        trainer.fit(model, datamodule=dm)

edenlightning commented 3 years ago

@SkafteNicki