Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Cyclic learning rate finder as a part of Trainer #624

Closed suvojit-0x55aa closed 4 years ago

suvojit-0x55aa commented 4 years ago

🚀 Feature

Learning rate finder to plot lr vs loss relationship for Trainer and find a good starting learning rate.

Motivation

Cyclical Learning Rates for Training Neural Networks by Leslie N. Smith documents how to find a good learning rate for training with the CyclicLR scheduler.

Pitch

Adding a method to the Trainer class:
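A hypothetical sketch of how such a method might be used (the name lr_find, its arguments, and the returned object are illustrative assumptions, not the exact interface being proposed here):

    # hypothetical usage sketch; `lr_find`, `min_lr`, `max_lr` and the returned
    # `lr_finder` object are illustrative names, not an existing API
    model = MyLightningModule(hparams)   # any LightningModule
    trainer = Trainer()

    # sweep learning rates over a short run, recording the loss at each lr
    lr_finder = trainer.lr_find(model, min_lr=1e-7, max_lr=1.0)

    # inspect the lr-vs-loss curve and pick a good starting learning rate
    lr_finder.plot()
    model.hparams.learning_rate = lr_finder.suggestion()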

williamFalcon commented 4 years ago

this should be a learning rate scheduler no?

suvojit-0x55aa commented 4 years ago

This is not a scheduler per se, but it helps to find a good learning rate to start with, which can be used with other optimizers as well as for finding the [start_lr, end_lr] range of CyclicLR.

Here is another article regarding this.
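As a point of reference, once such a range has been found it would plug into PyTorch's built-in CyclicLR roughly as below (the concrete bounds and step size are placeholder values, and `model` is assumed to be any nn.Module):

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer,
        base_lr=1e-4,       # lower bound found by the sweep (placeholder)
        max_lr=1e-2,        # upper bound found by the sweep (placeholder)
        step_size_up=2000,  # iterations to climb from base_lr to max_lr
    )

    # stepped once per batch/iteration during training
    scheduler.step()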

FrancescoSaverioZuppichini commented 4 years ago

something like https://docs.fast.ai/callbacks.lr_finder.html right?

suvojit-0x55aa commented 4 years ago

@FrancescoSaverioZuppichini yes. Since Trainer already has access to the model and training data, it would be a great feature for the Lightning community.

FrancescoSaverioZuppichini commented 4 years ago

I totally agree. Maybe it can be easily copied directly from fastai.

williamFalcon commented 4 years ago

let’s not copy anything from fast.ai. I’d rather be able to import from fast.ai and use it.

I’d like Lightning to work well with the other libraries.

so, the flow should be to allow support for this component from fast.ai and maybe generalize a bit to enable other components to work with lightning

FrancescoSaverioZuppichini commented 4 years ago

IMHO it doesn't make any sense to force the user to install fastai only to use a subfeature that can (and should) be present in Lightning. Lightning should replace other libraries like fastai.

For example, I don't like fastai, the code base is not great and the doc is terrible, I would like to avoid installing it again on my machine to just use one feature.

suvojit-0x55aa commented 4 years ago

@FrancescoSaverioZuppichini @williamFalcon Lightning works great as the light-weight wrapper it is; it provides flexibility as well as extensibility. I suggested this feature because it requires a few components to work together, like the optimizer, dataloader and the model; in Trainer we have all of those in the same place, and the technique is proven to work quite well in practice. So we can take inspiration from libraries like fast.ai, from the PyTorch implementation here, as well as from this Keras implementation here, to implement it in Lightning.

williamFalcon commented 4 years ago

@tullie @neggert @jeffling @Borda thoughts?

Borda commented 4 years ago

let’s not copy anything from fast.ai. I’d rather be able to import from fast.ai and use it.

totally agree, if they make some corrections we would get them too and would not need to dig into what is wrong again...

I’d like Lightning to work well with the other libraries.

agree, there is already such use with torchvision

so, the flow should be to allow support for this component from fast.ai and maybe generalize a bit to enable other components to work with lightning

maybe something like what we did with the logger: have an abstract class and then implement this LR finder on top of it

neggert commented 4 years ago

My only strong opinion is that we should not include fastai as a dependency, mostly because fastai has a ton of very heavy dependencies itself that would get pulled in.

williamFalcon commented 4 years ago

it wouldn’t be a dependency. I mean the ability to work with other libraries; take the approach we took with MLflow, as Borda suggested.

jeffling commented 4 years ago

I actually did a bit of research into this and implemented it at work. It's very easy.

fastai's implementation just does a small run while tracking learning rate and loss, and then prints out the chart. They also have an option for finding the 'optimal' learning rate, but it's different for every use-case so even in the course they look at the graph and do it intuitively.

The easiest way to implement this with Lightning that I can think of (see the sketch after this list):

  1. Use a learning rate scheduler that steps through the learning rate range you'd like to explore.
  2. Do a short run (1 epoch) using that learning rate scheduler: make a model and Trainer and run fit().
  3. Use TensorBoard or W&B or anything you want to graph loss vs learning rate (fastai prints a matplotlib graph), or write some code to find the 'optimal' learning rate from the emitted logs.
  4. Choose your learning rate.
  5. Plug that number into a new Trainer/Model instance (remember to set the old one to .cpu()). If you used this technique you'll probably want to use another scheduler.
  6. Run Trainer.fit as you want.
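A minimal sketch of that recipe, assuming a plain LightningModule and PyTorch's LambdaLR for an exponential sweep (the module, the bounds, and the logging calls are placeholders for illustration, not an official API):

    import torch

    # assumption: `model` is your LightningModule and this optimizer is the one
    # returned from its configure_optimizers; bounds and step count are placeholders
    min_lr, max_lr, num_steps = 1e-7, 1.0, 100
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr)

    # grow the lr exponentially from min_lr to max_lr over num_steps iterations
    gamma = (max_lr / min_lr) ** (1.0 / num_steps)
    sweep_scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: gamma ** step
    )

    # inside training_step, log loss against the current lr, e.g.
    #   lr = optimizer.param_groups[0]["lr"]
    #   self.logger.experiment.add_scalar("lr_find/lr", lr, batch_idx)
    #   self.logger.experiment.add_scalar("lr_find/loss", loss, batch_idx)
    # and call sweep_scheduler.step() once per batch (see the stepping caveat below)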

Regarding using fast.ai, I don't think it would be possible to just use it, as the user would need to have a fast.ai model as well. Maybe there is an adapter we can provide in the future.

I suggested this feature because it requires a few components to work together, like the optimizer, dataloader and the model

Regarding the optimizer, dataloader, model, I think we don't need any improvements there as you can do everything with Trainer already. BUT, we currently do not have the ability to .step() the learning rate scheduler every iteration, so that is the main blocker.

You can easily work around this by keeping a reference to the scheduler and stepping yourself, but lightning could also add this functionality.

TLDR: work needed for this:

suvojit-0x55aa commented 4 years ago

@williamFalcon @jeffling should we track the feature of stepping LR schedulers at every iteration in a separate issue?

FrancescoSaverioZuppichini commented 4 years ago

@jeffling sounds great, I think it should be easy to add it to the trainer

williamFalcon commented 4 years ago

@FrancescoSaverioZuppichini @suvojit-0x55aa want to implement for lightning and submit a PR? @Borda

Would be great to get this into the next release.

DrClick commented 4 years ago

I have tried doing this in the training_step

    # log the current learning rate against the global batch index
    current_lr = self.trainer.lr_schedulers[0].get_lr()[0]
    current_batch_nb = self.trainer.total_batch_idx
    self.logger.experiment.add_scalar("learning_rate", current_lr, current_batch_nb)

    ...

    # manually step the scheduler every batch
    self.trainer.lr_schedulers[0].step()

which does step the scheduler. However, when trying to use torch.optim.lr_scheduler.OneCycleLR, where the number of iterations spans epochs, the learning rate is reset to the base learning rate at the start of each epoch. Does anyone know how I can stop this from happening? For OneCycleLR, the length of the cycle is normally the entire set of epochs you want to train, so resetting at the start of each epoch breaks it.

suvojit-0x55aa commented 4 years ago

@DrClick yes, actually this issue was pointed out by @jeffling here. We also need to implement the feature of stepping the scheduler every iteration.

DrClick commented 4 years ago

@suvojit-0x55aa thank you, I am referencing this

You can easily work around this by keeping a reference to the scheduler and stepping yourself, but lightning could also add this functionality.

I have done that with my code snippet; however, at each epoch the learning rate is reset to the base learning rate for some reason. This results in the learning rate schedule being truncated at every epoch. I suspect this has to do with the call to optimizer.step(), and possibly I am changing the learning rate of the scheduler out of context... Any thoughts? Additionally, at the end of each epoch, training_loop.py calls scheduler.step(), which adds num_gpus extra steps.

Borda commented 4 years ago

@FrancescoSaverioZuppichini @suvojit-0x55aa @DrClick any thought on the implementation?

DrClick commented 4 years ago

I am still looking into why this happens. I am happy to make a PR when I find a solution. I think this is a pretty critical issue; it flatly goes against the first stated design principle of "no PyTorch interference".

suvojit-0x55aa commented 4 years ago

@Borda @DrClick I looked into it, but am still not able to pinpoint the issue; I'll update here if I find anything.

DrClick commented 4 years ago

I have found the issue and have a solution, but I would like to discuss the possible solutions first. I am happy to submit a PR. Basically, the training loop calls lr_scheduler.step(epoch=self.current_epoch), which has the effect of resetting an iterative learning rate scheduler. Possible solution: expose a method on the trainer to step iterative schedulers, and return a tuple or dict of optimizers, epoch schedulers, and iterative schedulers from model.configure_optimizers. @williamFalcon if this sounds good I will code this up.
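A rough sketch of what that proposal might look like from the user's side (the dict keys and the split into epoch vs. iteration schedulers are hypothetical and do not exist in Lightning; the optimizer and scheduler settings are placeholders):

    # hypothetical configure_optimizers returning schedulers grouped by
    # how often they should be stepped; none of these keys are a real API
    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=self.hparams.lr)
        one_cycle = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=self.hparams.max_lr, total_steps=self.hparams.total_steps
        )
        plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        return {
            "optimizers": [optimizer],
            "epoch_schedulers": [plateau],        # stepped once per epoch
            "iteration_schedulers": [one_cycle],  # stepped once per batch
        }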

This, however, leads to my second problem. I got this to work (by catching the call to my scheduler and ignoring calls where the epoch is provided), but what I cannot find is a clean way to train for a while, stop training, change the learning rate schedulers, and continue training. I have worked around it with the following, but this is clearly pretty unsatisfactory. I was wondering about your thoughts on this, and if there is interest I can start an issue and work on it.

    print("Entering Phase 1. Freezing the base resnet")
    model.freeze_to(-1)
    trainer.fit(model)

    print("Entering Phase 2. Unfreezing the base resnet and lowering learning rates")

    # load the best model and unfreeze the resnet layer
    best_model_ckpt, best_epoch = get_best_model_ckpt(save_checkpoint_dir)

    print("\treload the best model from phase 1 of the training")
    model = BaseModel.load_from_checkpoint(best_model_ckpt)

    print("\tunfreezing the base layer")
    model.unfreeze()

    print("\tsetting a new learning rate for the head")
    # We are going to manually resume this from the next epoch and keep the same global step.
    # This skips a bunch of things we want to skip mainly resetting the global step and epoch
    # to that of the checkpoint
    trainer.resume_from_checkpoint = None

    # step trainer bookkeeping manually
    trainer.current_epoch += 1
    trainer.global_step += 1
    trainer.max_epochs += hparams.phase_2_cycle_epochs

    # reset the hyperparameters and recreate the schedulers
    trainer.get_model().hparams.max_learning_rate_head = hparams.phase_2_max_learning_rate_head
    trainer.get_model().hparams.cycle_epochs = hparams.phase_2_cycle_epochs
    trainer.optimizers, trainer.lr_schedulers = trainer.init_optimizers(trainer.get_model().configure_optimizers())

    # reenable the progress bar for training
    pbar = tqdm.tqdm(leave=True, position=2 * trainer.process_position,
                    disable=not trainer.show_progress_bar, dynamic_ncols=True, unit='batch',
                    file=sys.stdout)
    trainer.main_progress_bar = pbar

    # clear cache before training
    if trainer.on_gpu:
        torch.cuda.empty_cache()

    # resume training without reinit
    trainer.train()

    print("Training completed.")

FrancescoSaverioZuppichini commented 4 years ago

And this library should make things easier 😂

schwobr commented 4 years ago

The way I implemented one-cycle is by completely overwriting lightning's base scheduler implementation. Basically I add a step_on_batch attribute to every scheduler, which is set to True for schedulers that need to be updated at every batch (like OneCycleLR). I then store the scheduler as an attribute of the model, and use hooks to update it, like:

    def on_batch_end(self):
        if self.sched is not None and self.sched.step_on_batch:
            self.sched.step()

    def on_epoch_end(self):
        if self.sched is not None and not self.sched.step_on_batch:
            self.sched.step()

Note that you can also use hooks to reset things between two phases of training.
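For completeness, a minimal sketch of how this could be wired into a LightningModule (the step_on_batch flag and the self.sched attribute are the convention described above rather than a built-in Lightning feature; the optimizer and scheduler settings are placeholders):

    import torch
    import pytorch_lightning as pl

    class MyModel(pl.LightningModule):
        # forward, training_step, dataloaders, and the two hooks above omitted

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            # keep the scheduler on the model instead of handing it to Lightning,
            # so the on_batch_end / on_epoch_end hooks control the stepping
            self.sched = torch.optim.lr_scheduler.OneCycleLR(
                optimizer, max_lr=1e-3, total_steps=1000
            )
            self.sched.step_on_batch = True  # step every batch, not every epoch
            return optimizer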

teichert commented 4 years ago

Thanks for this feature! The Leslie Smith paper recommends using the LR sweep to choose lower and upper bounds (a.k.a. base_lr and max_lr respectively) for the Cyclic Learning Rate scheduler. Am I right that what is implemented here sweeps learning rates, allows users to inspect results, and suggests a reasonable learning rate, BUT that it isn't immediately usable for setting the parameters of the CyclicLR scheduler? Furthermore, if I use the CyclicLR scheduler (i.e. I return it along with my optimizer from configure_optimizers), won't the LR sweep also be using that CLR scheduler as well (which I don't think I want)?

(I'm new to pytorch-lightning, so I'm guessing that I'm just missing something obvious and that this is already set up to work easily.)

kswamy15 commented 4 years ago

The way I implemented one-cycle is by completely overwriting lightning's base scheduler implementation. Basically I add a step_on_batch attribute to every scheduler, which is set to True for schedulers that need to be updated at every batch (like OneCycleLR). I then store the scheduler as an attribute of the model, and use hooks to update it, like:

    def on_batch_end(self):
        if self.sched is not None and self.sched.step_on_batch:
            self.sched.step()

    def on_epoch_end(self):
        if self.sched is not None and not self.sched.step_on_batch:
            self.sched.step()

Note that you can also use hooks to reset things between two phases of training.

Can you elaborate on this further? I have my optimizer like this:

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for closure function)
        optimizer = torch.optim.Adam([p for p in self.parameters() if p.requires_grad], lr=self.hparams.learning_rate, eps=1e-08)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=2e-4, total_steps=1000)
        return [optimizer], [scheduler]

So should I declare self.sched = scheduler in this function? How do I add a 'step_on_batch' attribute to the scheduler here?
Thanks in advance for your help.

kswamy15 commented 4 years ago

I used the code above for on_batch_end and on_epoch_end and was able to change the learning rate every batch. I used a print statement in on_batch_end to verify that it did change, so this hack works:

    for group in self.optim.param_groups:
        print('learning rate', group['lr'])

tbenst commented 4 years ago

@teichert did you come up with a solution / "best practice" for using the lr finder with one cycle? Appreciate any tips or pointers!

Edit: my current understanding is that the auto lr finder is not currently appropriate for cyclic learning rates, as it suggests a single learning rate somewhere in the middle of the swept range. Instead, we need to determine the base and max learning rates ourselves.

In this lr vs. loss plot (image not shown):

We would want (approximately) lr_base=5e-5 and lr_max=10e-3, where we just start to see convergence and where we see the lowest loss, respectively.
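One possible way to pick those bounds by hand, assuming the lr finder API of that era (trainer.lr_find(model) returning an object whose .results dict holds the swept lr and loss lists; exact names may differ between versions, and the heuristics here are only one choice):

    # sketch: derive CyclicLR bounds from the lr finder's recorded sweep;
    # `trainer`, `model` and `optimizer` are assumed to exist as usual
    lr_finder = trainer.lr_find(model)
    lrs, losses = lr_finder.results["lr"], lr_finder.results["loss"]

    # max_lr: roughly where the loss is lowest before it blows up
    max_lr = lrs[losses.index(min(losses))]

    # base_lr: a couple of orders of magnitude below max_lr, roughly where
    # the loss first starts to drop in a plot like the one above
    base_lr = max_lr / 100

    scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=base_lr, max_lr=max_lr)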