JoshVarty / pytorch-retinanet

Reproducing the Detectron implementation of RetinaNet

Learning Rate #5

Closed JoshVarty closed 5 years ago

JoshVarty commented 5 years ago

I'd like to match our learning rate schedule to Detectron's.

In the config they define:

SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.00125
  GAMMA: 0.1
  MAX_ITER: 720000
  STEPS: [0, 480000, 640000]

The relevant call chain in Detectron is:

main()
    train_model()
      get_lr_at_iter(it)
            lr_func_steps_with_decay(cur_iter)
                  get_step_index(cur_iter)
      UpdateWorkspaceLr(it)
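
For reference, here's a minimal sketch (mine, not from either repo) of how the same step schedule could be expressed on the PyTorch side with torch.optim.lr_scheduler.MultiStepLR, assuming the scheduler is stepped once per iteration rather than per epoch; the model, momentum value and loop are purely illustrative:

import torch

# Toy model/optimizer purely to illustrate the schedule.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.00125,
                            momentum=0.9, weight_decay=0.0001)

# Detectron's STEPS list starts with 0; MultiStepLR only wants the decay points.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[480000, 640000], gamma=0.1)

for it in range(720000):    # MAX_ITER
    # forward/backward for the real model would go here
    optimizer.step()
    scheduler.step()        # stepped per iteration, not per epoch

Note this only covers the step decay; the 500-iteration warmup described below would still have to be layered on top (e.g. by scaling the LR manually or wrapping the whole schedule in a LambdaLR).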
JoshVarty commented 5 years ago

get_step_index(cur_iter)

Source

  1. Create a steps list that includes MAX_ITER
    • [0, 480000, 640000, 720000]
  2. Loop over steps
    for ind, step in enumerate(steps): 
        if cur_iter < step:
            break
    return ind - 1
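
Putting that together as a standalone sketch (our cfg values hard-coded for illustration):

STEPS = [0, 480000, 640000]     # cfg.SOLVER.STEPS
MAX_ITER = 720000               # cfg.SOLVER.MAX_ITER

def get_step_index(cur_iter):
    """Return the index of the LR step that cur_iter falls in."""
    steps = STEPS + [MAX_ITER]  # [0, 480000, 640000, 720000]
    for ind, step in enumerate(steps):
        if cur_iter < step:
            break
    return ind - 1

assert get_step_index(0) == 0
assert get_step_index(479999) == 0
assert get_step_index(480000) == 1
assert get_step_index(640000) == 2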
JoshVarty commented 5 years ago

lr_func_steps_with_decay(cur_iter)

Source

def lr_func_steps_with_decay(cur_iter):
    """For cfg.SOLVER.LR_POLICY = 'steps_with_decay'

    Change the learning rate specified iterations based on the formula
    lr = base_lr * gamma ** lr_step_count.

    Example:
    cfg.SOLVER.MAX_ITER: 90
    cfg.SOLVER.STEPS:    [0,    60,    80]
    cfg.SOLVER.BASE_LR:  0.02
    cfg.SOLVER.GAMMA:    0.1
    for cur_iter in [0, 59]   use 0.02 = 0.02 * 0.1 ** 0
                 in [60, 79]  use 0.002 = 0.02 * 0.1 ** 1
                 in [80, inf] use 0.0002 = 0.02 * 0.1 ** 2
    """
    ind = get_step_index(cur_iter)
    return cfg.SOLVER.BASE_LR * cfg.SOLVER.GAMMA ** ind

Porting their example to our values:

Example:
cfg.SOLVER.MAX_ITER: 720000
cfg.SOLVER.STEPS:    [0, 480000, 640000]
cfg.SOLVER.BASE_LR:  0.00125
cfg.SOLVER.GAMMA:    0.1
for cur_iter in [0, 479999]       use 0.00125 = 0.00125 * 0.1 ** 0
             in [480000, 639999]  use 0.000125 = 0.00125 * 0.1 ** 1
             in [640000, inf]     use 0.0000125 = 0.00125 * 0.1 ** 2
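
A quick numeric check of those ported values (self-contained; the index here is computed the same way get_step_index above ends up computing it):

BASE_LR, GAMMA = 0.00125, 0.1    # cfg.SOLVER.BASE_LR / cfg.SOLVER.GAMMA
STEPS = [0, 480000, 640000]      # cfg.SOLVER.STEPS

def lr_at(cur_iter):
    # Index of the last step boundary that cur_iter has reached.
    ind = max(i for i, s in enumerate(STEPS) if cur_iter >= s)
    return BASE_LR * GAMMA ** ind

assert abs(lr_at(0)      - 0.00125)   < 1e-9
assert abs(lr_at(479999) - 0.00125)   < 1e-9
assert abs(lr_at(480000) - 0.000125)  < 1e-9
assert abs(lr_at(640000) - 0.0000125) < 1e-9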
JoshVarty commented 5 years ago

get_lr_at_iter(it)

After getting the scheduled learning rate, we scale it down during the first 500 iterations in order to "warm up" training.

Source

  1. Get the learning rate according to our schedule.
  2. If we're done with warmup (i.e. it >= 500), just return lr
  3. Otherwise, we're going to use a linear warmup method.
  4. alpha = it / cfg.SOLVER.WARM_UP_ITERS
    • 0.0 = 0 / 500
  5. warmup_factor = cfg.SOLVER.WARM_UP_FACTOR * (1 - alpha) + alpha
    • 0.3333 = 0.3333 * (1 - 0) + 0 (WARM_UP_FACTOR defaults to 1/3)
  6. Adjust lr by the warm up factor. lr *= warmup_factor
    • 0.0004166 = 0.00125 * 0.3333
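
As a standalone sketch of that warmup step, assuming Detectron's defaults of WARM_UP_ITERS = 500, WARM_UP_FACTOR = 1/3 and the linear method (warmed_up_lr and scheduled_lr are hypothetical names; scheduled_lr stands in for the output of lr_func_steps_with_decay):

WARM_UP_ITERS = 500        # cfg.SOLVER.WARM_UP_ITERS
WARM_UP_FACTOR = 1.0 / 3   # cfg.SOLVER.WARM_UP_FACTOR

def warmed_up_lr(it, scheduled_lr):
    """Scale scheduled_lr down linearly during the first WARM_UP_ITERS."""
    if it >= WARM_UP_ITERS:
        return scheduled_lr
    alpha = it / WARM_UP_ITERS
    warmup_factor = WARM_UP_FACTOR * (1 - alpha) + alpha
    return scheduled_lr * warmup_factor

# warmed_up_lr(0,   0.00125) ≈ 0.000417   (the 0.0004166 above)
# warmed_up_lr(250, 0.00125) ≈ 0.000833
# warmed_up_lr(500, 0.00125) == 0.00125   (warmup finished)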
JoshVarty commented 5 years ago

We're already using the correct LR. We aren't correcting for momentum, but that's not what's causing my issues at the moment. Detectron only corrects for momentum when there is a large shift in LR.