@eypros Thanks for the report.

This is intended behavior. "Actual" LR is, in fact, not lr; LR is scaled by the betas (regular Adam), then by eta_t. Unlike the tf.keras optimizers, the keras implementations do have an lr_t to track the true LR. It was a design decision to omit it from tf.keras per performance concerns, but admittedly it is a useful feature, and the performance impact might be negligible. I'll consider it for the next release.
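To make that scaling concrete, below is a rough sketch of the effective per-step learning rate; the exact expression keras_adamw uses internally may differ in its details, but it follows the standard Adam bias correction, further multiplied by eta_t:

import numpy as np

def effective_lr(lr, beta_1, beta_2, t, eta_t):
    # Standard Adam bias-corrected step size, then scaled by the
    # cosine-annealing multiplier eta_t (a sketch, not the library's code)
    lr_bias_corrected = lr * np.sqrt(1 - beta_2 ** t) / (1 - beta_1 ** t)
    return lr_bias_corrected * eta_t

# With default betas at iteration 1, the bias-corrected lr is ~0.316 * lr
print(effective_lr(lr=1e-3, beta_1=0.9, beta_2=0.999, t=1, eta_t=1.0))  # ~3.16e-04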
You can verify that eta_t is effective with the code below. I'll pin this issue for now in case anyone else wonders; feel free to re-open if there are any further concerns (or just comment).
P.S., setting the TF_EAGER environment variable is redundant; it's done in testing to control eager/graph behavior in the tests directory, but keras_adamw detects it automatically.
Actually... you'll see the bias weights do change. In fact, it'll always be the very last weight in the network. This is a legitimate bug, and I'll fix it soon (Issue here); in the meantime, you can apply the fix below in your local install:
Rearrange code in _resource_apply_dense and _resource_apply_sparse as follows (keep var_update as-is, move the others below it):
var_update = state_ops.assign(var, var_t, use_locking=self._use_locking)

# Learning rate multipliers
# Cosine annealing
# (t_cur / eta_t bookkeeping now runs after the variable update)
(iteration_done, t_cur_update, eta_t_update
 ) = _update_t_cur_eta_t_apply_lr_mult(self, lr_t, var)
if iteration_done and not self._init_notified:
    self._init_notified = True
Fixed in v1.32, and added lr_t. See the updated example.py.
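Reading the new attribute should look roughly like the sketch below (an assumption on my part that lr_t is exposed on the optimizer analogously to eta_t; example.py is the authoritative reference):

import tensorflow.keras.backend as K

lr_history = []
for _ in range(24):
    model.train_on_batch(x, y)  # model, x, y as in the earlier eta_t sketch
    # Assumption: lr_t holds the bias-corrected, eta_t-scaled learning rate
    lr_history.append(K.eval(model.optimizer.lr_t))
print(lr_history)  # should fluctuate along the cosine schedule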
I will examine the changes you made, but as a first comment: I was setting TF_EAGER explicitly because in my case it's unset, and the code complains when it checks for the actual value.
@eypros That's strange - what's the "complaint", a warning? And which TF version?
I am a bit confused about the optimizer's actual lr at each batch.

I have noticed that there is a (now closed) issue regarding the Usage & concept questions where you refer to the actual lr (learning rate) being lr * eta_t. But if I use your example as a basis and include a plot of the lr at each batch, there does not appear to be any fluctuation of the actual lr, regardless of the values eta_t is assigned.
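Roughly, what I do per batch is the following (a simplified sketch; num_batches, model, x and y stand in for my actual setup):

import tensorflow.keras.backend as K

lrs = []
for _ in range(num_batches):
    model.train_on_batch(x, y)
    # read the optimizer's lr variable after each batch
    lrs.append(K.get_value(model.optimizer.lr))
print(lrs)  # in my runs this stays constant, with no visible fluctuation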