cjlin1 / simpleNN

Behavior of learning rate decay #3

Open · djshen opened this issue 4 years ago

djshen commented 4 years ago

Currently, the learning rate decay happens after each iteration and the update rule is

lr = config.lr/(1 + args.lr_decay*step)

So, the learning rate at step 0 and step 1 will be the same value, config.lr. Is this the expected behavior? Or is the following correct?

lr = config.lr/(1 + args.lr_decay*(step+1))
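
For concreteness, with placeholder values config.lr = 0.1 and args.lr_decay = 0.1, the current rule computes 0.1/(1 + 0.1*0) = 0.1 after step 0, so step 1 trains with 0.1 again; the (step+1) variant already gives 0.1/(1 + 0.1*1) ≈ 0.0909 for step 1.
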
cjlin1 commented 4 years ago

We don't know if this causes any significant differences, but you can check through experiments as part of the project.

djshen commented 4 years ago

If I use tf.keras.optimizers.schedules.InverseTimeDecay in my code, I need to modify either simpleNN or TensorFlow to get exactly "the same results".

djshen commented 4 years ago

The following is a simple example.

import tensorflow as tf

lr_init = 0.1
lr_decay = 0.1

lr_keras = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=lr_init,
    decay_steps=1,
    decay_rate=lr_decay)
lr_simplenn = lr_init

for step in range(11):
    # Keras evaluates the learning rate "before" the batch; convert the
    # resulting tensor to a Python float for printing
    lr_keras_value = float(lr_keras(step))
    print('Step {:2d}: train one batch with lr_keras {:.6f} and lr_simplenn {:.6f}'.format(
        step, lr_keras_value, lr_simplenn))
    # simpleNN updates the learning rate "after" the batch, using the current step
    lr_simplenn = lr_init / (1 + lr_decay * step)

The output is

Step  0: train one batch with lr_keras 0.100000 and lr_simplenn 0.100000
Step  1: train one batch with lr_keras 0.090909 and lr_simplenn 0.100000
Step  2: train one batch with lr_keras 0.083333 and lr_simplenn 0.090909
Step  3: train one batch with lr_keras 0.076923 and lr_simplenn 0.083333
Step  4: train one batch with lr_keras 0.071429 and lr_simplenn 0.076923
Step  5: train one batch with lr_keras 0.066667 and lr_simplenn 0.071429
Step  6: train one batch with lr_keras 0.062500 and lr_simplenn 0.066667
Step  7: train one batch with lr_keras 0.058824 and lr_simplenn 0.062500
Step  8: train one batch with lr_keras 0.055556 and lr_simplenn 0.058824
Step  9: train one batch with lr_keras 0.052632 and lr_simplenn 0.055556
Step 10: train one batch with lr_keras 0.050000 and lr_simplenn 0.052632

If I change step to (step + 1) in the last line, the output will be

Step  0: train one batch with lr_keras 0.100000 and lr_simplenn 0.100000
Step  1: train one batch with lr_keras 0.090909 and lr_simplenn 0.090909
Step  2: train one batch with lr_keras 0.083333 and lr_simplenn 0.083333
Step  3: train one batch with lr_keras 0.076923 and lr_simplenn 0.076923
Step  4: train one batch with lr_keras 0.071429 and lr_simplenn 0.071429
Step  5: train one batch with lr_keras 0.066667 and lr_simplenn 0.066667
Step  6: train one batch with lr_keras 0.062500 and lr_simplenn 0.062500
Step  7: train one batch with lr_keras 0.058824 and lr_simplenn 0.058824
Step  8: train one batch with lr_keras 0.055556 and lr_simplenn 0.055556
Step  9: train one batch with lr_keras 0.052632 and lr_simplenn 0.052632
Step 10: train one batch with lr_keras 0.050000 and lr_simplenn 0.050000

With this modification, I get exactly the same loss values from simpleNN and from a tf.keras counterpart in which I replaced almost everything in simpleNN with tf.keras.
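
For reference, this is how the Keras side consumes such a schedule (a minimal sketch, not simpleNN code): a LearningRateSchedule passed to an optimizer is evaluated at the current iteration count before each update, so the very first batch is taken with the initial learning rate, which is what the (step + 1) convention reproduces on the simpleNN side.

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1,
    decay_steps=1,
    decay_rate=0.1)

# The optimizer evaluates the schedule at optimizer.iterations before applying
# each batch update, so the first batch uses the initial learning rate.
opt = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
print(float(lr_schedule(opt.iterations)))  # 0.1 before any update has been applied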