@eypros Thanks for the report.

This is intended behavior. "Actual" LR is, in fact, not lr; LR is scaled by the betas (regular Adam), then by eta_t. Unlike the tf.keras optimizers, the keras implementations do have an lr_t to track the true LR. It was a design decision to omit it from tf.keras per performance concerns, but admittedly it is a useful feature, and the performance impact might be negligible. I'll consider it for the next release.
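To make that scaling concrete, below is a rough sketch of the effective per-step learning rate; the exact expression keras_adamw uses internally may differ in its details, but it follows the standard Adam bias correction, further multiplied by eta_t:

import numpy as np

def effective_lr(lr, beta_1, beta_2, t, eta_t):
    # Standard Adam bias-corrected step size, then scaled by the
    # cosine-annealing multiplier eta_t (a sketch, not the library's code)
    lr_bias_corrected = lr * np.sqrt(1 - beta_2 ** t) / (1 - beta_1 ** t)
    return lr_bias_corrected * eta_t

# With default betas at iteration 1, the bias-corrected lr is ~0.316 * lr
print(effective_lr(lr=1e-3, beta_1=0.9, beta_2=0.999, t=1, eta_t=1.0))  # ~3.16e-04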
You can verify that eta_t is effective with the code below. I'll pin this issue for now in case anyone else wonders; feel free to re-open if there are any further concerns (or just comment).
P.S., setting the TF_EAGER environment variable is redundant; it's done in testing to control eager/graph behavior in the tests directory, but keras_adamw detects it automatically.
Actually... you'll see the bias weights do change. In fact, it'll always be the very last weight in the network. This is a legitimate bug, and I'll fix it soon (Issue here); in the meantime, you can apply the fix below in your local install:
Rearrange code in _resource_apply_dense and _resource_apply_sparse as follows (keep var_update as-is, move the others below it):
var_update = state_ops.assign(var, var_t, use_locking=self._use_locking)

# Learning rate multipliers
# Cosine annealing
# (t_cur / eta_t bookkeeping now runs after the variable update)
(iteration_done, t_cur_update, eta_t_update
 ) = _update_t_cur_eta_t_apply_lr_mult(self, lr_t, var)
if iteration_done and not self._init_notified:
    self._init_notified = True
Fixed in v1.32, and added lr_t. See the updated example.py.
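Reading the new attribute should look roughly like the sketch below (an assumption on my part that lr_t is exposed on the optimizer analogously to eta_t; example.py is the authoritative reference):

import tensorflow.keras.backend as K

lr_history = []
for _ in range(24):
    model.train_on_batch(x, y)  # model, x, y as in the earlier eta_t sketch
    # Assumption: lr_t holds the bias-corrected, eta_t-scaled learning rate
    lr_history.append(K.eval(model.optimizer.lr_t))
print(lr_history)  # should fluctuate along the cosine schedule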
I will examine the changes you made, but as a first comment: I was setting TF_EAGER explicitly because in my case it's unset, and the code complains when it checks for the actual value.
@eypros That's strange - what's the "complaint", a warning? And which TF version?
I am a bit confused about the optimizer's actual lr at each batch.

I have noticed that there is a (now closed) issue regarding the Usage & concept questions where you refer to the actual lr (learning rate) being lr * eta_t. But if I use your example as a basis and include a plot of the lr at each batch, there does not appear to be any fluctuation of the actual lr, regardless of the values eta_t is assigned.
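Roughly, what I do per batch is the following (a simplified sketch; num_batches, model, x and y stand in for my actual setup):

import tensorflow.keras.backend as K

lrs = []
for _ in range(num_batches):
    model.train_on_batch(x, y)
    # read the optimizer's lr variable after each batch
    lrs.append(K.get_value(model.optimizer.lr))
print(lrs)  # in my runs this stays constant, with no visible fluctuation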