oxinabox opened this issue 5 years ago
I was initially uneasy about putting what feels like meta-information in the forward pass / model structure, but it's really no different to `@show`/`@showgrad`, `dropgrad`, `Freeze`, gradient clipping, etc.; it's the right way to do this and is consistent with every other way we have of manipulating model training.
The main thing this doesn't seem to address is how we'll deal with optimiser state. For this I'm pretty sure we need something layer-like. I'm imagining writing this as

```julia
model = Optimiser(Dense(10, 5), ADAM(0.1))
```

And you'd be able to do `model(x)` as usual, but it'd obviously behave differently when training. This works nicely with our current `update(rule, model, dmodel, state)` design: if the model is an `Optimiser`, we just ignore the current rule and switch to the one it carries. At the outermost layer you'd call `step!(Optimiser(ADAM(0.01)), loss)` and it would internally work with `Optimiser(loss, ADAM(0.01))`, so there's no difference between the "global" and "local" optimisers.
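A minimal sketch of this dispatch pattern, using hypothetical stand-ins for `Dense`, `ADAM`, and `update` (the struct layouts and the stub bodies below are assumptions for illustration, not the real Flux API):

```julia
# Hypothetical stand-ins; only the dispatch pattern is the point here.
struct ADAM
    eta::Float64
end

struct Dense
    W::Matrix{Float64}
    b::Vector{Float64}
end
Dense(in::Integer, out::Integer) = Dense(zeros(out, in), zeros(out))
(d::Dense)(x) = d.W * x .+ d.b

# The wrapper: forwards calls unchanged, but carries its own rule.
struct Optimiser{M,R}
    model::M
    rule::R
end
(o::Optimiser)(x) = o.model(x)

# Stub base case: a real implementation would apply `rule` to update
# the parameters of `model` using `dmodel` and `state`.
update(rule, model, dmodel, state) = (rule, model)

# The key dispatch: hitting an `Optimiser` ignores the incoming rule
# and switches to the locally attached one for everything beneath it.
update(rule, o::Optimiser, dmodel, state) = update(o.rule, o.model, dmodel, state)
```

Here `Optimiser(Dense(10, 5), ADAM(0.1))` still behaves like the wrapped `Dense` in the forward pass, while `update` picks up the local `ADAM(0.1)` regardless of whatever outer rule it was called with.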
@oxinabox wanted the extra indirection of `OptMarker` to be able to still declaratively modify optimisers. I think you can get the benefits of that without the indirection, but would be happy to look at cases that don't appear possible/easy with `Optimiser`.
I also was initially uneasy about this, since it feels a lot like TensorFlow's stitching of the optimizer into the graph that defines the network. But then I was thinking about `dropgrad`. I guess `dropout` is the same kind of thing too: it is a convenience that is more to do with training than with the real model, as you don't use dropout after training is done.

It might indeed be that you can get it without the indirection.
@MikeInnes and I had a long discussion of this.
One might reasonably want to use a different optimizer for different parts of the network. Or equivalently, different optimizer parameters (e.g. learning rate) for different parts of the network.
Under various circumstances, e.g. when using `step!` (rather than `gradient` and `update!`), or when the model is not a `Chain` but some complex closure, suitable references to parts of the network can be hard to access.

The API we came up with is an extension of how `dropgrad` is used. From an optimisation perspective, `dropgrad(x)` says: set the learning rate to zero for `x` (and for all things below `x`
if it isn't itself a network parameter).

The notion is to add a function `set_optmarker(x, om)`, which will link `x` to a predeclared `OptMarker()` object, for which optimization rules can be defined; or, if they are not defined, it falls back to a default. `set_optmarker` would be applied to arrays, to control the optimizer that is used to adjust the values of that array, or to structures or closures (i.e. layers), in which case it is applied recursively, so it would apply, for example, to both the weights and the biases of a dense layer.

An example:
The normal way to optimize (using `step!` from https://github.com/FluxML/Flux.jl/issues/666#issuecomment-471309222) would still work:
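As a rough illustration of that "normal way", assuming the `step!(opt, loss)` shape from the linked comment (all the definitions below are hypothetical stand-ins, not Flux code):

```julia
# Hypothetical stand-ins for illustration only.
struct ADAM
    eta::Float64
end

struct Dense
    W::Matrix{Float64}
    b::Vector{Float64}
end
Dense(in::Integer, out::Integer) = Dense(zeros(out, in), zeros(out))
(d::Dense)(x) = d.W * x .+ d.b

mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)

# Stub of `step!`: just evaluates the loss; a real implementation would
# also take gradients and update the model's parameters with `opt`.
step!(opt, loss) = loss()

model = Dense(10, 5)
x, y = ones(10), zeros(5)

# One global optimiser for the whole model -- the "normal way".
step!(ADAM(0.01), () -> mse(model(x), y))
```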
But you would also have the option to define an optimizer to be used at each location.

`set_optmarker` would also come in a layer-wrapping form, where a layer is wrapped in a `WithOptMarker` that applies the marker to everything inside it.

For comparison, the equivalent for `dropgrad` would be `Freeze`
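A sketch of how the wrapping form might be defined. Only the names `OptMarker`, `set_optmarker`, and `WithOptMarker` come from the discussion; the registries, `set_rule!`, `rule_for`, and all the bodies below are assumptions for illustration:

```julia
# Hypothetical sketch of OptMarker / set_optmarker / WithOptMarker.
# Mutable so each predeclared marker is a distinct object.
mutable struct OptMarker end

# Rules registered against markers; an unregistered marker falls back
# to a default rule.
const MARKER_RULES = IdDict{OptMarker,Any}()
set_rule!(om::OptMarker, rule) = (MARKER_RULES[om] = rule)
rule_for(om::OptMarker, default) = get(MARKER_RULES, om, default)

# Link `x` (an array, or recursively a whole layer) to a marker.
const MARKED = IdDict{Any,OptMarker}()
set_optmarker(x, om::OptMarker) = (MARKED[x] = om; x)

# Layer-wrapping form: marks the layer at construction, then forwards
# calls to it unchanged.
struct WithOptMarker{L}
    layer::L
    om::OptMarker
    WithOptMarker(layer, om::OptMarker) =
        new{typeof(layer)}(set_optmarker(layer, om), om)
end
(w::WithOptMarker)(x) = w.layer(x)
```

With something along these lines, `set_rule!(om, rule)` would reconfigure the optimiser for everything marked by `om` without touching the model itself.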
(which has one less level of "currying").

So the core point of `set_optmarker` is to get the advantages of sticking the optimizer into the model itself, so that it knows what it is optimising, while still keeping the flexibility/orthogonality to reconfigure what the optimizers actually are, separately from editing the model. (Possibly a clearer API might be `set_optimizer`, where `OptMarker` is a special kind of optimizer that can be configured later.)

There may be some more elegance we can add, like: if a single optimizer is passed (i.e. the normal case), then that is the same as `set_optmarker(loss, opt)`.
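That convenience could be a one-method addition. In this sketch the promotion rule and the stub bodies are assumptions; only the `set_optmarker(loss, opt)` shape comes from the discussion:

```julia
# Hypothetical sketch of the "single optimizer" convenience.
mutable struct OptMarker
    rule::Any
end

# Stub of the marking function: records which marker governs `x`.
const MARKED = IdDict{Any,OptMarker}()
set_optmarker(x, om::OptMarker) = (MARKED[x] = om; x)

# Normal case: a bare rule is promoted to a fresh marker over the whole
# loss, so passing a single optimizer is literally set_optmarker(loss, opt).
set_optmarker(x, rule) = set_optmarker(x, OptMarker(rule))

loss() = 0.0
set_optmarker(loss, :adam)   # plain rule, promoted to a marker
```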