FluxML / Flux.jl

Configuring Multiple Optimizers for different parts of the network #724

Open oxinabox opened 5 years ago

oxinabox commented 5 years ago

@MikeInnes and I had a long discussion of this.

One might reasonably want to use a different optimizer for different parts of the network. Or equivalently, different optimizer parameters (e.g. learning rate) for different parts of the network. This comes up under various circumstances.

The API we came up with is an extension of how dropgrad is used. From an optimisation perspective, dropgrad(x) says: set the learning rate to zero for x (and for everything below x, if x isn't itself a network parameter).
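For context, a minimal illustration of that behaviour (assuming Zygote's dropgrad of that era): it is an identity on the forward pass, but the gradient treats its output as a constant.

using Zygote: gradient, dropgrad

# The first factor is wrapped in dropgrad, so only the second factor
# contributes to the gradient:
g = gradient(x -> sum(dropgrad(x) .* x), [1.0, 2.0, 3.0])[1]
# g == [1.0, 2.0, 3.0]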

The notion is to add a function set_optmarker(x, om) which will link x to a predeclared OptMarker() object, for which optimization rules can be defined. If no rule is defined for a marker, it falls back to a default.

set_optmarker would be applied to arrays, to control the optimizer that is used to adjust the values of that array; or to structs or closures (i.e. layers), in which case it is applied recursively, so for a dense layer it would apply to both the weights and the biases.

An example:

o1 = OptMarker()
o2 = OptMarker()

w = randn(200,200)
mdl = function(xs)
    set_optmarker(w, o1) 
    rightval = xs * w.^2

    leftmdl = Chain(Dense ....)
    set_optmarker(leftmdl, o2)
    leftval = leftmdl(xs)

    return rightval + leftval
end

The normal way to optimize (using step! from https://github.com/FluxML/Flux.jl/issues/666#issuecomment-471309222) would still work:

step!(Adam(0.01), xs_train, ys_train) do xs, ys_target
     ys_pred = mdl(xs)
     sum((ys_pred .- ys_target).^2)
end

But you would also have the option to define an optimizer to be used at each location.

step!((o1=>Adam(0.1), o2=>Adam(0.5)), xs_train, ys_train) do xs, ys_target
     ys_pred = mdl(xs)
     sum((ys_pred .- ys_target).^2)
end
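How step! would match each parameter's marker to a rule is left open here; one possible sketch (resolve_rule and the tuple-of-pairs form are assumptions, not a settled API):

# Hypothetical helper: find the rule bound to a parameter's marker,
# falling back to the default rule when no pair mentions it.
function resolve_rule(marker, rules, default)
    for (m, rule) in rules
        m === marker && return rule
    end
    return default
end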

set_optmarker would also come in a layer-wrapping form. Something like:

os = ntuple(_ -> OptMarker(), 3)
mdl = Chain(
    Dense(20,20, relu) |> WithOptMarker(os[1]),
    Dense(20,20, relu) |> WithOptMarker(os[2]),
    Dense(20,20, relu) |> WithOptMarker(os[3]),
)
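With those markers in place, each layer could then be given its own rule at the call site, reusing the marker=>rule form of step! sketched above (the rates here are arbitrary):

step!((os[1]=>Adam(0.1), os[2]=>Adam(0.01), os[3]=>Adam(0.001)), xs_train, ys_train) do xs, ys_target
     ys_pred = mdl(xs)
     sum((ys_pred .- ys_target).^2)
end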

where WithOptMarker is defined as

function WithOptMarker(om)
    function(layer)
        function(x)
            set_optmarker(layer, om)(x)
        end
    end
end

For comparison, the equivalent for dropgrad would be Freeze (which has one less level of "currying"):

function Freeze(layer)
    function(x)
        dropgrad(layer)(x)
    end
end
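Usage would mirror the WithOptMarker chain above, e.g. (sketch):

mdl = Chain(
    Dense(20, 20, relu),
    Freeze(Dense(20, 20, relu)),  # this layer's parameters receive no gradient
    Dense(20, 20, relu),
)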

So the core point of set_optmarker is to get the advantages of sticking the optimizer into the model itself, so that it knows what it is optimising, but still keep the flexibility/orthogonality to reconfigure what the optimizers actually are separately from editing the model. (Though possibly a clearer API would be set_optimizer, with OptMarker being a special kind of optimizer that can be configured later.)

There may be some more elegance we can add, e.g. if a single optimizer is passed (i.e. the normal case) then that is the same as set_optmarker(loss, opt).

MikeInnes commented 5 years ago

I was initially uneasy about putting what feels like meta-information in the forward pass / model structure, but it's really no different to @show/@showgrad, dropgrad, Freeze, gradient clipping etc.; it's the right way to do this and is consistent with every other way we have of manipulating model training.

The main thing this doesn't seem to address is how we'll deal with optimiser state. For this I'm pretty sure we need something layer-like. I'm imagining writing this as

model = Optimiser(Dense(10, 5), ADAM(0.1))

And you'd be able to do model(x) as usual, but it'd obviously behave differently when training. This works nicely with our current update(rule, model, dmodel, state) design: if the model is an Optimiser we just ignore the current rule and switch.
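A minimal sketch of that wrapper, following the update(rule, model, dmodel, state) signature above (the struct and methods here are assumptions for illustration, not existing Flux code):

# Hypothetical sketch: a layer that carries its own rule.
struct Optimiser{L,R}
    layer::L
    rule::R
end

# The forward pass just delegates to the wrapped layer.
(o::Optimiser)(x) = o.layer(x)

# In the update, an Optimiser node ignores the rule passed down from
# above and recurses with its own rule instead (dmodel is assumed to be
# a Zygote-style NamedTuple gradient with a matching `layer` field).
update(rule, o::Optimiser, dmodel, state) =
    update(o.rule, o.layer, dmodel.layer, state)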

At the outermost layer you'd call step!(Optimiser(ADAM(0.01)), loss) and it would internally work with Optimiser(loss, ADAM(0.01)), so there's no difference between the "global" and "local" optimisers.

@oxinabox wanted the extra indirection of OptMarker to be able to still declaratively modify optimisers. I think you can get the benefits of that without the indirection, but would be happy to look at cases that don't appear possible/easy with Optimiser.

oxinabox commented 5 years ago

I also was initially uneasy about this, since it feels a lot like TensorFlow's stitching of the optimizer into the graph that defines the network. But then I was thinking about dropgrad. I guess dropout is the same kind of thing too: it is a convenience that is more to do with training than with the real model, as you don't use dropout after training is done.

It might indeed be that you can get it without the indirection.