FluxML / Flux.jl

Relax! Flux is the ML library that doesn't make you tensor
https://fluxml.ai/

Compatibility with ProtoStruct.jl, and LayerFactory ideas for custom layers #2107

Closed MilesCranmer closed 1 year ago

MilesCranmer commented 2 years ago

There's this really nice package ProtoStruct.jl that lets you create structs which can be revised. I think this is extremely useful for developing custom models in Flux.jl using Revise.jl, since otherwise I would need to restart every time I want to add a new property in my model.

Essentially the way it works is to transform:

@proto struct MyLayer
    chain1::Chain
    chain2::Chain
end

into (regardless of the properties)

struct MyLayer{NT<:NamedTuple}
    properties::NT
end

and, inside the macro, set up constructors based on your current defined properties.
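
Roughly, the definitions the macro generates would look something like this (a sketch for the MyLayer example above, not ProtoStructs' actual code):

# Hypothetical sketch of what @proto generates for MyLayer:
MyLayer(chain1, chain2) = MyLayer((; chain1, chain2))   # outer constructor from the current fields
Base.getproperty(m::MyLayer, s::Symbol) = getproperty(getfield(m, :properties), s)
Base.propertynames(m::MyLayer) = propertynames(getfield(m, :properties))

Re-running @proto with a different set of fields only redefines these methods, so the struct itself never has to be replaced - which is what makes it Revise-friendly.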

However, right now it doesn't work with Flux.jl. When I try to get the parameters from a model, I see the error: NamedTuple has no field properties. Here is a MWE:

# Desired API for FluxML
using Flux
using Flux: params, @functor
using ProtoStructs

@proto struct ResidualDense
    w1::Dense
    w2::Dense
    act::Function
end

"""Residual layer."""
function (r::ResidualDense)(x)
    dx = r.w2(r.act(r.w1(x)))
    return r.act(dx + x)
end

@functor ResidualDense

function ResidualDense(in, out; hidden=128, act=relu)
    ResidualDense(Dense(in, hidden), Dense(hidden, out), act)
end

# Chain of linear layers:
mlp = Chain(
    Dense(5 => 128),
    ResidualDense(128, 128),
    ResidualDense(128, 128),
    Dense(128 => 1),
);

p = params(mlp);  # Fails

and here is the error:

ERROR: type NamedTuple has no field properties
Stacktrace:
  [1] getproperty
    @ ./Base.jl:38 [inlined]
  [2] getproperty(o::ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, s::Symbol)
    @ Main ~/.julia/packages/ProtoStructs/4sIVY/src/ProtoStruct.jl:134
  [3] functor(#unused#::Type{ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}}, x::ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}})
    @ Main ~/.julia/packages/Functors/V2McK/src/functor.jl:19
  [4] functor(x::ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}})
    @ Functors ~/.julia/packages/Functors/V2McK/src/functor.jl:3
  [5] trainable(x::ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}})
    @ Optimisers ~/.julia/packages/Optimisers/GKFy2/src/interface.jl:153
  [6] params!
    @ ~/.julia/packages/Flux/nJ0IB/src/functor.jl:46 [inlined]
  [7] params!(p::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, x::Tuple{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, seen::Base.IdSet{Any}) (repeats 3 times)
    @ Flux ~/.julia/packages/Flux/nJ0IB/src/functor.jl:47
  [8] params!
    @ ~/.julia/packages/Flux/nJ0IB/src/functor.jl:40 [inlined]
  [9] params(m::Chain{Tuple{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, ResidualDense{NamedTuple{(:w1, :w2, :act), Tuple{Dense, Dense, Function}}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}})
    @ Flux ~/.julia/packages/Flux/nJ0IB/src/functor.jl:87
 [10] top-level scope
    @ ~/desired_julia_api/nice_api.jl:37

How hard would it be to make this compatible? I think it would be extremely useful to be able to quickly revise model definitions!


(Sorry for the spam today, by the way)

mcabbott commented 2 years ago

This is because:

julia> fieldnames(ResidualDense)
(:properties,)

julia> propertynames(ResidualDense)
(:var, :body)

and Functors isn't careful: it checks fieldnames but calls getproperty:

https://github.com/FluxML/Functors.jl/blob/v0.3.0/src/functor.jl#L11-L16

One fix is:

https://github.com/FluxML/Flux.jl/pull/1932/files#diff-34ea0dd342eeb0864015f5ab93bd9db67f6f46ec46d4bf8dbebe9b8c1a433b04R131-R132

Another fix would be for ProtoStructs to make getproperty work for the extra (internal) name. It could have a weird, possibly generated, name to avoid clashes.
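
Concretely, that second fix could look something like this (a sketch, reusing the MyLayer example and the visible :properties name rather than a generated one):

function Base.getproperty(m::MyLayer, s::Symbol)
    s === :properties && return getfield(m, :properties)   # answer for the internal field too
    return getproperty(getfield(m, :properties), s)         # user-visible names
end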

MilesCranmer commented 2 years ago

Nice! Thanks for figuring this out.

Aside: I wonder if something like @proto should be used by default for Flux.jl custom layers? It seems like such a common use-case to modify structs when you are developing neural nets. I suppose it wouldn't hurt performance either.

Here's a terrible idea I had, just to see if it spurs any useful ideas for others:

@layer struct ResBlock
    w1::Dense
    act::Function
    forward = (self, x, y) -> begin
        self.act(self.w1(x))
    end
end

where @layer would basically do both what @proto and @functor (as well as @kwdef, for the forward) currently do. The forward property of ResBlock here would let you embed the forward function actually inside the struct declaration. I guess it's kind of against Julia style, but it seems intuitive for quickly building deep neural nets.

So, from this, the macro would generate the following code:

struct ResBlock{NT<:NamedTuple}
    properties::NT
end

function getproperty
    ...
end
# etc., etc.

function (self::ResBlock)(x, y)
    self.forward(self, x, y)
end
MilesCranmer commented 2 years ago

(Maybe I am thinking of trying this out: https://github.com/Suzhou-Tongyuan/ObjectOriented.jl - it's definitely useful for DL-type models)

darsnack commented 2 years ago

I don't know if we want that as default @layer, because eventually you want to remove the @proto specification (this is explicitly stated even in the ProtoStructs.jl readme).

Supporting ProtoStruct layers is good, because sometimes writing your own layer is unavoidable. But Flux has some features that mean defining a separate type (class) for every sub-layer is not the norm. For example, Metalhead.jl, the vision model library, can build almost all types of residual networks without defining a ResBlock type. This is because of two things in Flux that most frameworks lack:

(Btw this is not to shoot anything down; if what we have doesn't really do what you want, then we want to know. Love the enthusiasm!)

ToucheSir commented 2 years ago

To build on Kyle's point: Flux's philosophy towards layer types is very much "come as you are". We'd like to make it easier to define/register existing structs as layers and to do so without depending on Flux itself. So while we should absolutely try to support libraries like ProtoStruct and ObjectOriented.jl if we can, we also want to keep the barrier of entry for working with the module system as low as possible (if you know how to define a callable struct, you can make a layer).
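
For reference, the minimal version of that today is ordinary Julia plus one macro call (a rough sketch):

using Flux
using Flux: @functor

struct Scale{T}            # any struct; nothing Flux-specific about it
    w::T
end
(s::Scale)(x) = s.w .* x   # make it callable: this is the forward pass
@functor Scale             # tell Flux/Functors where the parameters live

layer = Scale(rand(Float32, 4))
layer(ones(Float32, 4))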

MilesCranmer commented 2 years ago

Thanks for all the comments on my admittedly half-thought-out ideas! Did not know about Parallel - nice!

I don't know if we want that as default @layer, because eventually you want to remove the @proto specification (this is explicitly stated even in the ProtoStructs.jl readme).

Good point, agreed!

I wonder if you would want to mention Revise.jl+ProtoStructs.jl on the custom layer page (or even @reexport some of the key functionality, so you could do Flux.revise() at the top of the REPL to turn it on), since it seems almost required when developing neural nets; otherwise the startup time really hurts when working on complex networks.

For a lot of users, my sense is that Flux.jl may be the very first Julia package they try out.

We also want to keep the barrier of entry for working with the module system as low as possible (if you know how to define a callable struct, you can make a layer).

The compatibility makes a lot of sense. I am just trying to think if there's any way to simplify the current custom-layer declaration method for new users. Right now you need to write struct MyStruct, define function (m::MyStruct)(...), call @functor, and then also construct the components of the layer separately. This feels unwieldy.

Not only is it four separate calls, it's four different types of calls. I need to create (1) a struct, then (2) a method, (3) call a macro, then separately (4) construct the layer's components. And I need to do that for every single custom layer. In Haiku and PyTorch, you would only create a (1) class and (2) method – sometimes you can even create a custom layer with just a method. Using four different ideas to make a simple NN layer just seems a bit heavy, and the self methods in Julia (like function (m::MyType)(x)) are admittedly ugly and seem more appropriate for package developers than end users. That functionality always seemed more suited to designing convenience methods in a package than to working on user objects.

Even something like

struct MyLayer
    x::Dense
    y::Dense
end

@layermethod (m::MyLayer)(z) -> relu(m.x(z) + m.y(z))

might make for a slightly cleaner convenience API. Though the standard methods would of course also be available.

Thoughts? I am aware I could simply be too comfortable with the Python DL ecosystem, though, and this isn't Julia-esque enough. No worries if that is the case.

I think my dream layer creation method would really be something like

@layerfactory function my_layer(n_in, n_out)
    w1 = Dense(n_in, 128)
    w2 = Dense(128, n_out)
    return (x) -> begin
        y = relu(w1(x)) + x
        w2(y)
    end
end

which would let you construct the layer struct, the layer's components, and the forward pass, all in one go. (But that would obviously be tricky to implement).

Best, Miles

mcabbott commented 2 years ago

create (1) a struct, then (2) a method, (3) call a macro, then separately (4) construct the layer's components.

What's nice about 1,2,4 is that there is nothing Flux-specific about them. They are completely ordinary Julia code.

Making a special DSL isn't impossible, but it's one more thing you have to learn, and it will have limitations. This is a little bit like the train! story, where saving a few lines (compared to writing out the ordinary Julia code of the for loop) comes at the cost of a weird API to memorise, and limitations so that you later have to know the other way too.

Most things for which it's worth defining a new layer will want something extra. If I'm reading correctly the example here is an easy combination of existing layers:

Chain(SkipConnection(Dense(n_in => 128, relu), +), Dense(128 => n_out))

Some of us would like to remove @functor and make it just recurse by default. Then perhaps @layer would be an optional way to add fancy printing (or to customise which bits are trainable) but not needed for the simplest layers.

I agree we should mention Revise.jl & maybe ProtoStructs.jl. On the ecosystem page for sure. Maybe ProtoStructs.jl ought to be on the advanced layer building page too? (Which could use some cleaning up.)

MilesCranmer commented 2 years ago

Thanks, I see your point. I just wonder if there's a way to still use ordinary Julia code but require less manual labor to build custom layers, at the expense of being slightly less generic. Recursing by default would definitely be an improvement!

If I'm reading correctly the example here is an easy combination of existing layers:

This was just a MWE; in practice there would probably not be an existing layer for my applications.

In some ways, layers like Chain/SkipConnection/Dense already form a DSL. I guess what I am looking for is a simpler way to extend this DSL to new layers so I can quickly prototype models for my research. With certain assumptions about what would go into the layer (e.g., similar assumptions to Torch/Flax methods), I think a custom layer factory could be written in a condensed form. Currently the way to make custom layers is 100% generic, and as a result is a bit more code - maybe there could also be a partially-generic method.

MilesCranmer commented 2 years ago

Here's one idea for a layer factory:

struct LayerFactory{F<:Function,NT<:NamedTuple}
    forward::F
    layers::NT
end

LayerFactory(f; layers...) = LayerFactory(f, NamedTuple(layers))

function (f::LayerFactory)(args...)
    return f.forward(f.layers, args...)
end

@functor LayerFactory

This makes it super easy to construct custom layers. Watch this:

my_layer = LayerFactory(; w1=Dense(5, 128), act=relu) do self, x
    self.act(self.w1(x))
end

That's literally all you need! I can construct custom layers in one line, without even changing the underlying structure of Flux.jl. And it works for training and everything.
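
A quick sanity check of that claim, continuing from the my_layer definition just above (a sketch using the same implicit-params API as the MWE):

using Flux

x = rand(Float32, 5, 16)
ps = params(my_layer)                            # @functor lets params find w1's weights
gs = gradient(() -> sum(abs2, my_layer(x)), ps)
Flux.Optimise.update!(Descent(0.01), ps, gs)     # one SGD step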

What do you think about this?

mcabbott commented 2 years ago

This should work fine. What's a nice example of a nontrivial use?

MilesCranmer commented 2 years ago

The use-case I had in mind is graph networks, where you have a set of (nodes, edges, globals) that you need to do scatter operations on - it seems tricky to get that working with a Chain, but it should be doable to set it up with custom layers.
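
For context, the scatter part is roughly this kind of aggregation (a sketch using NNlib.scatter, with made-up sizes):

using NNlib

edge_feats = rand(Float32, 8, 5)               # 8 features per edge, 5 edges
dst = [1, 2, 2, 3, 1]                          # destination node of each edge
node_agg = NNlib.scatter(+, edge_feats, dst)   # 8Γ—3: edge features summed per node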

I am really happy about this LayerFactory struct. I'm actually surprised that it just works with this package. You can even change the number of fields in the same runtime, and it still works! Would you be willing to include it in Flux.jl as a simple way to construct custom layers?

e.g., here's another example:

model = LayerFactory(;
    w1=Dense(1, 128), w2=Dense(128, 128), w3=Dense(128, 1), act=relu
) do self, x
    x = self.act(self.w1(x))
    x = self.act(self.w2(x))
    self.w3(x)
end

p = params(model)  # works!
MilesCranmer commented 2 years ago

Here's a simple implementation of a graph network in PyTorch: https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.meta.MetaLayer

MilesCranmer commented 2 years ago

It even works for compositions of LayerFactory!

function MLP(n_in, n_out, nlayers)
    LayerFactory(;
        w1=Dense(n_in, 128), w2=[Dense(128, 128) for i=1:nlayers], w3=Dense(128, n_out), act=relu
    ) do self, x
        embed = self.act(self.w1(x))
        for w in self.w2
            embed = self.act(w(embed))
        end
        self.w3(embed)
    end
end

model = LayerFactory(; mlp1=MLP(1, 128, 2), mlp2=MLP(128, 1, 3)) do self, x
    self.mlp2(self.mlp1(x))
end
darsnack commented 2 years ago

I am not super familiar with GNNs, but you might want to check out GraphNeuralNetworks.jl to see how they handle working with Flux. They do seem to have a custom GNNChain.

Okay, I think I understand what you are saying. If you have a sufficiently complex forward function that involves sub-layers, then writing it from scratch with "base" Julia + Flux is a bunch of text. As would be the case with "base" PyTorch or Jax, but those libraries have utilities built on top like your LayerFactory. So is it right that you are looking for the same in Flux?

While I have no problem with LayerFactory since it is a nice convenience utility, I want to note that if we decide to auto-@functor structs, then LayerFactory comes for free from plain Julia:

mlp(n_in, n_out, nlayers) = let w1 = Dense(n_in, 128), w2 = [Dense(128, 128) for i in 1:nlayers], w3 = Dense(128, n_out)
    return function(x)
        act = relu

        embed = act(w1(x))
        for w in w2
            embed = act(w(embed))
        end
        w3(embed)
    end
end

model = let mlp1 = mlp(1, 128, 2), mlp2 = mlp(128, 1, 3)
    x -> mlp2(mlp1(x))
end

p = params(model) # works too!

Below is just for your reference.

Looking at the link you shared, this is what I would write in Flux:

# this is one way avoiding structs completely
EdgeModel(edge_mlp = Chain(...)) = Chain(
    (src, dest, edge_attr, u, batch) -> vcat(src, dest, edge_attr, u[batch]),
    edge_mlp
)

# admittedly, structs seems nice here
Base.@kwdef struct NodeModel{T, S}
    node_mlp_1::T = Chain(...)
    node_mlp_2::S = Chain(...)
end

@functor NodeModel

function (m::NodeModel)((x, edge_index, edge_attr, u, batch))
    row, col = edge_index
    out = vcat(x[row], edge_attr)
    out = m.node_mlp_1(out)
     # not sure what this is doing but we have a NNlib.scatter
    out = scatter_mean(out, col, ...)
    out = vcat(x, out, u[batch])
    return m.node_mlp_2(out)
end

And so on. Modulo the @functor issue, I don't see how defining a class and forward function is shorter than what's above. It seems like just an extra end keyword and a separation between the struct definition and the forward definition. The models I defined above are ready to be passed into a Chain or Parallel (assuming that's what MetaLayer is).

darsnack commented 2 years ago

Or another way of putting it: Chain and friends are sort of a DSL for passing arguments between sub-layers. You have a need to define your own argument-passing container layer, but you don't want to write all the extra stuff that goes along with a base layer like Conv. The Haiku link you shared shows a mechanism for constructing "layers" that have no type, but they do have fields, a forward pass, and a variable they are bound to. These three things in Julia are exactly what make an anonymous function! The only thing preventing this from just workingβ„’ in Flux is that @functor is opt-in.

MilesCranmer commented 2 years ago

While I have no problem with LayerFactory since it is a nice convenience utility, I want to note that if we decide to auto-@functor structs, then LayerFactory comes for free from plain Julia:

I don't understand how your example works. In your example, model is a function, rather than an object; so it wouldn't remember its parameters - they would be initialized each time. Whereas LayerFactory is actually an object. Unless you meant to actually declare the w1 outside of the function, and declare them as globals?

darsnack commented 2 years ago

Oops brain fart on my part, but see the correction using a let block

darsnack commented 2 years ago

Okay but note that the anonymous function returned by the let blocks is not "a function rather than an object" in Julia because anonymous functions are implemented under the hood as structs. The variables they close over are the fields of the struct. In essence, the let + closure in Julia is a Base implementation of your LayerFactory.
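
You can see this directly in the REPL:

julia> f = let w = rand(3)
           x -> w .* x
       end;

julia> fieldnames(typeof(f))   # the captured variable is a field of the closure's type
(:w,)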

ToucheSir commented 2 years ago

LayerFactory need not require a custom type either:

LayerFactory(f; layers...) = Base.Fix1(f, NamedTuple(layers))

We don't currently @functor Base.Fix1 in Functors, but that's only because we haven't gotten around to it. Given the triviality (and generality outside of ML) of this function, I don't think it has to live inside of Flux. Much like we do with Split, we could add a "cookbook" entry on how to define your own version of this in the docs.
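
Usage of that one-liner would be unchanged from the struct version (a sketch; as noted, params won't recurse into Base.Fix1 until Functors opts it in):

using Flux

LayerFactory(f; layers...) = Base.Fix1(f, NamedTuple(layers))   # the definition above

layer = LayerFactory(; w=Dense(5 => 3), act=relu) do self, x
    self.act(self.w(x))
end
layer(rand(Float32, 5))   # 3-element output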

MilesCranmer commented 2 years ago

I see, thanks. The let syntax is new to me! It would be great when that approach actually works.

It seems like just an extra end keyword and a separation between the struct definition and the forward definition.

Don't forget the model instantiation! Compare these two:

model = LayerFactory(; w1=Dense(5, 128), w2=Dense(128, 1), act=relu) do self, x
    x = self.act(self.w1(x))
    self.w2(x)
end

(or the let method!) versus:

struct MyLayer
    w1::Dense
    w2::Dense
    act::Function
end

@functor MyLayer

function (self::MyLayer)(x)
    x = self.act(self.w1(x))
    self.w2(x)
end

model = MyLayer(Dense(5, 128), Dense(128, 1), relu)

The latter example would discourage me from using it. Note also that the second example will break if I use Revise.jl and change the inputs, whereas let and LayerFactory will just work.

Given the triviality (and generality outside of ML) of this function

Up to you but I don't see a problem with including this in the code alongside Chain... I would consider these to be core pieces of functionality for any DL framework to quickly compose custom models - much more so than Split which seems niche. Making the user implement these core pieces of functionality themselves is just another barrier to ease-of-use.

MilesCranmer commented 2 years ago

Btw, in the let example, can you access subcomponents of a model, like w1? Or are all the pieces hidden inside the closure? If not I think I might prefer having a NamedTuple with @functor pre-declared.

mcabbott commented 2 years ago

The LayerFactory thing seems cute. Maybe see how it goes for building some models in real life and figure out what the unexpected warts are?

One refinement which could be added is a macro which would put the self in for you, and perhaps arrange to print it in some human-readable way.

can you access subcomponents of a model, like w1?

Yes. With Kyle's code:

julia> mlp(1,2,3)
#12 (generic function with 1 method)

julia> ans.w3.bias
2-element Vector{Float32}:
 0.0
 0.0
ToucheSir commented 2 years ago

I would consider these to be core pieces of functionality for any DL framework to quickly compose custom models

Does this mean Python frameworks don't even meet the bar then? :P

My impression is that Flux is already offering more layer helpers like Parallel or Maxout than most other frameworks (c.f. PyTorch/TF/Flax, which would make you define a custom class for both). We also want to avoid scenarios where someone blows up their code's (or worse, package's) import time by +12s unnecessarily because they decided to depend on Flux for a one-liner function.
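
For anyone following along, here is roughly what those two helpers do (a small sketch):

using Flux

p = Parallel(vcat, Dense(10 => 4, relu), Dense(10 => 2, tanh))
size(p(rand(Float32, 10, 8)))    # (6, 8): each branch sees the input, outputs are vcat'd

m = Maxout(() -> Dense(10 => 4), 3)
size(m(rand(Float32, 10, 8)))    # (4, 8): elementwise max over the 3 Dense branches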

MilesCranmer commented 2 years ago

The LayerFactory thing seems cute. Maybe see how it goes for building some models in real life and figure out what the unexpected warts are?

Will do! πŸ‘ (when I get a chance...)

Does this mean Python frameworks don't even meet the bar then? :P

Not quite... Say what you will about Python, but the DL frameworks are very polished. Here's how you would do a simple two-layer MLP in

Haiku:

@hk.transform
def forward(x):
  w1 = hk.Linear(100)
  w2 = hk.Linear(10)
  return w2(jax.nn.relu(w1(x)))

params = forward.init(rng, x)

PyTorch:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(10, 100)
        self.w2 = nn.Linear(100, 1)
    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

model = Net()

My impression is that Flux is already offering more layer helpers like Parallel or Maxout

PyTorch actively discourages users from using nn.Sequential for complex operations (equivalent of Chain), since it isn't obvious what's actually going on. i.e., Sequential operations should be sequential. Users are encouraged to write their own forward function (equivalent of a custom layer) for anything more than super basic sequential patterns. I don't think it's a bad idea myself... I may choose to use vcat explicitly in a forward pass rather than use a parallel block just because I'm more used to that pattern.

darsnack commented 2 years ago

Don't forget the model instantiation! ...

I agree, the factory is much shorter even keeping aside the instantiation. But the factory isn't the default way to make layers in other frameworks either.

From your most recent example (in Julia):

Base.@kwdef struct Net{T, S}
    w1::T = Dense(10 => 100)
    w2::S = Dense(100 => 1)
end
(self::Net)(x) = self.w2(relu(self.w1(x)))
@functor Net # boo we also don't like this

model = Net()

What I was trying to figure out is if you wanted the default mechanism to change or a utility built on top. But I think we settled this Q! We're talking about a convenience method here.

PyTorch actively discourages users from using nn.Sequential for complex operations (equivalent of Chain), since it isn't obvious what's actually going on.

The problem here is that if I make a new model = Net(), I have no clue what Net does without reading the content of its forward. Even something called ResBlock could be non-obvious without reading the code for certain variants. In contrast, printing out a model built purely of Chain, Parallel, etc. has understandable execution from just the model architecture being printed in the REPL (you do need to know enough Flux to have seen these layers before though). You also get contextual information with named sub-layers in our containers. All of this is important to us, because code reuse is extremely common in Julia. We want people to feel confident instantiating unknown models in the REPL and using them without needing to go to the source definition. We don't like how people keep redefining code in Python frameworks.

This being said, I like the declarative nature and syntactic clarity of what you are proposing. I think the broader point here is that:

So, like Michael, I would be happy to include it...after some thought and maybe seeing if it has unforeseen limitations.

ToucheSir commented 2 years ago

Kyle beat me to it, but just to add this:

PyTorch actively discourages users from using nn.Sequential for complex operations (equivalent of Chain), since it isn't obvious what's actually going on. i.e., Sequential operations should be sequential. Users are encouraged to write their own forward function (equivalent of a custom layer) for anything more than super basic sequential patterns.

This is a good idea in some circumstances and a bad one in others. For example, torchvision models are full of Sequentials, because users want to be able to slice up and otherwise manipulate those models.

MilesCranmer commented 2 years ago

So, like Michael, I would be happy to include it...after some thought and maybe seeing if it has unforeseen limitations.

Sounds good!

I have no clue what Net does without reading the content of its forward. Even something called ResBlock could be non-obvious without reading the code for certain variants. In contrast, printing out a model built purely of Chain, Parallel, etc. has understandable execution from just the model architecture being printed in the REPL (you do need to know enough Flux to have seen these layers before though).

I'm not sure I completely understand. Is your goal to make all types of neural network models possible with Chain by itself? With the complexity of modern neural nets, this seems like it requires building an impossibly massive DSL. Why not just rely on the core Julia language through user-created custom layers, with small Chain blocks for common modules like MLPs and CNNs (similar to what other DL frameworks do)? In any case, I don't think it's humanly possible to understand the internals of a modern NN by reading a sequence of modules - you ultimately have to go through the forward function when it's not a sequential stack of modules.

But maybe users could always refactor their models into separate LayerFactory pieces with helpful names for each (per @mcabbott's suggestion of adding a show method), and maybe that could help with interpretation.

We don't like how people keep redefining code in Python frameworks.

You seem to be bringing up Julia v Python... I want to be clear I am really not trying to go there (I'm on the Julia side, for the record; I've just had experience with both!). I'm purely talking about the syntax itself.

If you consider PyTorch by itself as a software and ecosystem, there is an obscene amount of code re-use. I can take someone's custom torch model with an extremely complex forward pass, and put it inside my custom model, and it works seamlessly. Again, this is just PyTorch -> Pytorch (the same DSL!); I'm not talking about the existing incompatibility between different Python frameworks! But my point is that I don't think there's intrinsic problems with users creating custom layers and sharing them. It's just like sharing any other code. If it's the expected input/output types, and well-documented, it should be okay.

Flux is community maintained with a very distributed process, making a concise, manageable codebase valuable (more code = more maintenance burden)

Making it easier to construct custom layers seems precisely aligned with your goals, no? Then users can go build these layers themselves, rather than you having to worry about building a massive library of modules. And you only need to maintain the most commonly-re-used layers.

@ToucheSir For example, torchvision models are full of Sequentials, because users want to be able to slice up and otherwise manipulate those models.

Right - you might use Sequential for common sequential pieces (like a stack of convolutions or an MLP), and then write a forward function to bring them all together in a complex way. The default printing would print each sequential piece inside a module, and perhaps a user could overload the printing to print each Sequential in a hierarchical way next to particular model parameters.

ToucheSir commented 2 years ago

Why not just rely on the core Julia language through user-created custom layers, with small Chain blocks for common modules like MLPs and CNNs (similar to what other DL frameworks do)? In any case, I don't think it's humanly possible to understand the internals of a modern NN by reading a sequence of modules - you ultimately have to go through the forward function when it's not a sequential stack of modules.

I think we're all on the same page here, just that the devil is in the details :slightly_smiling_face:. Looking at the original PR which ended up spawning Parallel, one can see that Flux did converge on something like this philosophy: encourage types for non-trivial composite layers in general, but also provide slightly more powerful building blocks for the most common use cases. I'd be remiss to not note that all participants on that thread were/are also active Python ML library users, so there is certainly some diversity of opinion here!

This ties into the code reuse discussion. What I think Kyle is trying to get at is that while a framework shouldn't try to create a DSL for every possible use case, it should try to provide affordances so that users aren't unnecessarily having to roll their own code for trivial features. I can't count how many times I've seen research PyTorch code which defines a number of layer types just so that they can have a skip connection. You can tell because those layers are often awkwardly named - they kind of have to be, because they really only represent some intermediate chunk of a larger model which wouldn't otherwise be considered standalone (second hardest problem in computer science, etc).

Right - you might use Sequential for common sequential pieces (like a stack of convolutions or an MLP), and then write a forward function to bring them all together in a complex way. The default printing would print each sequential piece inside a module, and perhaps a user could overload the printing to print each Sequential in a hierarchical way next to particular model parameters.

Again, I think we are of roughly the same mind about this. There's a reason Metalhead.jl, torchvision, timm, etc. use this pattern. It's also one reason we're hesitant to loudly advertise layer-building functionality which always returns a (semi-)anonymous type: you lose that important semantic information from the layer name that you get by using a named class in Python or struct in Julia.

darsnack commented 2 years ago

Let me start by saying I don't fundamentally disagree with the feature proposal. I'm just trying to shine light on the design decisions we made in FluxML. Hopefully, this is useful and not unwanted.

You seem to be bringing up Julia v Python... I want to be clear I am really not trying to go there

We have a slight miscommunication, which is my fault for using "Python" as a catch-all when I really meant "X where X is one of TF/Jax/PyTorch" (i.e. considering each framework independently). I certainly wasn't referring to NumPy/Torch/SciPy/etc...I also don't want to go there, and it seems irrelevant to our discussion. In fact, for what we're discussing (syntax for building complex models), the host language (Julia or Python) seems irrelevant.

The point of bringing up Python-based frameworks at all is because I agree with you---they are great DL frameworks. There's a lot to learn from them, and so we can make useful comparisons to understand what we do wrong/right.

If you consider PyTorch by itself as a software and ecosystem, there is an obscene amount of code re-use. I can take someone's custom torch model with an extremely complex forward pass, and put it inside my custom model, and it works seamlessly. Again, this is just PyTorch -> Pytorch (the same DSL!)

This isn't exactly the type of re-use I am referring to, and I don't think the various options we are discussing would limit this kind of re-use.

Let's take a concrete example from torchvision. For ResNet, they define the residual blocks, then in Inception they define the inception modules. In Metalhead (the equivalent Flux library), both ResNet and Inception just use Parallel. If you look at the papers ([1] Fig. 2 and [2] Figs. 4, 5, ...), these custom layers look remarkably similar in structure. They differ in terms of what's along each branch or how many branches there are, but the overall "container layer" still does the same operation. Defining this forward pass operation over and over seems like poor code re-use.

Now, PyTorch folks could absolutely have written torchvision in a way so that this residual branch type layer has a single forward definition that gets re-used...but then they would end up writing Parallel.
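
To make that concrete, an inception-style block in Flux is just a Parallel container (a sketch with made-up channel sizes, not Metalhead's actual code):

using Flux

inception_block = Parallel(
    (ys...) -> cat(ys...; dims=3),    # concatenate branch outputs along the channel dimension
    Conv((1, 1), 64 => 32, relu),
    Chain(Conv((1, 1), 64 => 48, relu), Conv((3, 3), 48 => 64, relu; pad=1)),
    Chain(MaxPool((3, 3); pad=1, stride=1), Conv((1, 1), 64 => 32, relu)),
)
inception_block(rand(Float32, 32, 32, 64, 1))   # 32Γ—32Γ—128Γ—1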

Is your goal to make all types of neural network models possible with Chain by itself? With the complexity of modern neural nets, this seems like it requires building an impossibly massive DSL.

Definitely not all types of models, for two reasons: (a) it's not possible, and (b) even if it were, it would make writing some models unnecessarily cumbersome.

But I will say you can get really far without writing a massive DSL. Layers fall into two categories:

  1. Primitive layers like Conv, Dense, Upsample, etc.
  2. Container layers that wrap other layers and call them in a specific way like Chain, Parallel, etc.

(1) is unavoidable in every framework unless you take an explicitly functional view and make users pass in the weights, state, etc. (2) is where the possible DSL size explosion could happen. But if you take a feedforward NN, then there is a limited set of structures you can see in the DAG---namely Chain and Parallel. Metalhead.jl is written in this way, and it covers vision models from AlexNet to ViTs. It does have custom layers not in Flux, but those are mostly (1) layers.

I don't think it's humanly possible to understand the internals of a modern NN by reading a sequence of modules - you ultimately have to go through the forward function when it's not a sequential stack of modules.

I don't know...GoogLeNet's diagram is a pretty complex network but I think you can understand the flow of arguments just by looking at the figure. Even something like CLIP.

Of course, DL isn't restricted to FF DAGs, nor should it be. And I get the feeling these are the kinds of models you work with. So then you need to define a custom (2). We absolutely want users to go ahead and do this whenever they feel like they should. Or maybe even for a simple CNN, you subjectively prefer to write out the forward pass. Go for it! If you do get to the point of writing a custom (2), then your layer factory makes the syntax really short. This is why I like it, and I am in favor of adding it.

Sometimes it is better to "just write the forward pass," and sometimes it is better to use existing layers + builders to create a complex model. Both are "first class" in Flux. I don't want to leave you with the impression that we want everyone to build everything using only Chain and Parallel...that would be crazy 😬


[1] ResNet: https://arxiv.org/pdf/1512.03385v1.pdf
[2] Inception: https://arxiv.org/pdf/1512.00567v3.pdf

darsnack commented 2 years ago

Oops as I was writing and editing my saga, Brian beat me to it by 40 minutes, but my browser didn't refresh :(.

mcabbott commented 2 years ago

Here is a macro version, which should let you write Dense as shown, and will print it out again like that.

"""
    @Magic(forward::Function; construct...)

Creates a layer by specifying some code to construct the layer, run immediately,
and (usually as a `do` block) a function for the forward pass.
You may think of `construct` as keywords, or better as a `let` block creating local variables.
Their names may be used within the body of the `forward` function.

    r = @Magic(w = rand(3)) do x
        w .* x
    end
    r([1, 1, 1])
    r([10, 10, 10])  # same random numbers

    d = @Magic(in=5, out=7, W=randn(out, in), b=zeros(out), act=relu) do x
        y = W * x
        act.(y .+ b)
    end
    d(ones(5, 10))  # 7Γ—10 Matrix

"""
macro Magic(fex, kwexs...)
    # check input
    Meta.isexpr(fex, :(->)) || error("expects a do block")
    isempty(kwexs) && error("expects keyword arguments")
    all(ex -> Meta.isexpr(ex, :kw), kwexs) || error("expects only keyword arguments")

    # make strings
    layer = "@Magic"
    setup = join(map(ex -> string(ex.args[1], " = ", ex.args[2]), kwexs), ", ")
    input = join(fex.args[1].args, ", ")
    block = string(Base.remove_linenums!(fex).args[2])

    # edit expressions
    vars = map(ex -> ex.args[1], kwexs)
    assigns = map(ex -> Expr(:(=), ex.args...), kwexs)
    @gensym self
    pushfirst!(fex.args[1].args, self)
    addprefix!(fex, self, vars)

    # assemble
    quote
        let
            $(assigns...)
            $MagicLayer($fex, ($layer, $setup, $input, $block); $(vars...))
        end
    end |> esc
end

function addprefix!(ex::Expr, self, vars)
    for i in 1:length(ex.args)
        if ex.args[i] in vars
            ex.args[i] = :($self.$(ex.args[i]))
        else
            addprefix!(ex.args[i], self, vars)
        end
    end
end
addprefix!(not_ex, self, vars) = nothing

struct MagicLayer{F,NT<:NamedTuple}
    fun::F
    strings::NTuple{4,String}
    variables::NT
end
MagicLayer(f::Function, str::Tuple; kw...) = MagicLayer(f, str, NamedTuple(kw))
(m::MagicLayer)(x...) = m.fun(m.variables, x...)
MagicLayer(args...) = error("MagicLayer is meant to be constructed by the macro")
Flux.@functor MagicLayer

function Base.show(io::IO, m::MagicLayer)
    layer, setup, input, block = m.strings
    print(io, layer, "(", setup, ") do ", input)
    print(io, block[6:end])
end
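
For example, the ResidualDense layer from the top of the thread could then be written as (a sketch, assuming the macro above behaves as intended):

res = @Magic(w1=Dense(128 => 128), w2=Dense(128 => 128), act=relu) do x
    dx = w2(act(w1(x)))
    act(dx + x)
end

model = Chain(Dense(5 => 128), res, Dense(128 => 1))
Flux.params(model)   # works, since MagicLayer is @functor'd
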
MilesCranmer commented 2 years ago

Thanks for sharing these answers, I completely agree and I think we are all on the same page! πŸ™‚

Here is macro version, which should let you write Dense as shown, and will print it out again like that.

This is AWESOME, nice job!! I am πŸ‘πŸ‘ (two thumbs up) for the support of this feature as a convenient custom layer constructor.

mcabbott commented 2 years ago

No doubt that has all sorts of bugs! But fun to write. Once you make a macro, it need not be tied to the LayerFactory keyword notation like this, of course.

And whether this is likely to create pretty code or monstrosities, I don't know yet.

MilesCranmer commented 2 years ago

Could this be added to Flux.jl, with β€œVery experimental.” stated in bold in the docstring? I can create a PR and add a couple tests and maybe a paragraph to the docs.

MilesCranmer commented 2 years ago

Let me know if I can add it, and I'll make a PR. I would love to have a feature like this - the @Magic macro instantly makes Flux.jl the most elegant framework in my view.

darsnack commented 2 years ago

We've created a Fluxperimental.jl package for this purpose. Once it is set up and made public, we can ping you for a PR there (which would be appreciated!).

MilesCranmer commented 2 years ago

Cool, sounds good to me!

mcabbott commented 1 year ago

Ok, https://github.com/FluxML/Fluxperimental.jl is live

darsnack commented 1 year ago

Closing in favor of https://github.com/FluxML/Fluxperimental.jl/discussions/2 for layer factory and https://github.com/FluxML/Functors.jl/issues/46 for ProtoStruct.jl issue.

MilesCranmer commented 1 year ago

For people finding this issue, the discussion above has now resulted in the PR here: https://github.com/FluxML/Fluxperimental.jl/pull/4