FluxML / Optimisers.jl

Optimisers.jl defines many standard optimisers and utilities for learning loops.
https://fluxml.ai/Optimisers.jl
MIT License

Restructure makes a copy #146

Open linusheck opened 1 year ago

linusheck commented 1 year ago

In some situations you have to restructure a lot when using Flux, for instance if you want to run your batches as separate solves in DiffEqFlux using an EnsembleProblem. You have to use something like a ComponentArray to pass the parameters through the solver and to let the adjoint methods do their work in differentiating the solve. But restructuring with a ComponentArray is unreasonably(?) slow in Flux. Switching to Lux eliminates those problems, but it seems like something that could be handled better in Flux or ComponentArrays.

Example code:

using Flux, Profile, PProf, Random, ComponentArrays
layer_size = 256
layer = Flux.Dense(layer_size => layer_size)
params_, re = Flux.destructure(layer)
params = ComponentArray(params_)
# Rebuild the layer from the flat parameter vector on every iteration,
# which is the pattern under discussion.
function eval(steps, input)
    vec = input
    for i in 1:steps
        vec = re(params)(vec)
    end
end
Profile.clear()
@profile eval(100000, rand(Float32, layer_size))
pprof(; web=true)

Here we spend about 10% of the time in sgemv matrix multiplication, another 10% in the rest of the Dense call, and about 75% in the Restructure. This gets worse as the networks get smaller. As far as I can read the flame graph, the restructure seems to spend a lot of time in the GC:

[flame graph screenshot]

Could there be a way to mitigate this specific problem, in particular when the same parameters are reused? I think this would make some example code a lot faster too.

I also opened a discourse about this because I'm not sure if it's an issue with Flux specifically: https://discourse.julialang.org/t/flux-restructure-for-componentarrays-jl-unreasonably-slow/97849/3

mcabbott commented 1 year ago

I think the story here is that Restructure makes a new copy of the model, which allocates, and this costs about 30μs:

julia> @btime $re($params);  # This is the reconstruction cost
  min 11.125 μs, mean 29.891 μs (24 allocations, 257.95 KiB)

julia> @btime copy($params);  # ... and it's mostly allocation, same mean:
  min 6.007 μs, mean 29.209 μs (2 allocations, 257.05 KiB)

Normally the allocation of the forward pass dwarfs this, but if you don't make batches and do this for every sample, then the model allocation dominates.

julia> input = rand(Float32, layer_size);  # As above, batch size of 1

julia> @btime $layer($input);  # ... hence alloc much smaller here, unusual.
  min 6.800 μs, mean 8.970 μs (2 allocations, 2.12 KiB)

julia> @btime $re($params)($input);  # Complete, as above
  min 20.458 μs, mean 51.587 μs (26 allocations, 260.08 KiB)

This isn't a regime anyone has got around to optimising. It would probably be easy to make Restructure either (1) copy into the existing model, or (2) create a new model out of views rather than copies.
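
For concreteness, a minimal sketch of what (2) could look like, using Functors.fmap and reshaped views. The helper name re_with_views is hypothetical, this is not the actual Restructure code, and it ignores the shared/tied-parameter bookkeeping that the real destructure does:

using Functors

# Hypothetical re_with_views: walk the model and replace each numeric array
# with a reshaped view into the flat vector, so nothing is copied.
# Assumes the traversal order matches the one used to flatten the model,
# and ignores tied parameters.
function re_with_views(model, flat::AbstractVector)
    offset = Ref(0)
    fmap(model; exclude = x -> x isa AbstractArray{<:Number}) do x
        n = length(x)
        y = reshape(view(flat, (offset[] + 1):(offset[] + n)), size(x))
        offset[] += n
        return y
    end
end

Downstream code would then see reshaped views rather than plain Arrays, which is the main open question with this route.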

Note that ComponentArrays is doing nothing here: params is effectively just the flat vector that destructure returned. I think you should think of ComponentArrays as another idea in the same direction as destructure, "instead" rather than "and". They differ in lots of ways, of course.
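
To illustrate the "instead" route, a sketch (not from the docs) that builds a ComponentArray straight from the layer's fields, with a hand-written apply standing in for Dense, so destructure/Restructure never enters the picture:

using Flux, ComponentArrays

layer = Flux.Dense(256 => 256)

# One flat buffer whose named fields are reshaped views of that buffer.
ps = ComponentArray(weight = copy(layer.weight), bias = copy(layer.bias))

# Hand-written stand-in for the Dense forward pass (identity activation).
apply(ps, x) = ps.weight * x .+ ps.bias

apply(ps, rand(Float32, 256))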

Note also that allocations in the gradient will typically be larger than in the forward pass. When last I checked, destructure does this in a more efficient way than ComponentArrays, which made a copy of the full parameter vector repeatedly. This will matter (for larger models) in exactly the regime you have selected.
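
A quick way to see that gradient-side cost in this example would be something like the following sketch, assuming Zygote and BenchmarkTools, with re, params and input as above:

using Zygote, BenchmarkTools

# The pullback has to assemble a flat gradient of length(params), so expect
# allocations well beyond the forward pass alone.
@btime Zygote.gradient(p -> sum($re(p)($input)), $params);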

linusheck commented 1 year ago

Thanks for your thorough reply! Good to know that ComponentArrays isn't doing anything. The offending line seems to be this one: https://github.com/FluxML/Optimisers.jl/blob/8a37946ea8638642d40eae210fd5c882e3f1cff9/src/destructure.jl#L103

I wonder if using ProjectTo a bit more cleverly might avoid the allocations and use views instead.

mcabbott commented 1 year ago

Changing that line to use a view would probably work fine. ProjectTo won't mind, but I'm not sure whether all operations downstream will be happy with a reshaped view (e.g. will CuArray dispatch still work?).

Maybe the bigger question is what the interface should be, e.g. whether this should be a keyword option.

Or whether it should be a separate destructure! which copies into the model. That may solve both questions.
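
Such a destructure! could be roughly the same fmap walk as the view sketch above, but copying into the model's existing arrays instead of returning views (again hypothetical, not an implementation, and ignoring tied parameters):

using Functors

# Hypothetical destructure!: write the flat vector back into the model's
# existing arrays in place, allocating no new parameter arrays.
function destructure!(model, flat::AbstractVector)
    offset = Ref(0)
    fmap(model; exclude = x -> x isa AbstractArray{<:Number}) do x
        n = length(x)
        copyto!(x, view(flat, (offset[] + 1):(offset[] + n)))
        offset[] += n
        return x
    end
    return model
end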

CarloLucibello commented 4 months ago

I think we should have a copyless alternative to restructure, with a possible name being ps, re = trainable_components(x), returning a ComponentArray ps and a reconstruction function re such that re(ps) == x.
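
Purely as a usage sketch of that proposal (the function does not exist yet; name and semantics only as described above):

# Proposed, not implemented: a copyless counterpart to destructure/restructure.
ps, re = trainable_components(model)   # ps isa ComponentArray, built without copying
re(ps) == model                        # reconstruction round-trips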