FluxML / FastAI.jl

Repository of best practices for deep learning in Julia, inspired by fastai
https://fluxml.ai/FastAI.jl
MIT License

Use PrecompileTools.jl #284

Open RomeoV opened 1 year ago

RomeoV commented 1 year ago

Motivation and description

Currently the startup time for using this package is quite long. For example, running the code snippet below takes about 80s on my machine, which is 99% overhead time (the two epochs are practically instant).

For comparison, a basic Flux model only takes about 6s after startup. Since Julia 1.9 and 1.10 can "cache away" much of this compile time in precompiled package images, I think we'd greatly benefit from integrating something like PrecompileTools.jl into the packages.

Possible Implementation

I saw there's already a workload.jl file (which basically just runs all the tests) that is used for sysimage creation. Perhaps we can do something similar for the PrecompileTools.jl directive; a rough sketch is below.
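Very roughly, and just to illustrate the PrecompileTools.jl mechanics (the actual workload would be worked out in the PR), the directive would look something like this inside the package:

```julia
module FastAI

# ... existing package code ...

using PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    # Setup that should run but not be cached goes here.
    @compile_workload begin
        # A tiny end-to-end workload (mocked data, no I/O) so the methods
        # it hits get compiled and cached into the package image.
    end
end

end # module
```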

I can try to get a PR started in the coming days.

Sample code

```julia
using FastAI, FastVision, Metalhead, Random
data, blocks = load(datarecipes()["mnist_png"])
idx = randperm(length(data[1]))[1:100]
data_ = (mapobs(data[1].f, data[1].data[idx]), mapobs(data[2].f, data[2].data[idx]))
task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data_, backbone=ResNet(18).layers[1], callbacks=[ToGPU()])
fitonecycle!(learner, 2)
exit()
```
lorenzoh commented 1 year ago

Sounds good! Have you tested the speedup already?

RomeoV commented 1 year ago

Some issues I'm running into:

For now I'm just trying to test what speedup we could hope for by making a separate "startup" package (as is suggested here) that loads all of FastVision, Metalhead, etc. and basically runs my code above as a precompile workload, but without GPU and with mocked data. I'll report what speedup that brings.
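Setting up such a startup package locally is roughly the following (the package name and layout are just illustrative):

```julia
# One-time setup of a local "startup" package that carries the precompile workload.
using Pkg
Pkg.generate("FastAIStartup")            # creates Project.toml and src/FastAIStartup.jl
Pkg.activate("FastAIStartup")
Pkg.add(["FastAI", "FastVision", "Metalhead", "Flux", "PrecompileTools"])

# Then put the workload into FastAIStartup/src/FastAIStartup.jl, and from the
# main environment do Pkg.develop(path="FastAIStartup") and `using FastAIStartup`.
```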

RomeoV commented 1 year ago

Hmm, this approach brings the TTF-epoch from 77s down to about 65s, which is a speedup for sure, but I was hoping for more. I will have to look a bit deeper at where the time is spent. It might be all GPU stuff, in which case we'll need to wait for the above-mentioned issue to conclude. There's also the possibility that on first execution cuDNN has to run a bunch of micro-benchmarks to determine some algorithm choices. I filed a WIP PR to cache that a while ago, but haven't looked at it in a while: https://github.com/JuliaGPU/CUDA.jl/pull/1948. If it turns out that the TTF-epoch is dominated by that, I'll push that a bit more.

RomeoV commented 1 year ago

Another update: I ran a training similar to the code above, but without any FastAI.jl/FluxTraining.jl, i.e. just Flux.jl and Metalhead.jl (see code below).

Using the precompile approach from above, the timings are 27s for CPU only and 55s for GPU.

In particular, 55s is only about 15% less than 65s. In other words, the 65s I measured above is dominated not by the FastAI infrastructure but by GPUCompiler and related GPU compilation. It might still be worth following through with this issue, or at least writing some instructions on how to make a startup package, but further improvements would have to come from the Flux infrastructure itself.

See code:

```julia
using Flux, Metalhead
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

# `device` is assumed to be defined (e.g. as `:cpu` or `:gpu`) before this runs.
device_ = eval(device)

labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(Float32, 32, 32, 3) for _ in 1:100], [rand(labels) for _ in 1:100])

model = ResNet(18; nclasses=10) |> device_
train_loader = Flux.DataLoader(data, batchsize=10, shuffle=true, collate=true)
opt = Flux.Optimise.Descent()
ps = Flux.params(model)
loss = Flux.Losses.logitcrossentropy

for epoch in 1:2
    for (x, y) in train_loader
        yb = onehotbatch(y, labels) |> device_
        model(x |> device_)
        grads = gradient(ps) do
            loss(model(x |> device_), yb)
        end
        Flux.Optimise.update!(opt, ps, grads)
    end
end
```
ToucheSir commented 1 year ago

I suspected as much. You'll want to drill further down into the timings to see if something like https://github.com/JuliaGPU/GPUCompiler.jl/issues/65 is at play.
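One way to split the timings (assuming Julia ≥ 1.8 for `@time_imports`, and reusing `task` and `data_` from the snippet at the top of the issue) would be roughly:

```julia
using InteractiveUtils: @time_imports

# 1. How much of the time is spent just loading the packages?
@time_imports using FastAI, FastVision, Metalhead

# 2. How much is spent compiling each stage on first use?
@time learner = tasklearner(task, data_, backbone=ResNet(18).layers[1], callbacks=[ToGPU()])
@time fitonecycle!(learner, 1)   # first epoch: dominated by (GPU) compilation
@time fitonecycle!(learner, 1)   # second call: closer to the real epoch time
```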

RomeoV commented 1 year ago

Thanks. When I find some time, I'll also check whether https://github.com/JuliaGPU/CUDA.jl/issues/1947 helps. But I'll probably move that discussion somewhere else.

RomeoV commented 11 months ago

Update on this: since https://github.com/JuliaGPU/CUDA.jl/issues/2006 seems to be fixed, it's possible to just write your own little precompile directive, which reduces the TTF-epoch to about 12 seconds, which is quite workable!

MyModule.jl:

```julia
module FastAIStartup

using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

import PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    @compile_workload begin
        # with FastAI.jl
        data = ([rand(RGB{N0f8}, 32, 32) for _ in 1:100],
                [rand(labels) for _ in 1:100])
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data,
                              backbone=backbone(EfficientNet(:b0)),
                              callbacks=[ToGPU()])
        fitonecycle!(learner, 2)
    end
end

end # module FastAIStartup
```

benchmark.jl

```julia
using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
        [rand(labels) for _ in 1:100])
blocks = (Image{2}(), FastAI.Label{String}(labels))
task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data,
                      backbone=backbone(EfficientNet(:b0)),
                      callbacks=[ToGPU()])
fitonecycle!(learner, 2)
```
```
julia> @time include("benchmark.jl")
 11.546966 seconds (7.37 M allocations: 731.768 MiB, 4.15% gc time, 27.73% compilation time: 3% of which was recompilation)
```
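(For the caching to kick in, the FastAIStartup package has to be loaded in the session before running the benchmark, e.g.:)

```
julia> using FastAIStartup   # loads the pkgimage containing the precompiled workload

julia> @time include("benchmark.jl")
```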
RomeoV commented 11 months ago

I still think it makes sense, though, to move some of the precompile directives into this package. Very broadly, something like:

```julia
@compile_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
            [rand(labels) for _ in 1:100])
    blocks = (Image{2}(), FastAI.Label{String}(labels))
    task = ImageClassificationSingle(blocks)
    learner = tasklearner(task, data,
                          backbone=backbone(mockmodel(task)))
    fitonecycle!(learner, 2)

    # enable this somehow only if CUDA is loaded?
    learner_gpu = tasklearner(task, data,
                              backbone=backbone(mockmodel(task)),
                              callbacks=[ToGPU()])
    fitonecycle!(learner_gpu, 2)
end
```
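For the "only if CUDA is loaded" part, one option might be a package extension with its own workload (sketch only; this assumes FastAI.jl declares CUDA.jl as a weak dependency and registers the extension in Project.toml):

```julia
# ext/FastAICUDAExt.jl -- hypothetical extension, only loaded when CUDA.jl is present.
module FastAICUDAExt

using FastAI, FastVision, CUDA
import PrecompileTools: @compile_workload

@compile_workload begin
    if CUDA.functional()   # skip on machines without a working GPU
        labels = ["0", "1"]
        data = ([rand(FastVision.RGB{FastVision.N0f8}, 64, 64) for _ in 1:10],
                [rand(labels) for _ in 1:10])
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data,
                              backbone=backbone(mockmodel(task)),
                              callbacks=[ToGPU()])
        fitonecycle!(learner, 1)
    end
end

end # module
```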
ToucheSir commented 11 months ago

I'm on board with adding precompile workloads, but only if we can ensure they don't use a bunch of CPU + memory at runtime (compile time is fine), don't modify any global state (e.g. the default RNG), and don't do any I/O. That last one is the most important because it has caused hangs during precompilation for other packages. That may mean strategic calls to `precompile` in some places instead of solely using PrecompileTools; see the sketch below.
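Such calls could look roughly like this (the signatures are illustrative; I'm assuming `FluxTraining.Learner` is the learner type behind `tasklearner`):

```julia
# Ask Julia to compile specific method signatures ahead of time without
# executing any user code: no RNG mutation, no I/O, no GPU allocation.
import FastAI, FluxTraining

precompile(FastAI.fitonecycle!, (FluxTraining.Learner, Int))
```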