FluxML / FluxTraining.jl

A flexible neural net training library inspired by fast.ai

https://fluxml.ai/FluxTraining.jl

MIT License

Use SnoopPrecompile.jl #140

Closed (lorenzoh closed this 1 year ago)

lorenzoh commented 1 year ago

This adds a basic precompile statement using SnoopPrecompile.jl.
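
For readers unfamiliar with SnoopPrecompile.jl, a precompile workload typically looks like the sketch below. It is not the code from this PR: the module name `MyTrainingPkg` and the function `train_step` are hypothetical stand-ins, and only the two macros, `@precompile_setup` and `@precompile_all_calls`, come from SnoopPrecompile.jl. The block is meant to sit at the top level of a package module, since the macros only do their work while the package is being precompiled.

```julia
module MyTrainingPkg  # hypothetical package name, for illustration only

using SnoopPrecompile

# Hypothetical stand-in for the package's real API.
train_step(x::Vector{Float64}) = sum(abs2, x)

@precompile_setup begin
    # Setup code: executed during precompilation to build inputs,
    # but the calls made here are not the ones being targeted for caching.
    data = rand(10)

    @precompile_all_calls begin
        # Every call made in this block is compiled and stored in the
        # package's precompile cache.
        train_step(data)
    end
end

end # module
```

In FluxTraining.jl the workload would presumably be something close to the `fit!(testlearner(), 1)` call measured in this thread, but that is an assumption, not a quote from the PR.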

The precompile statement reduces the time to the first fit! call. Measurements:

using FluxTraining: 21s (this PR), 19s (master) -> 2s slower
fit!(testlearner(), 1): 14.5s (this PR), 30s (master) -> 15.5s faster
both: 35.5s (this PR), 49s (master) -> 13.5s/40% faster
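
The thread does not say how these numbers were taken. One common way to get comparable load and first-call timings is to run each snippet in a fresh Julia process, so nothing is already compiled in the session. The sketch below uses only the standard library; the helper name `time_in_fresh_process` is made up for illustration, and `LinearAlgebra` stands in for the package under test. The result includes Julia's own startup time.

```julia
# Run `code` in a brand-new Julia process and return the wall-clock seconds.
# Hypothetical helper, not part of any package mentioned in this thread.
function time_in_fresh_process(code::AbstractString)
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $code`
    return @elapsed run(cmd)
end

# Stand-in workload that needs nothing beyond the standard library.
load_only     = time_in_fresh_process("using LinearAlgebra")
load_and_call = time_in_fresh_process("using LinearAlgebra; svd(rand(3, 3))")

println("load:              ", round(load_only; digits = 2), "s")
println("load + first call: ", round(load_and_call; digits = 2), "s")
```

To reproduce measurements in the style of this thread, the two code strings would be swapped for `using FluxTraining` and `using FluxTraining; fit!(testlearner(), 1)`, with the child process pointed at an environment where the package is installed.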

This seems like a clear win to me, apart from the longer precompilation time, which only occurs once in regular package usage. Has anyone tried using SnoopPrecompile.jl for other packages in the FluxML org?

ToucheSir commented 1 year ago

> Has anyone tried using SnoopPrecompile.jl for other packages in the FluxML org?

I use it for Zygote in https://github.com/FluxML/Zygote.jl/pull/1281, but didn't get nearly the same speedup because Zygote seems to be bottlenecked by LLVM time. This is quite the improvement!

lorenzoh commented 1 year ago

So should I go ahead with this? I'm not sure how much of this actually comes from Flux.jl vs FluxTraining.jl. Maybe we should try this with Flux.jl as well.

ToucheSir commented 1 year ago

We should, though that seems like a much bigger project given the size of Flux's API. If you're able to @snoopi_deep that fit!(testlearner(), 1) call, we could look at the flamegraph and see how much is Flux vs FluxTraining (vs Zygote).
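
The profiling step suggested here can be sketched roughly as follows. It assumes SnoopCompile.jl 2.x (where the macro is named `@snoopi_deep`), ProfileView.jl for display, and that `testlearner` is reachable after `using FluxTraining`, as the call in this thread suggests. It should be run in a fresh session so that nothing has been inferred yet.

```julia
using SnoopCompileCore      # provides @snoopi_deep with minimal dependencies
using FluxTraining

# Record type-inference timings for the first fit! call.
# Qualify the name (FluxTraining.testlearner) if it is not exported.
tinf = @snoopi_deep fit!(testlearner(), 1)

using SnoopCompile          # analysis tools, loaded only after measuring
fg = flamegraph(tinf)       # flamegraph of where inference time was spent

using ProfileView           # assumption: any flamegraph viewer would do
ProfileView.view(fg)
```

The bars in the resulting flamegraph are labelled with the methods being inferred, so the widest ones can be attributed to Flux, FluxTraining, or Zygote by the module they belong to.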

lorenzoh commented 1 year ago

The 3 big chunks in the flamegraph are all Zygote.jl, so I estimate that around 2/3 of the inference time is spent in Zygote.jl.

ToucheSir commented 1 year ago

Good to know. I just rebased the Zygote PR, are you able to test again with it?

lorenzoh commented 1 year ago

Yup, now testing with FluxTraining#master and Zygote#bc/precompile:

using FluxTraining: 18.2s
fit!(testlearner(), 1): 18s
both: 36.2s

Safe to say that Zygote.jl is the culprit here :P. I think this implies that the improvements from Zygote#bc/precompile are larger for downstream packages than for Zygote.jl itself. If that's the case, we should definitely merge that one.

lorenzoh commented 1 year ago

Finally, the one with precompilation in both Zygote and FluxTraining is even better:

using FluxTraining: 22s
fit!(testlearner(), 1): 10.8s
total: 32.8s

So I will be merging this PR as well if it looks good to you, Brian.
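
Pulling the four sets of measurements in this thread together, the totals and the savings relative to master can be checked with a few lines of arithmetic. The numbers are copied from the comments above; the row labels and the helper `total_time` are only for illustration.

```julia
# (load time, first fit! time) in seconds, as reported in the thread.
timings = [
    "master"                       => (19.0, 30.0),
    "FluxTraining precompile (PR)" => (21.0, 14.5),
    "Zygote precompile only"       => (18.2, 18.0),
    "both precompiles"             => (22.0, 10.8),
]

# Total time from `using FluxTraining` to the end of the first fit!.
total_time(t) = t[1] + t[2]

baseline = total_time(timings[1].second)   # master is the reference row

for (name, t) in timings
    total   = total_time(t)
    saved   = baseline - total             # seconds saved vs. master
    percent = 100 * saved / baseline       # share of master's time removed
    println(rpad(name, 30),
            " total = ", round(total; digits = 1), "s",
            ", saved = ", round(saved; digits = 1), "s",
            " (", round(percent; digits = 1), "% less time)")
end
```

For the PR row this gives 13.5s saved, which is about 27.6% of master's 49s, or equivalently a speedup of about 1.38x; the latter is presumably where the rough 40% figure comes from.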