EnzymeAD / Enzyme.jl

Julia bindings for the Enzyme automatic differentiator
https://enzyme.mit.edu
MIT License
450 stars 64 forks source link

Something invalidating `Enzyme.__init__` causes a heavy package load time penalty #776

Open KristofferC opened 1 year ago

KristofferC commented 1 year ago

When loading OmniPackage.jl(https://github.com/JuliaComputing/OmniPackage.jl/) a big part of its load time (around 10s) is spent compiling and running the __init__ method in Enzyme, profile available at https://kristofferc.github.io/tracy-traces/omnipackage/.

Ideally, Enzyme would somehow protect itself against this, by either making the code in __init__ less vulnerable to invalidations or some other way (maybe using a frozen world age).

cc @vchuravy

wsmoses commented 1 year ago

This should now be resolved by https://github.com/EnzymeAD/Enzyme.jl/pull/881

vchuravy commented 1 year ago

I am not sure that fixed all instances. Need to confirm with an actual profile

KristofferC commented 11 months ago

I just want to point out that this is still happening and causes ~18s load time increase to any package that recursively ends up getting Enzyme as a dependency.

wsmoses commented 11 months ago

hm @KristofferC do you have any information regarding what could be causing this?

KristofferC commented 11 months ago
wsmoses commented 11 months ago

is there a tool or something to profile what methods/invalidations are the source?

KristofferC commented 11 months ago

One would usually use SnoopCompile. I can try (later) to see if I can find what actually ends up invalidating it.

wsmoses commented 11 months ago

I will note that the Enzyme Blas init loader presently is expected to have a long load time and I'm not sure of how to fix that.

Specifically here: https://github.com/EnzymeAD/Enzyme.jl/blob/982949462e7e0a2d07ccf4158c440561117dc152/src/compiler/validation.jl#L67

We need the runtime address of blas and julia runtime libraries to be able to identify what function is being ccall'ed (in IR it is an integer). This is definitionally runtime information so it couldn't be preserved. If you have ideas for a design that avoid the expense, I'm all ears.

If it's somewhere else causing the issue definitely interested in what's causing it so we can fix it.

KristofferC commented 11 months ago
julia> @time using Enzyme
  0.383814 seconds (566.30 k allocations: 33.282 MiB, 21.24% compilation time: 27% of which was recompilation)

But:

julia> @time using Tracker
1.450969 seconds (1.83 M allocations: 109.970 MiB, 7.33% gc time, 31.56% compilation time: 26% of which was recompilation)

julia> @time using Enzyme
 21.592203 seconds (43.97 M allocations: 2.517 GiB, 1.84% gc time, 98.84% compilation time: 100% of which was recompilation)

is some sort of MWE. Will narrow it down more.

KristofferC commented 11 months ago

Okej, Tracker committed the mortal sin of defining show on a Type. With https://github.com/FluxML/Tracker.jl/pull/156 this is improved to

julia> @time using Tracker
  1.463012 seconds (1.90 M allocations: 113.004 MiB, 4.50% gc time, 25.19% compilation time: 16% of which was recompilation)

julia> @time using Enzyme
  0.590007 seconds (589.92 k allocations: 33.254 MiB, 14.00% gc time, 56.16% compilation time: 99% of which was recompilation)

Anyway, as a question, is there anyway this work can be moved from __init__ to until when it is actually needed? So you pay the cost if you use it but otherwise not?

staticfloat commented 11 months ago

It seems to me that that __init__() is just setting up ptr_map which is only used by restore_lookup, which could just do an initialization check, and if ptr_map is not initialized, it runs that setup code.

KristofferC commented 11 months ago

From some testing, it seems the __init__ that takes a long time to compile is https://github.com/EnzymeAD/Enzyme.jl/blob/019258a55ce7f9699c91cb87bb584640dd5b33e2/src/compiler.jl#L2425-L2454

wsmoses commented 11 months ago

Okay if that init takes longer to compile I am a bit more unclear how to resolve it.

So some context, this code interfaces between Enzyme proper (aka Enzyme_jll) and Enzyme.jl to registister all of the julia specific handling. As a result this actually is most of Enzyme.jl's code, so this is highly desirable to be precompiled.

this is done by taking many functions in Julia code, making a cfunction of it, then passing the Julia cfunction runtime pointer into Enzyme_jll. While the actual call to pass the pointer (and the pointer itself) are runtime specific values, the actual compilation of those cfunctions on those specified types is fixed, and should be in theory able to be precompiled (but I presume not in practice here).

Those cfunctions are wrapped in a macro (see here: https://github.com/EnzymeAD/Enzyme.jl/blob/019258a55ce7f9699c91cb87bb584640dd5b33e2/src/rules/llvmrules.jl#L1030 ) which creates an anonymous wrapper function to make it easier, which perhaps makes things more difficult.

Regardless, I'm not sure what the state of julia internals for precompilation are here, but that's what I can see from our end.

KristofferC commented 11 months ago

Running the register_llvm_rules() in a module where optimizations are turned off more or less fixes the bad behavior:

https://github.com/KristofferC/Enzyme.jl/commit/4246ddb42b18ef43958774b13621f6160f7992a8#diff-21145f7299f4174cfb41f7c0c845a0ee798fc065bae0a352171552dc4b810508R1033

However, I have no idea what impact this has on runtime performance so that would have to be investigated. Might not be a good idea at all.

wsmoses commented 11 months ago

Yeah disabling optimizations seems like a bad idea.

Is there any other way we can cut down on load time here?

On Mon, Nov 20, 2023 at 3:29 AM Kristoffer Carlsson < @.***> wrote:

Running the register_llvm_rules() in a module where optimizations are turned off more or less fixes the bad behavior:

@.***

diff-21145f7299f4174cfb41f7c0c845a0ee798fc065bae0a352171552dc4b810508R1033

https://github.com/KristofferC/Enzyme.jl/commit/4246ddb42b18ef43958774b13621f6160f7992a8#diff-21145f7299f4174cfb41f7c0c845a0ee798fc065bae0a352171552dc4b810508R1033

However, I have no idea what impact this has on runtime performance so that would have to be investigated. Might not be a good idea at all.

— Reply to this email directly, view it on GitHub https://github.com/EnzymeAD/Enzyme.jl/issues/776#issuecomment-1818776971, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTUXDWSHKOS7FYUK3RHJDYFMWGZAVCNFSM6AAAAAAXJ25G6OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJYG43TMOJXGE . You are receiving this because you modified the open/close state.Message ID: @.***>

KristofferC commented 11 months ago

Is there any other way we can cut down on load time here?

I guess these are the possibilities:

vchuravy commented 11 months ago

Yeah disabling optimizations seems like a bad idea.

I think disabling optimizations for the registration functions should be fine. But I wonder if it impacts the callbacks, if it is only the trampoline that should be fine.

KristofferC commented 11 months ago

But I wonder if it impacts the callbacks, if it is only the trampoline that should be fine.

That's what I was worried about as well. I think the compilation time was in fact partly due to inferring the callbacks so my change making things faster probably means the callbacks also stopped getting inferred...

vchuravy commented 8 months ago

Concretly we are spending ~30s on compiling init

image