MilesCranmer opened this issue 11 months ago
Okay, I have verified the gradients are not broken – the compilation is just extremely long. For example, if I reduce the number of functions to just 4 in total, `operators = OperatorEnum((+, -), (cos, sin))`, then the compilation with `Val(2)` takes about 30 seconds. But every function I add here seems to scale the compilation time exponentially.
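(For reference, the timings here are just from wrapping the first call in `@time`, e.g. with the `run_test` helper defined further down in this thread – the first call includes all of the Enzyme/GPUCompiler compilation:)

```julia
@time run_test(2)  # first call: ~30 s of compilation with (+, -) and (cos, sin)
@time run_test(2)  # subsequent calls: fractions of a millisecond
```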
@wsmoses I remember you mentioned some cache lock may be hanging here: https://github.com/EnzymeAD/Enzyme.jl/issues/1018#issuecomment-1689434106. Could that be the cause of this?
cc @vchuravy
Yeah that last issue seemed to have some sort of deadlock (aka a lock state that would never release)
@MilesCranmer would you be able to simplify your hanging MWE? It doesn't need to compute the same thing as long as it still hangs (e.g. sum -> first, etc.). The plentiful macros, generated functions, etc. are making it difficult to diagnose.
Unfortunately, all of this stuff is required to reproduce it. The generated functions are necessary for `Base.Cartesian.@nif` to work, and that itself is necessary for Enzyme to work – otherwise there's a type instability in the operators used. You can see the statements that get turned on with `fuse_level > 1`; those are what give the very large compilation time.
Can you look at https://github.com/JuliaGPU/GPUCompiler.jl/blob/21ca075c1e91fe0c15f1330ab487b4831013ec1f/src/jlgen.jl#L175 after the compilation?
@MilesCranmer in particular I am interested in how much it grows.
Oh I see what you mean. Yeah let me check.
Okay, for both fuse levels the dictionary has the same length – just 3.
Here's the code I'm using to test (referencing the code in the original snippet above)
```julia
function gen_f(::Val{fuse_level}) where {fuse_level}
    function (tree, X, operators, output)
        output[] = sum(eval_tree_array(tree, X, operators, Val(fuse_level))[1])
        return nothing
    end
end

function run_test(fuse_level::Integer)
    # Operators to use:
    operators = OperatorEnum((+, -), (cos, sin))
    # Variables:
    x1, x2, x3 = (i -> Node(Float64; feature=i)).(1:3)
    # Expression:
    tree = Node(1, x1, Node(1, x2))  # == x1 + cos(x2)
    # Input data
    X = randn(3, 100)
    # Output:
    # eval_tree_array(tree, X, operators, Val(2))
    output = [0.0]
    doutput = [1.0]
    dX = zeros(size(X))
    autodiff(
        Reverse,
        gen_f(Val(fuse_level)),
        Const(tree),
        Duplicated(X, dX),
        Const(operators),
        Duplicated(output, doutput)
    )
    dX
end
```
This number of operators is still pretty light. If I go to `OperatorEnum((+, -, *, /), (cos, sin))` and `fuse_level=2`, it will hit a stack overflow with no explanation of where it occurred. So it seems like there is some recursive function that is working really, really hard for this type of branching?

If I first do `run_test(1)`, and then in the same session `run_test(2)`, it says that the cache is 5 in length.
If you are instead asking about the contents of the entries of `GLOBAL_CI_CACHES`, then this is the result:
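(The listings below were produced by iterating over the cache, roughly like this – the exact snippet is my reconstruction:)

```julia
for k in keys(Enzyme.GPUCompiler.GLOBAL_CI_CACHES)
    @show k
    @show length(Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k].dict)
end
```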
I first run `run_test(1)`. Then, `GLOBAL_CI_CACHES` has length 3. Within that, the contents are:
```
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1006
k = CompilerConfig for GPUCompiler.NativeCompilerTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1011
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 0
```
I then run `run_test(2)`. The cache has 5 elements in total. The contents are:
```
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1006
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1017
k = CompilerConfig for GPUCompiler.NativeCompilerTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1025
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 0
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 0
```
If I then change the definition of `run_test` so that the `OperatorEnum` has an additional operator, and run `run_test(1)`, I see the `NativeCompilerTarget` cache at 1039 entries. The `EnzymeTarget` cache for that run seems to be 1010 entries.

With that updated `run_test`, if I then do `run_test(2)`, I see the `NativeCompilerTarget` cache grow to 1045 entries, and the latest `EnzymeTarget` cache at 1024 entries.
@wsmoses I just managed to find a way to further reduce the code while still getting the long compilation behavior (updated first comment). It's a bit faster overall so I'm not sure if all the same behavior is there, but the exponential scaling from fuse_level=1 to fuse_level=2, or for more operators, seems to be the same.
(Note that the generated functions cannot be removed, as they are required for the `Base.Cartesian.@nif`, and that is required for Enzyme to know the operators at compile time.)
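(To illustrate the type-stability point with a toy example – this is not the package code, just the pattern that `@nif` enables:)

```julia
using Base.Cartesian: @nif

const ops = (cos, sin, exp)

# Indexing a heterogeneous tuple with a runtime index gives a callee whose type
# depends on `i`, which is the instability Enzyme struggles with:
apply_unstable(i, x) = ops[i](x)

# @nif unrolls into `1 == i ? cos(x) : 2 == i ? sin(x) : exp(x)`, so every
# branch calls a concrete, compile-time-known function:
apply_stable(i, x) = @nif(3, j -> j == i, j -> ops[j](x))
```

In the real code this has to live inside an `@generated` function, because the tuple length passed to `@nif` is only known from the `OperatorEnum` type rather than as a literal.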
Hey both, just wanted to check in on this and see how things are going, and whether you've had a chance to look into it. Thanks for all your work on this great package. Best, Miles
To temper expectations, I probably won't have time to look at things until end of January.
@MilesCranmer the latest release has just added substantial improvements to compile time. However, I don't think they address the root cause (presumably nested GPUCompiler compilation?), so it'll be a constant-overhead speedup – but still not nothing.
@vchuravy do you have time to diagnose this together next week?
Sorry, finally starting to investigate this more closely. For the first function, which you said took about a minute to compile:
```julia
julia> @time autodiff(
           Reverse,
           f,
           Const(tree),
           Duplicated(X, dX),
           Const(operators),
           Duplicated(output, doutput)
       )
 45.757168 seconds (27.99 M allocations: 1.761 GiB, 1.81% gc time, 100.00% compilation time)
((nothing, nothing, nothing, nothing),)

julia> @time autodiff(
           Reverse,
           f,
           Const(tree),
           Duplicated(X, dX),
           Const(operators),
           Duplicated(output, doutput)
       )
  0.000065 seconds (161 allocations: 10.609 KiB)
((nothing, nothing, nothing, nothing),)
```
The long compile time seems to persist on current main:
```julia
julia> @time autodiff(
           Reverse,
           f,
           Const(tree),
           Duplicated(X, dX),
           Const(operators),
           Duplicated(output, doutput)
       )
^CERROR: InterruptException:
Stacktrace:
[1] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{…}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{…}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{…}, augmented::Ptr{…}, atomicAdd::Bool)
@ Enzyme.API ~/git/Enzyme.jl/src/api.jl:154
[2] enzyme!(job::GPUCompiler.CompilerJob{…}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::NTuple{…}, returnPrimal::Bool, expectedTapeType::Type, loweredArgs::Set{…}, boxedArgs::Set{…})
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:3177
[3] codegen(output::Symbol, job::GPUCompiler.CompilerJob{…}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5070
[4] codegen
@ ~/git/Enzyme.jl/src/compiler.jl:4477 [inlined]
[5] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool)
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5755
[6] _thunk
@ ~/git/Enzyme.jl/src/compiler.jl:5755 [inlined]
[7] cached_compilation
@ ~/git/Enzyme.jl/src/compiler.jl:5793 [inlined]
[8] (::Enzyme.Compiler.var"#554#555"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, NTuple{5, Bool}, Int64, Bool, Bool, UInt64, DataType})(ctx::LLVM.Context)
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5859
[9] JuliaContext(f::Enzyme.Compiler.var"#554#555"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, NTuple{5, Bool}, Int64, Bool, Bool, UInt64, DataType}; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/kqxyC/src/driver.jl:52
[10] JuliaContext(f::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/kqxyC/src/driver.jl:42
[11] #s2027#553
@ ~/git/Enzyme.jl/src/compiler.jl:5811 [inlined]
[12] var"#s2027#553"(FA::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, ReturnPrimal::Any, ShadowInit::Any, World::Any, ABI::Any, ::Any, ::Type, ::Type, ::Type, tt::Any, ::Type, ::Type, ::Type, ::Type, ::Type, ::Any)
@ Enzyme.Compiler ./none:0
[13] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
@ Core ./boot.jl:602
[14] autodiff
@ ~/git/Enzyme.jl/src/Enzyme.jl:286 [inlined]
[15] autodiff
@ ~/git/Enzyme.jl/src/Enzyme.jl:315 [inlined]
[16] autodiff(::ReverseMode{false, FFIABI, false}, ::typeof(f), ::Const{Node{Float64}}, ::Duplicated{Matrix{Float64}}, ::Const{OperatorEnum{Tuple{…}, Tuple{…}}}, ::Duplicated{Vector{Float64}})
@ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:300
[17] macro expansion
@ ./timing.jl:279 [inlined]
[18] top-level scope
@ ./REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
```
Looks like it's in the autodiff function itself rather than a deadlock (hopefully).
@vchuravy do you mind hitting it with a profile like when we were perf-debugging Sarah's code?
I have an attempt v2 of using Enzyme in SymbolicRegression.jl here: https://github.com/MilesCranmer/SymbolicRegression.jl/pull/326. This is the extension file itself: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/98d6329355eae8567304eed3e163eba300d1a7d8/ext/SymbolicRegressionEnzymeExt.jl.
I got some small-scale tests working (takes a while to compile), but when I do a full test and use it as the AD backend for symbolic regression, it just hangs.
Looking at `btop`, I can see that Julia isn't even spinning the CPUs.

It's weird because it seems to be even worse than before. Before, even though it took a very long time, it eventually compiled. But now it seems to just freeze. Any guesses what it's from? This is on Julia 1.10.4 on macOS. Let me know if there's any other info I can provide.
Attach lldb and get a stacktrace on all threads. Are you using multiple threads and `autodiff_deferred`?
> Attach lldb

Will do

> Are you using multiple threads

Yes (`auto`). I turned off multiple threads within the search – it should run serially (though Julia itself has access to multiple).

> and `autodiff_deferred`?

No
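(For reference, attaching to the running Julia process and dumping thread backtraces looks roughly like this:)

```
$ lldb -p <julia_pid>
(lldb) thread backtrace all
(lldb) continue
```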
Ran in lldb, maybe this is it @vchuravy?
```
(lldb) frame info
frame #0: 0x000000010535467c libjulia-internal.1.10.4.dylib`_jl_mutex_lock [inlined] _jl_mutex_wait(self=0x000000010bf65c30, lock=0x00000001056b1fc8, safepoint=1) at threading.c:847:17 [opt]
```
And the backtrace:
```
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000010535467c libjulia-internal.1.10.4.dylib`_jl_mutex_lock [inlined] _jl_mutex_wait(self=0x000000010bf65c30, lock=0x00000001056b1fc8, safepoint=1) at threading.c:847:17 [opt]
frame #1: 0x0000000105354638 libjulia-internal.1.10.4.dylib`_jl_mutex_lock(self=0x000000010bf65c30, lock=0x00000001056b1fc8) at threading.c:875:5 [opt]
frame #2: 0x0000000105af0f28 libjulia-codegen.1.10.4.dylib`::jl_generate_fptr_impl(jl_method_instance_t *, size_t, int *) [inlined] jl_mutex_lock(lock=<unavailable>) at julia_locks.h:65:5 [opt]
frame #3: 0x0000000105af0f18 libjulia-codegen.1.10.4.dylib`jl_generate_fptr_impl(mi=0x00000001777dfc10, world=31546, did_compile=0x000000015768e484) at jitlayers.cpp:483:5 [opt]
frame #4: 0x0000000105309e3c libjulia-internal.1.10.4.dylib`jl_compile_method_internal(mi=0x00000001777dfc10, world=31546) at gf.c:2481:16 [opt]
frame #5: 0x000000010530d104 libjulia-internal.1.10.4.dylib`ijl_apply_generic [inlined] _jl_invoke(F=0x0000000174ff0670, args=0x000000015768e5a8, nargs=2, mfunc=0x00000001777dfc10, world=31546) at gf.c:2887:16 [opt]
frame #6: 0x000000010530d0d0 libjulia-internal.1.10.4.dylib`ijl_apply_generic(F=0x0000000174ff0670, args=0x000000015768e5a8, nargs=<unavailable>) at gf.c:3077:12 [opt]
frame #7: 0x0000000171a174d0 X2eIS_auMC0.dylib`julia_value_gradientNOT.NOT._25951 at interface.jl:82
frame #8: 0x0000000171a68c7c X2eIS_auMC0.dylib`julia_initial_state_25915 at bfgs.jl:94
frame #9: 0x0000000171300de4
frame #10: 0x00000001713089cc
frame #11: 0x000000016751c208
frame #12: 0x00000001712c841c
frame #13: 0x000000010532c8d8 libjulia-internal.1.10.4.dylib`start_task [inlined] jl_apply(args=<unavailable>, nargs=1) at julia.h:1982:12 [opt]
frame #14: 0x000000010532c8cc libjulia-internal.1.10.4.dylib`start_task at task.c:1238:19 [opt]
```
When I continue and then ctrl-c:
```
(lldb) c
Process 27992 resuming
Process 27992 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000010535467c libjulia-internal.1.10.4.dylib`_jl_mutex_lock [inlined] _jl_mutex_wait(self=0x000000010bf65c30, lock=0x00000001056b1fc8, safepoint=1) at threading.c:847:17 [opt]
844 uv_mutex_unlock(&tls_lock);
845 }
846 jl_cpu_suspend();
-> 847 owner = jl_atomic_load_relaxed(&lock->owner);
848 }
849 }
850
```
When I run it with `--threads=1`:
```
julia --startup-file=no --project=. --threads=1 example.jl
ERROR: LoadError: TaskFailedException
nested task error: StackOverflowError:
Stacktrace:
[1] threading_run(fun::SymbolicRegression.SingleIterationModule.var"#296#threadsfor_fun#5"{SymbolicRegression.SingleIterationModule.var"#296#threadsfor_fun#4#6"{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}, Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, BitVector, Vector{Float64}, UnitRange{Int64}}}, static::Bool)
@ Base.Threads ./threadingconstructs.jl:172
[2] macro expansion
@ ./threadingconstructs.jl:220 [inlined]
[3] macro expansion
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/Utils.jl:155 [inlined]
[4] optimize_and_simplify_population(dataset::Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}, pop::Population{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, curmaxsize::Int64, record::Dict{String, Any})
@ SymbolicRegression.SingleIterationModule ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SingleIteration.jl:111
[5] _dispatch_s_r_cycle(in_pop::Population{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, dataset::Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}; pop::Int64, out::Int64, iteration::Int64, verbosity::Int64, cur_maxsize::Int64, running_search_statistics::SymbolicRegression.AdaptiveParsimonyModule.RunningSearchStatistics)
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:1150
[6] macro expansion
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:841 [inlined]
[7] macro expansion
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SearchUtils.jl:118 [inlined]
[8] _warmup_search!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}, Tuple{Population{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, HallOfFame{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, Dict{String, Any}, Float64}, Channel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:serial, 1, false}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}})
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:836
[9] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:serial, 1, false}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, saved_state::Nothing)
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:622
[10] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, parallelism::Symbol, numprocs::Nothing, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, saved_state::Nothing, return_state::Nothing, verbosity::Nothing, progress::Nothing, v_dim_out::Val{1})
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:595
[11] equation_search
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:473 [inlined]
[12] #equation_search#26
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:436 [inlined]
[13] equation_search
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:382 [inlined]
[14] #equation_search#28
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:466 [inlined]
[15] top-level scope
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/example.jl:14
in expression starting at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/example.jl:14
```
And when I remove that `@threads` call, I get:
```
> julia --startup-file=no --project=. --threads=1 example.jl
ERROR: LoadError: StackOverflowError:
in expression starting at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/example.jl:14
```
Hahaha I found a hack to make it work. Increase the stack size 😆
```julia
with_stack(f, n) = fetch(schedule(Task(f, n)))

hall_of_fame = with_stack(2_000_000_000) do
    equation_search(X, y; niterations=40, options=options, parallelism=:serial)
end
```
This actually works in the single-threaded version. It's even surprisingly pretty fast in compilation! (Maybe the multiple threads were interfering with each other?)
But... why is the stack exploding?
(Also... this crashes in multi-threaded mode, see below. Maybe spawned tasks don't inherit the parent task's stack size?)
```
fatal: error thrown and no exception handler available.
ErrorException("attempt to switch to exited task")
ijl_error at /Users/mcranmer/PermaDocuments/julia/src/rtutils.c:41
ijl_switch at /Users/mcranmer/PermaDocuments/julia/src/task.c:634
fatal: error thrown and no exception handler available.
ErrorException("attempt to switch to exited task")
try_yieldto at ./task.jl:921
ijl_error at /Users/mcranmer/PermaDocuments/julia/src/rtutils.c:41
ijl_switch at /Users/mcranmer/PermaDocuments/julia/src/task.c:634
wait at ./task.jl:995
try_yieldto at ./task.jl:921
task_done_hook at ./task.jl:675
wait at ./task.jl:995
jfptr_task_done_hook_75409 at /Users/mcranmer/PermaDocuments/julia/usr/lib/julia/sys.dylib (unknown line)
task_done_hook at ./task.jl:675
_jl_invoke at /Users/mcranmer/PermaDocuments/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mcranmer/PermaDocuments/julia/src/gf.c:3077
jfptr_task_done_hook_75409 at /Users/mcranmer/PermaDocuments/julia/usr/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/mcranmer/PermaDocuments/julia/src/./julia.h:1982 [inlined]
jl_finish_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:320
_jl_invoke at /Users/mcranmer/PermaDocuments/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mcranmer/PermaDocuments/julia/src/gf.c:3077
start_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:1249
jl_apply at /Users/mcranmer/PermaDocuments/julia/src/./julia.h:1982 [inlined]
jl_finish_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:320
start_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:1249
```
Quick updates:
```julia
with_stacksize(8 * 1024 * 1024) do
    autodiff(
        Reverse,
        evaluator,
        Duplicated(g.f.tree, g.extra.storage_tree),
        Duplicated(g.f.dataset, g.extra.storage_dataset),
        Const(g.f.options),
        Const(g.f.idx),
        Duplicated(output, doutput),
    )
end
```
This seems to fix things. It works with a full multi-threaded SymbolicRegression.jl search, being used by Optim.jl for optimizing constants – even with the new ParametricExpression that stores constants in multiple places.
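(Here `with_stacksize` is presumably the same kind of helper as `with_stack` above – run the closure on a task with a larger reserved stack and wait for it; a sketch:)

```julia
# Run `f` on a fresh task whose reserved stack is `stacksize` bytes, and fetch the result.
with_stacksize(f, stacksize::Integer) = fetch(schedule(Task(f, stacksize)))
```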
However, I wasn't sure whether this use of the `Task` API is safe or not, so I posted on Discourse here: https://discourse.julialang.org/t/raising-the-roof-increased-stack-size-shenanigans-in-enzyme/116511?u=milescranmer

Edit: The default Julia stack size is 4 MB. I am requesting 8 MB in the task.
The stack size trick (/hack) is magic, Enzyme is rock solid now for me. So this is as good as closed. Thanks for all the help!!
P.S. It works so well that I wonder if you would consider putting a larger stack size directly into Enzyme? The default Julia stack size of 4 MB (or 2 MB on 32-bit systems) seems tiny for what Enzyme needs (I posted an issue here: https://github.com/JuliaLang/julia/issues/54998). I feel like Enzyme should even request 32 MB to be safe.
It would also give more reliable stack overflow errors if your compilation thread always starts with the same stack size.
I think part of the difficulty in debugging this was that the stack overflows were random.
I know that languages sometimes allocate threads with smaller stack sizes, so maybe that muddied things up as well (maybe a thread started doing the Enzyme compilation, hit a stack overflow, and basically locked the compile cache forever).
P.P.S. I realised that it gets even trickier because of the caching. If one thread is compiling, hits a stack overflow, but has already cached some of its work, could another thread then theoretically get deeper into the AD process (since it no longer needs to travel through those particular frames)?

Maybe I'm imagining things, but I also feel like the larger stack size made compilation much faster... Maybe it was from some threads dying off from stack overflows, and the larger stack just let the first one run to completion.
I do think we need to actually check what is creating such deep stacks here.
Compilers are not supposed to be recursive, and if they are, we should transition to a worklist approach.
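(For illustration only – the general idea of replacing recursion with a worklist, so stack depth stays bounded regardless of input size; this uses the `Node` type from the MWE and is not Enzyme or Julia compiler code:)

```julia
# Recursive traversal: stack depth grows with the depth of the tree.
function count_nodes_recursive(tree::Node)
    if tree.degree == 0
        return 1
    elseif tree.degree == 1
        return 1 + count_nodes_recursive(tree.l)
    else
        return 1 + count_nodes_recursive(tree.l) + count_nodes_recursive(tree.r)
    end
end

# Worklist traversal: constant stack depth; pending work lives on the heap.
function count_nodes_worklist(tree::Node)
    count = 0
    worklist = Node[tree]
    while !isempty(worklist)
        node = pop!(worklist)
        count += 1
        node.degree >= 1 && push!(worklist, node.l)
        node.degree == 2 && push!(worklist, node.r)
    end
    return count
end
```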
I honestly don't know. I think even without recursion you could hit this stack overflow – see my experiments here: https://discourse.julialang.org/t/experiments-with-julia-stack-sizes-and-enzyme/116511/2?u=milescranmer

If the workload is big enough, and Enzyme gets called deep enough in the existing stack, I think it's quite possible to hit an overflow within Julia's 4 MB stack size if the stack frames are hefty.

For my case – I was calling Optim deep inside SymbolicRegression, and that Optim optimization (which is also quite deep with all the line-search machinery) wraps an Enzyme-computed gradient of a loss function which itself calls deep into DynamicExpressions (which uses recursive evaluation).
Even creating a single `Task` at the start of an Enzyme call is probably a good idea, since it will start with a fresh stack. Wdyt?
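(A minimal sketch of that idea – the names and the 32 MB figure are illustrative, not Enzyme's actual API:)

```julia
# Hypothetical wrapper: run an Enzyme entry point on a fresh task with a large
# reserved stack, so compilation always begins from a shallow, predictable stack.
run_on_fresh_stack(f, reserved_stack::Int=32 * 1024 * 1024) =
    fetch(schedule(Task(f, reserved_stack)))

run_on_fresh_stack() do
    autodiff(Reverse, my_loss, Duplicated(x, dx))  # hypothetical loss and arguments
end
```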
For fun, can you set the environment variable ENABLE_GDBLISTENER=1 and get the backtrace of the stack overflow in gdb? That would hopefully let us see which Julia functions cause the relevant issues here.
I don't think gdb is available on macOS unfortunately. Is lldb okay?
Yeah this works for lldb as well.
You can also now use the Julia profiler during compilation (since it doesn't hang anymore) to get some idea of the stack traces involved.
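(A sketch of what that looks like with the MWE objects from earlier in the thread – profile the first, compiling call and then inspect the flame graph:)

```julia
using Profile, ProfileView

Profile.init(n = 10^7, delay = 0.001)  # deep compiler stacks need a large sample buffer
@profile autodiff(
    Reverse, f, Const(tree), Duplicated(X, dX), Const(operators), Duplicated(output, doutput)
)
ProfileView.view()
```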
Just tried... The call stack is so deep that the profiler is refusing to show it 😆
Okay, ProfileView.jl to the rescue:
Okay, here is a cycle I found from scrolling the stack. You can see it starts at `abstractinterpretation.jl:2370`, and then ends up there again:
| file | method |
| --- | --- |
| ./compiler/abstractinterpretation.jl:2370 | Core.Compiler.abstract_eval_call(::Core.Compiler.NativeInterpreter, ::Expr, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2913 | Core.Compiler.abstract_eval_basic_statement(::Core.Compiler.NativeInterpreter, ::Any, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2624 | Core.Compiler.abstract_eval_statement(::Core.Compiler.NativeInterpreter, ::Any, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2380 | Core.Compiler.abstract_eval_statement_expr(::Core.Compiler.NativeInterpreter, ::Expr, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2370 | Core.Compiler.abstract_eval_call(::Core.Compiler.NativeInterpreter, ::Expr, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2354 | Core.Compiler.abstract_call(::Core.Compiler.NativeInterpreter, ::Core.Compiler.ArgInfo, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2162 | Core.Compiler.abstract_call(::Core.Compiler.NativeInterpreter, ::Core.Compiler.ArgInfo, ::Core.Compiler.StmtInfo, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2169 | Core.Compiler.abstract_call(::Core.Compiler.NativeInterpreter, ::Core.Compiler.ArgInfo, ::Core.Compiler.StmtInfo, ::Core.Compiler.InferenceState, ::Int64) |
| ./compiler/abstractinterpretation.jl:2087 | Core.Compiler.abstract_call_known(::Core.Compiler.NativeInterpreter, ::Any, ::Core.Compiler.ArgInfo, ::Core.Compiler.StmtInfo, ::Core.Compiler.InferenceState, ::Int64) |
| ./compiler/abstractinterpretation.jl:95 | Core.Compiler.abstract_call_gf_by_type(::Core.Compiler.NativeInterpreter, ::Any, ::Core.Compiler.ArgInfo, ::Core.Compiler.StmtInfo, ::Any, ::Core.Compiler.InferenceState, ::Int64) |
| ./compiler/abstractinterpretation.jl:629 | Core.Compiler.abstract_call_method(::Core.Compiler.NativeInterpreter, ::Method, ::Any, ::Core.SimpleVector, ::Bool, ::Core.Compiler.StmtInfo, ::Core.Compiler.InferenceState) |
| ./compiler/typeinfer.jl:930 | Core.Compiler.typeinf_edge(::Core.Compiler.NativeInterpreter, ::Method, ::Any, ::Core.SimpleVector, ::Core.Compiler.InferenceState) |
| ./compiler/typeinfer.jl:216 | Core.Compiler.typeinf(::Core.Compiler.NativeInterpreter, ::Core.Compiler.InferenceState) |
| ./compiler/typeinfer.jl:247 | Core.Compiler._typeinf(::Core.Compiler.NativeInterpreter, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:3186 | Core.Compiler.typeinf_nocycle(::Core.Compiler.NativeInterpreter, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:3098 | Core.Compiler.typeinf_local(::Core.Compiler.NativeInterpreter, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2913 | Core.Compiler.abstract_eval_basic_statement(::Core.Compiler.NativeInterpreter, ::Any, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2624 | Core.Compiler.abstract_eval_statement(::Core.Compiler.NativeInterpreter, ::Any, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2380 | Core.Compiler.abstract_eval_statement_expr(::Core.Compiler.NativeInterpreter, ::Expr, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
| ./compiler/abstractinterpretation.jl:2370 | Core.Compiler.abstract_eval_call(::Core.Compiler.NativeInterpreter, ::Expr, ::Vector{Core.Compiler.VarState}, ::Core.Compiler.InferenceState) |
And, here's a flat profile, sorted by count, and with the last 300 rows:
Also, if I may, can you explain a bit about what sorts of things you are looking for here, and what you can draw from the profile results? I want to learn 🙂 It has taken me a while to collect all of the information in this thread so I'd really like to come away with a deeper understanding of Julia/LLVM/Enzyme internals!
Could you export the data with PProf.jl? Generally I would be looking for Enzyme related cycles.
The inference cycle you found is annoying, but there is ongoing work to transition inference more to a worklist approach.
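(For reference, exporting an already-collected profile with PProf.jl is roughly:)

```julia
using PProf

# Writes the most recent Profile data to a gzipped pprof file:
pprof(; out = "profile.pb.gz", web = false)
```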
Here you go: profile.pb.gz
Friendly ping on this
The old issue #1018 from August was getting a bit lengthy so I'm moving to a new issue – feel free to close that one.
This relates to my one-year effort to try to get Enzyme.jl working as one of the AD backends for DynamicExpressions.jl, SymbolicRegression.jl, and PySR. The current status is:
I've boiled down the MWE to the following code, which replicates the issues I am seeing:
```julia
using Enzyme

################################################################################
### OperatorEnum.jl
################################################################################
struct OperatorEnum{B,U}
    binops::B
    unaops::U
end
################################################################################

################################################################################
### Equation.jl
################################################################################
mutable struct Node{T}
    degree::UInt8  # 0 for constant/variable, 1 for cos/sin, 2 for +/* etc.
    constant::Bool  # false if variable
    val::Union{T,Nothing}  # If is a constant, this stores the actual value
    # ------------------- (possibly undefined below)
    feature::UInt16  # If is a variable (e.g., x in cos(x)), this stores the feature index.
    op::UInt8  # If operator, this is the index of the operator in operators.binops, or operators.unaops
    l::Node{T}  # Left child node. Only defined for degree=1 or degree=2.
    r::Node{T}  # Right child node. Only defined for degree=2.

    Node(d::Integer, c::Bool, v::_T) where {_T} = new{_T}(UInt8(d), c, v)
    Node(::Type{_T}, d::Integer, c::Bool, v::_T) where {_T} = new{_T}(UInt8(d), c, v)
    Node(::Type{_T}, d::Integer, c::Bool, v::Nothing, f::Integer) where {_T} =
        new{_T}(UInt8(d), c, v, UInt16(f))
    Node(d::Integer, c::Bool, v::Nothing, f::Integer, o::Integer, l::Node{_T}) where {_T} =
        new{_T}(UInt8(d), c, v, UInt16(f), UInt8(o), l)
    Node(d::Integer, c::Bool, v::Nothing, f::Integer, o::Integer, l::Node{_T}, r::Node{_T}) where {_T} =
        new{_T}(UInt8(d), c, v, UInt16(f), UInt8(o), l, r)
end
function Node(::Type{T}; val::T1=nothing, feature::T2=nothing)::Node{T} where {T,T1,T2}
    if T2 <: Nothing
        !(T1 <: T) && (val = convert(T, val))
        return Node(T, 0, true, val)
    else
        return Node(T, 0, false, nothing, feature)
    end
end
Node(op::Integer, l::Node{T}) where {T} = Node(1, false, nothing, 0, op, l)
Node(op::Integer, l::Node{T}, r::Node{T}) where {T} = Node(2, false, nothing, 0, op, l, r)
################################################################################

################################################################################
### Utils.jl
################################################################################
@inline function fill_similar(value, array, args...)
    out_array = similar(array, args...)
    out_array .= value
    return out_array
end
is_bad_array(array) = !(isempty(array) || isfinite(sum(array)))
function is_constant(tree::Node)
    if tree.degree == 0
        return tree.constant
    elseif tree.degree == 1
        return is_constant(tree.l)
    else
        return is_constant(tree.l) && is_constant(tree.r)
    end
end
################################################################################

################################################################################
### EvaluateEquation.jl
################################################################################
struct ResultOk{A<:AbstractArray}
    x::A
    ok::Bool
end

function eval_tree_array(
    tree::Node{T}, cX::AbstractMatrix{T}, operators::OperatorEnum, fuse_level=Val(2)
) where {T<:Number}
    result = _eval_tree_array(tree, cX, operators, fuse_level)
    return (result.x, result.ok && !is_bad_array(result.x))
end

counttuple(::Type{<:NTuple{N,Any}}) where {N} = N
get_nuna(::Type{<:OperatorEnum{B,U}}) where {B,U} = counttuple(U)
get_nbin(::Type{<:OperatorEnum{B}}) where {B} = counttuple(B)

@generated function _eval_tree_array(
    tree::Node{T}, cX::AbstractMatrix{T}, operators::OperatorEnum, ::Val{fuse_level}
)::ResultOk where {T<:Number,fuse_level}
    nuna = get_nuna(operators)
    nbin = get_nbin(operators)
    quote
        # First, we see if there are only constants in the tree - meaning
        # we can just return the constant result.
        if tree.degree == 0
            return deg0_eval(tree, cX)
        elseif is_constant(tree)
            # Speed hack for constant trees.
            const_result = _eval_constant_tree(tree, operators)::ResultOk{Vector{T}}
            !const_result.ok && return ResultOk(similar(cX, axes(cX, 2)), false)
            return ResultOk(fill_similar(const_result.x[], cX, axes(cX, 2)), true)
        elseif tree.degree == 1
            op_idx = tree.op
            # This @nif lets us generate an if statement over choice of operator,
            # which means the compiler will be able to completely avoid type inference on operators.
            return Base.Cartesian.@nif(
                $nuna,
                i -> i == op_idx,
                i -> let op = operators.unaops[i]
                    if fuse_level > 1 && tree.l.degree == 2 && tree.l.l.degree == 0 && tree.l.r.degree == 0
                        # op(op2(x, y)), where x, y, z are constants or variables.
                        l_op_idx = tree.l.op
                        Base.Cartesian.@nif(
                            $nbin,
                            j -> j == l_op_idx,
                            j -> let op_l = operators.binops[j]
                                deg1_l2_ll0_lr0_eval(tree, cX, op, op_l)
                            end,
                        )
                    elseif fuse_level > 1 && tree.l.degree == 1 && tree.l.l.degree == 0
                        # op(op2(x)), where x is a constant or variable.
                        l_op_idx = tree.l.op
                        Base.Cartesian.@nif(
                            $nuna,
                            j -> j == l_op_idx,
                            j -> let op_l = operators.unaops[j]
                                deg1_l1_ll0_eval(tree, cX, op, op_l)
                            end,
                        )
                    else
                        # op(x), for any x.
                        result = _eval_tree_array(tree.l, cX, operators, Val(fuse_level))
                        !result.ok && return result
                        deg1_eval(result.x, op)
                    end
                end
            )
        else
            op_idx = tree.op
            return Base.Cartesian.@nif(
                $nbin,
                i -> i == op_idx,
                i -> let op = operators.binops[i]
                    if fuse_level > 1 && tree.l.degree == 0 && tree.r.degree == 0
                        deg2_l0_r0_eval(tree, cX, op)
                    elseif tree.r.degree == 0
                        result_l = _eval_tree_array(tree.l, cX, operators, Val(fuse_level))
                        !result_l.ok && return result_l
                        # op(x, y), where y is a constant or variable but x is not.
                        deg2_r0_eval(tree, result_l.x, cX, op)
                    elseif tree.l.degree == 0
                        result_r = _eval_tree_array(tree.r, cX, operators, Val(fuse_level))
                        !result_r.ok && return result_r
                        # op(x, y), where x is a constant or variable but y is not.
                        deg2_l0_eval(tree, result_r.x, cX, op)
                    else
                        result_l = _eval_tree_array(tree.l, cX, operators, Val(fuse_level))
                        !result_l.ok && return result_l
                        result_r = _eval_tree_array(tree.r, cX, operators, Val(fuse_level))
                        !result_r.ok && return result_r
                        # op(x, y), for any x or y
                        deg2_eval(result_l.x, result_r.x, op)
                    end
                end
            )
        end
    end
end

function deg2_eval(
    cumulator_l::AbstractVector{T}, cumulator_r::AbstractVector{T}, op::F
)::ResultOk where {T<:Number,F}
    @inbounds @simd for j in eachindex(cumulator_l)
        x = op(cumulator_l[j], cumulator_r[j])::T
        cumulator_l[j] = x
    end
    return ResultOk(cumulator_l, true)
end

function deg1_eval(
    cumulator::AbstractVector{T}, op::F
)::ResultOk where {T<:Number,F}
    @inbounds @simd for j in eachindex(cumulator)
        x = op(cumulator[j])::T
        cumulator[j] = x
    end
    return ResultOk(cumulator, true)
end

function deg0_eval(tree::Node{T}, cX::AbstractMatrix{T})::ResultOk where {T<:Number}
    if tree.constant
        return ResultOk(fill_similar(tree.val::T, cX, axes(cX, 2)), true)
    else
        return ResultOk(cX[tree.feature, :], true)
    end
end

function deg1_l2_ll0_lr0_eval(
    tree::Node{T}, cX::AbstractMatrix{T}, op::F, op_l::F2
) where {T<:Number,F,F2}
    if tree.l.l.constant && tree.l.r.constant
        val_ll = tree.l.l.val::T
        val_lr = tree.l.r.val::T
        x_l = op_l(val_ll, val_lr)::T
        x = op(x_l)::T
        return ResultOk(fill_similar(x, cX, axes(cX, 2)), true)
    elseif tree.l.l.constant
        val_ll = tree.l.l.val::T
        feature_lr = tree.l.r.feature
        cumulator = similar(cX, axes(cX, 2))
        @inbounds @simd for j in axes(cX, 2)
            x_l = op_l(val_ll, cX[feature_lr, j])::T
            x = isfinite(x_l) ? op(x_l)::T : T(Inf)
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    elseif tree.l.r.constant
        feature_ll = tree.l.l.feature
        val_lr = tree.l.r.val::T
        cumulator = similar(cX, axes(cX, 2))
        @inbounds @simd for j in axes(cX, 2)
            x_l = op_l(cX[feature_ll, j], val_lr)::T
            x = isfinite(x_l) ? op(x_l)::T : T(Inf)
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    else
        feature_ll = tree.l.l.feature
        feature_lr = tree.l.r.feature
        cumulator = similar(cX, axes(cX, 2))
        @inbounds @simd for j in axes(cX, 2)
            x_l = op_l(cX[feature_ll, j], cX[feature_lr, j])::T
            x = isfinite(x_l) ? op(x_l)::T : T(Inf)
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    end
end

# op(op2(x)) for x variable or constant
function deg1_l1_ll0_eval(
    tree::Node{T}, cX::AbstractMatrix{T}, op::F, op_l::F2
) where {T<:Number,F,F2}
    if tree.l.l.constant
        val_ll = tree.l.l.val::T
        x_l = op_l(val_ll)::T
        x = op(x_l)::T
        return ResultOk(fill_similar(x, cX, axes(cX, 2)), true)
    else
        feature_ll = tree.l.l.feature
        cumulator = similar(cX, axes(cX, 2))
        @inbounds @simd for j in axes(cX, 2)
            x_l = op_l(cX[feature_ll, j])::T
            x = isfinite(x_l) ? op(x_l)::T : T(Inf)
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    end
end

# op(x, y) for x and y variable/constant
function deg2_l0_r0_eval(
    tree::Node{T}, cX::AbstractMatrix{T}, op::F
) where {T<:Number,F}
    if tree.l.constant && tree.r.constant
        val_l = tree.l.val::T
        val_r = tree.r.val::T
        x = op(val_l, val_r)::T
        return ResultOk(fill_similar(x, cX, axes(cX, 2)), true)
    elseif tree.l.constant
        cumulator = similar(cX, axes(cX, 2))
        val_l = tree.l.val::T
        feature_r = tree.r.feature
        @inbounds @simd for j in axes(cX, 2)
            x = op(val_l, cX[feature_r, j])::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    elseif tree.r.constant
        cumulator = similar(cX, axes(cX, 2))
        feature_l = tree.l.feature
        val_r = tree.r.val::T
        @inbounds @simd for j in axes(cX, 2)
            x = op(cX[feature_l, j], val_r)::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    else
        cumulator = similar(cX, axes(cX, 2))
        feature_l = tree.l.feature
        feature_r = tree.r.feature
        @inbounds @simd for j in axes(cX, 2)
            x = op(cX[feature_l, j], cX[feature_r, j])::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    end
end

# op(x, y) for x variable/constant, y arbitrary
function deg2_l0_eval(
    tree::Node{T}, cumulator::AbstractVector{T}, cX::AbstractArray{T}, op::F
) where {T<:Number,F}
    if tree.l.constant
        val = tree.l.val::T
        @inbounds @simd for j in eachindex(cumulator)
            x = op(val, cumulator[j])::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    else
        feature = tree.l.feature
        @inbounds @simd for j in eachindex(cumulator)
            x = op(cX[feature, j], cumulator[j])::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    end
end

# op(x, y) for x arbitrary, y variable/constant
function deg2_r0_eval(
    tree::Node{T}, cumulator::AbstractVector{T}, cX::AbstractArray{T}, op::F
) where {T<:Number,F}
    if tree.r.constant
        val = tree.r.val::T
        @inbounds @simd for j in eachindex(cumulator)
            x = op(cumulator[j], val)::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    else
        feature = tree.r.feature
        @inbounds @simd for j in eachindex(cumulator)
            x = op(cumulator[j], cX[feature, j])::T
            cumulator[j] = x
        end
        return ResultOk(cumulator, true)
    end
end

@generated function _eval_constant_tree(tree::Node{T}, operators::OperatorEnum) where {T<:Number}
    nuna = get_nuna(operators)
    nbin = get_nbin(operators)
    quote
        if tree.degree == 0
            return deg0_eval_constant(tree)::ResultOk{Vector{T}}
        elseif tree.degree == 1
            op_idx = tree.op
            return Base.Cartesian.@nif(
                $nuna,
                i -> i == op_idx,
                i -> deg1_eval_constant(
                    tree, operators.unaops[i], operators
                )::ResultOk{Vector{T}}
            )
        else
            op_idx = tree.op
            return Base.Cartesian.@nif(
                $nbin,
                i -> i == op_idx,
                i -> deg2_eval_constant(
                    tree, operators.binops[i], operators
                )::ResultOk{Vector{T}}
            )
        end
    end
end

@inline function deg0_eval_constant(tree::Node{T}) where {T<:Number}
    output = tree.val::T
    return ResultOk([output], true)::ResultOk{Vector{T}}
end

function deg1_eval_constant(tree::Node{T}, op::F, operators::OperatorEnum) where {T<:Number,F}
    result = _eval_constant_tree(tree.l, operators)
    !result.ok && return result
    output = op(result.x[])::T
    return ResultOk([output], isfinite(output))::ResultOk{Vector{T}}
end

function deg2_eval_constant(tree::Node{T}, op::F, operators::OperatorEnum) where {T<:Number,F}
    cumulator = _eval_constant_tree(tree.l, operators)
    !cumulator.ok && return cumulator
    result_r = _eval_constant_tree(tree.r, operators)
    !result_r.ok && return result_r
    output = op(cumulator.x[], result_r.x[])::T
    return ResultOk([output], isfinite(output))::ResultOk{Vector{T}}
end
################################################################################
```

Now, we can see that the forward pass works okay:
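A sketch of that forward call (the same setup as the `run_test` helper earlier in this thread):

```julia
operators = OperatorEnum((+, -), (cos, sin))
x1, x2, x3 = (i -> Node(Float64; feature=i)).(1:3)
tree = Node(1, x1, Node(1, x2))  # == x1 + cos(x2)
X = randn(3, 100)

eval_tree_array(tree, X, operators, Val(1))
eval_tree_array(tree, X, operators, Val(2))  # same output
```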
This evaluates `x1 + cos(x2)` over 100 random rows. Both `Val(1)` and `Val(2)` (`fuse_level=1` and `2`, respectively) will work here and produce the same output. Please use ctrl-F on `fuse_level` to see which parts it activates in the code – it basically just turns on a couple of branches related to "fused" operators (e.g., `sin(exp(x))` evaluated over the data inside a single loop).

Now, if I try compiling the reverse-mode gradient with respect to the input data for fuse level 1:
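Concretely, the call looks like this (a sketch, reusing the `gen_f` closure from the `run_test` snippet earlier in the thread):

```julia
f = gen_f(Val(1))
output = [0.0]
doutput = [1.0]
dX = zeros(size(X))

@time autodiff(
    Reverse,
    f,
    Const(tree),
    Duplicated(X, dX),
    Const(operators),
    Duplicated(output, doutput),
)
```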
This takes about 1 minute to compile. But, once it's compiled, it's pretty fast.
However, if I switch on some of the optimizations (`fuse_level=2`, i.e., the same call but with `Val(2)`):

This seems to hang forever. I left it going for about a day, came back, and it was still running. I'm assuming it will finish eventually, but it's obviously not a good solution, as the existing AD backends with forward-mode autodiff compile in under a second. And if the user changes data types or operators, it will need to recompile again.
If I force it to quit with ctrl-\, I see various LLVM calls. Any idea how to get this scaling better? It seems like some step of the compilation is hanging here, and it is scaling exponentially with the number of branches.
Edit: Updated code MWE to further reduce it.