JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

New precompilation crashes on Julia 1.11-rc1 #55147

Closed. MilesCranmer closed this issue 3 months ago.

MilesCranmer commented 4 months ago

I'm seeing some precompilation crashes on Julia 1.11-rc1 when precompiling DynamicExpressions.jl with DispatchDoctor.jl in use on the package. (DispatchDoctor.jl is basically a package that calls promote_op on each function and uses the result to flag type instabilities.)
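
Roughly, the kind of check DispatchDoctor performs looks like this (a minimal sketch, not its actual implementation; checked_call is just an illustrative name):

function checked_call(f, args...)
    # Ask inference for the return type before making the call.
    T = Base.promote_op(f, map(typeof, args)...)
    if !isconcretetype(T)
        @warn "Type instability detected" f T
    end
    return f(args...)
end

checked_call(+, 1, 2)                    # concrete return type inferred; no warning
checked_call(x -> x > 0 ? 1 : 2.0, -1)   # Union return type inferred; warns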

Here is the traceback:

ERROR: The following 1 direct dependency failed to precompile:

DynamicExpressions 

Failed to precompile DynamicExpressions [a40a106e-89c9-4ca8-8020-a735e8728b6b] to "/Users/mcranmer/.julia/compiled/v1.11/DynamicExpressions/jl_cQE0v5".
[39250] signal 4: Illegal instruction: 4
in expression starting at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/DynamicExpressions.jl:120
_eval_tree_array at /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:301
macro expansion at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/Evaluate.jl:92 [inlined]
#eval_tree_array#2 at /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:306
eval_tree_array at /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:301
#test_all_combinations#1 at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/precompile.jl:7
test_all_combinations at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/precompile.jl:22 [inlined]
macro expansion at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/precompile.jl:168 [inlined]
macro expansion at /Users/mcranmer/.julia/packages/PrecompileTools/L8A3n/src/workloads.jl:78 [inlined]
macro expansion at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/precompile.jl:153 [inlined]
macro expansion at /Users/mcranmer/.julia/packages/PrecompileTools/L8A3n/src/workloads.jl:140 [inlined]
#do_precompilation#2 at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/precompile.jl:138
do_precompilation at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/precompile.jl:161
unknown function (ip: 0x11570c053)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/./julia.h:2156 [inlined]
do_call at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_stmt_value at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:174
eval_body at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:663
jl_interpret_toplevel_thunk at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:821
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_eval_module_expr at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:215 [inlined]
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:743
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:952 [inlined]
ijl_toplevel_eval_in at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:429 [inlined]
include_string at ./loading.jl:2543
_include at ./loading.jl:2603
include at ./Base.jl:558 [inlined]
include_package_for_output at ./loading.jl:2721
jfptr_include_package_for_output_69600.1 at /Users/mcranmer/.julia/juliaup/julia-1.11.0-rc1+0.aarch64.apple.darwin14/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/./julia.h:2156 [inlined]
do_call at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_stmt_value at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:174
eval_body at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:663
jl_interpret_toplevel_thunk at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/interpreter.c:821
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:952 [inlined]
ijl_toplevel_eval_in at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:429 [inlined]
include_string at ./loading.jl:2543
include_string at ./loading.jl:2553 [inlined]
exec_options at ./client.jl:316
_start at ./client.jl:526
jfptr__start_71098.1 at /Users/mcranmer/.julia/juliaup/julia-1.11.0-rc1+0.aarch64.apple.darwin14/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/./julia.h:2156 [inlined]
true_main at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-11/src/jlapi.c:1059
Allocations: 84690299 (Pool: 84689426; Big: 873); GC: 4

versioninfo:

julia> versioninfo()
Julia Version 1.11.0-rc1
Commit 3a35aec36d1 (2024-06-25 10:23 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 6 default, 0 interactive, 3 GC (on 6 virtual cores)
Environment:
  JULIA_FORMATTER_SO = /Users/mcranmer/julia_formatter.so
  JULIA_NUM_THREADS = auto
  JULIA_OPTIMIZE = 3
  JULIA_EDITOR = code

I installed Julia with juliaup. To reproduce this issue, you can run the following code:

cd $(mktemp -d)
# Install package
julia +1.11 --startup-file=no --project=. -e 'using Pkg; pkg"add Preferences DynamicExpressions DispatchDoctor@v0.4.10"'
# Enable DispatchDoctor.jl
julia +1.11 --startup-file=no --project=. -e 'using Preferences; set_preferences!("DynamicExpressions", "instability_check" => "warn")'
# Precompile:
julia +1.11 --startup-file=no --project=. -e 'using Pkg; pkg"precompile"'

I can prevent this error with the following PR on DispatchDoctor.jl: https://github.com/MilesCranmer/DispatchDoctor.jl/compare/094b1651eeef3fb2017be46a48f0da13724e1123~...b223a4d033dd3d17901879871c508ae33cfd550a. The PR basically amounts to changing some functions into @generated form:

- map_specializing_typeof(args...) = map(specializing_typeof, args)
+ map_specializing_typeof(args::Tuple) = map(specializing_typeof, args)

- _promote_op(f, S::Type...) = Base.promote_op(f, S...)
- _promote_op(f, S::Tuple) = _promote_op(f, S...)
+ function _promote_op(f, S::Vararg{Type})
+     if @generated
+         :(Base.promote_op(f, S...))
+     else
+         Base.promote_op(f, S...)
+     end
+ end
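
For context, here is a minimal sketch of how the optionally-generated if @generated pattern behaves (reverse_tuple is just an illustrative function, not from DispatchDoctor): inside the if @generated branch the argument names are bound to the argument types and the branch returns an expression, while the else branch is an ordinary runtime fallback with the same semantics.

function reverse_tuple(t::Tuple)
    if @generated
        # Here `t` is the Tuple type, e.g. Tuple{Int, Float64, Symbol}.
        N = length(t.parameters)
        ex = Expr(:tuple)
        for i in N:-1:1
            push!(ex.args, :(t[$i]))
        end
        return ex
    else
        # Ordinary runtime fallback with identical behavior.
        return ntuple(i -> t[length(t) + 1 - i], length(t))
    end
end

reverse_tuple((1, 2.0, :x))  # (:x, 2.0, 1)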

However, it doesn't seem like DispatchDoctor.jl or DynamicExpressions.jl is doing anything wrong, so I'm not sure what's going on. Both the before and after versions seem to be valid Julia code. Also, the downside of that PR is that it introduces a type instability in Zygote autodiff, and there doesn't seem to be a way to both prevent the segfault and eliminate that type instability.

I don't understand the conditions for reproducing this, so this is my only example so far. When I make various tweaks to _promote_op within DispatchDoctor.jl, I seem to end up with different segfaults – one of which is the "Unreachable reached" bug.

cc @avik-pal

MilesCranmer commented 3 months ago

@KristofferC I mean, of course I agree with the general sentiment, but some of the code it's crashing on while precompiling involves in(::Any, ::Tuple) calls, so it's at least plausible.

MilesCranmer commented 3 months ago

@aviatesk I still see the bug. Can you provide details on your system? And to confirm, you saw the Illegal instruction, rather than a different bug? Was it the same stacktrace?

aviatesk commented 3 months ago

Yes, I am seeing exactly the same segfault, but the patch I applied yesterday only appeared to eliminate the segfault because of the order of precompilation (specifically, changes to Preferences significantly alter DD's behavior, but preference changes do not necessarily trigger re-precompilation). If I clear the precompilation cache manually, the same error occurs even with both patches applied.

aviatesk commented 3 months ago

Initially, I thought it was caused by the impureness of the @generated function, but that does not seem to be the case, because the issue persists even if we make DD functionality unreachable from any of the generators. The problem is probably very subtle and due to the instability of Core.Compiler.return_type itself. Specifically, my PR enables concrete evaluation of promote_op, giving inference more opportunities to use and cache Core.Compiler.return_type results, which I believe is causing the issue (though I'm not entirely sure how it ends up surfacing as a segfault). I have confirmed that applying @invokelatest to promote_op to stop inference, or using Base.infer_return_type, avoids the issue. The use of Base.promote_op (and internally Core.Compiler.return_type) should be avoided in the first place (as stated in the documentation of Base.promote_op, xref), especially in packages like DD that change their behavior based on type inference results. Therefore, the best solution might be for DD to stop using Base.promote_op and switch to Base.infer_return_type.
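
For reference, minimal sketches of those two workarounds (the _promote_op_* names are just illustrative, not DD's actual API):

# Option 1 (mentioned above): wrap the call in @invokelatest to stop inference from evaluating it.
_promote_op_invokelatest(f, S::Type...) = Base.@invokelatest Base.promote_op(f, S...)

# Option 2: query the reflection entry point directly.
_promote_op_infer(f, S::Type...) = Base.infer_return_type(f, Tuple{S...})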

aviatesk commented 3 months ago

This particular issue may not be related to @generated functions (though it is certainly a good idea to avoid calling type inference reflection from generators). The problem is likely the use of Base.promote_op itself, and applying the following patch can suppress the segfault while maintaining DD's behavior:

diff --git a/src/stabilization.jl b/src/stabilization.jl
index fdfb60f..310fe7d 100644
--- a/src/stabilization.jl
+++ b/src/stabilization.jl
@@ -115,7 +115,7 @@ function _stabilize_all(ex::Expr, downward_metadata::DownwardMetadata; kws...)
                 @assert length(upward_metadata.unused_macros) == length(upward_metadata.macro_keys)

                 new_ex = Expr(:macrocall, upward_metadata.unused_macros[end]..., inner_ex)
-                new_upward_metadata = UpwardMetadata(; 
+                new_upward_metadata = UpwardMetadata(;
                     matching_function = upward_metadata.matching_function,
                     unused_macros = upward_metadata.unused_macros[1:end-1],
                     macro_keys = upward_metadata.macro_keys[1:end-1],
diff --git a/src/utils.jl b/src/utils.jl
index 197d0be..66fb235 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -97,7 +97,7 @@ specializing_typeof(::Type{T}) where {T} = Type{T}
 specializing_typeof(::Val{T}) where {T} = Val{T}
 map_specializing_typeof(args...) = map(specializing_typeof, args)

-_promote_op(f, S::Type...) = Base.promote_op(f, S...)
+_promote_op(f, S::Type...) = Base.infer_return_type(f, S)
 _promote_op(f, S::Tuple) = _promote_op(f, S...)
 @static if isdefined(Core, :kwcall)
     function _promote_op(
@@ -120,7 +120,7 @@ return false for `Union{}`, so that errors can propagate.
 # so we implement a workaround.
 @inline type_instability(::Type{Type{T}}) where {T} = type_instability(T)

-@generated function type_instability_limit_unions(
+function type_instability_limit_unions(
     ::Type{T}, ::Val{union_limit}
 ) where {T,union_limit}
     if T isa UnionAll

While we ideally need to improve the reliability of Core.Compiler.return_type, that is a rather complex and difficult problem (so it remains buggy for now). In the meantime, it would be better to avoid its use on the DD side to circumvent the issue.

MilesCranmer commented 3 months ago

Thanks for all of this, I appreciate it.

So, looking into this alternative, one issue with Base.infer_return_type is that its result is not known at compile time, meaning that LLVM can no longer strip the instability check from stable functions:

julia> using DispatchDoctor  # With the patch in https://github.com/JuliaLang/julia/issues/55147#issuecomment-2262472199

julia> @stable f(x) = x
f (generic function with 1 method)

julia> @code_llvm f(1)
; Function Signature: f(Int64)
;  @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:301 within `f`
define i64 @julia_f_6223(i64 signext %"x::Int64") #0 {
top:
  %jlcallframe1 = alloca [6 x ptr], align 8
  %gcframe2 = alloca [8 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 64, i1 true)
  %0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
  %"new::OptimizationParams" = alloca { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, align 8
  %1 = alloca { i64, { ptr, [1 x i64] }, ptr, { i64, i64, i64, i64, i64, i8, i8, i8, i8, i8 }, { i8, i64, i64, i64, i64, i64, i8, i8, i8 } }, align 8
  %pgcstack = call ptr inttoptr (i64 6490275500 to ptr)(i64 261) #5
  store i64 24, ptr %gcframe2, align 16
  %task.gcstack = load ptr, ptr %pgcstack, align 8
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %pgcstack, align 8
; ┌ @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/utils.jl:101 within `_promote_op` @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/utils.jl:100
; │┌ @ reflection.jl:1872 within `infer_return_type`
; ││┌ @ reflection.jl:2586 within `get_world_counter`
     %2 = call i64 @jlplt_ijl_get_world_counter_6226_got.jit()
; ││└
; ││┌ @ compiler/types.jl:373 within `NativeInterpreter`
; │││┌ @ compiler/types.jl:321 within `OptimizationParams`
; ││││┌ @ compiler/utilities.jl:506 within `inlining_enabled`
; │││││┌ @ options.jl:68 within `JLOptions`
; ││││││┌ @ pointer.jl:153 within `unsafe_load` @ pointer.jl:153
         %pointerref.sroa.1.0.copyload = load i8, ptr getelementptr inbounds (i8, ptr @jl_options.found.jit, i64 110), align 2
; │││││└└
; │││││┌ @ promotion.jl:483 within `==` @ promotion.jl:639
        %3 = icmp eq i8 %pointerref.sroa.1.0.copyload, 1
; ││││└└
; ││││ @ compiler/types.jl:321 within `OptimizationParams` @ compiler/types.jl:321
; ││││┌ @ compiler/types.jl:341 within `#OptimizationParams#314`
; │││││┌ @ compiler/types.jl:309 within `OptimizationParams`
        %4 = zext i1 %3 to i8
        store i8 %4, ptr %"new::OptimizationParams", align 8
        %5 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 1
        store <2 x i64> <i64 100, i64 1000>, ptr %5, align 8
        %6 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 3
        store <2 x i64> <i64 250, i64 20>, ptr %6, align 8
        %7 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 5
        store i64 32, ptr %7, align 8
        %8 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 6
        store i8 1, ptr %8, align 8
        %9 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 7
        store i8 0, ptr %9, align 1
        %10 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 8
        store i8 0, ptr %10, align 2
; │││└└└
     call void @"j_#NativeInterpreter#315_6236"(ptr noalias nocapture noundef nonnull sret({ i64, { ptr, [1 x i64] }, ptr, { i64, i64, i64, i64, i64, i8, i8, i8, i8, i8 }, { i8, i64, i64, i64, i64, i64, i8, i8, i8 } }) %1, ptr noalias nocapture noundef nonnull %0, ptr nocapture nonnull readonly @"_j_const#10", ptr nocapture nonnull readonly %"new::OptimizationParams", i64 zeroext %2)
; ││└
    %11 = call nonnull ptr @"j_#infer_return_type#34_6238"(i64 zeroext %2, ptr nocapture nonnull readonly %1, ptr nonnull readonly @"jl_global#6239.jit", ptr nonnull readonly @"jl_global#6240.jit")
; └└
; ... (truncated)

Whereas with the current unpatched use of Base.promote_op, the instability check can be removed by the compiler:

julia> @code_llvm f(1)
;  @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:301 within `f`
define i64 @julia_f_1198(i64 signext %0) #0 {
top:
;  @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:306 within `f`
  ret i64 %0
}

Ideally I would like for DispatchDoctor to use whatever list comprehensions and map(f, x) use for type inference, as those give types known at compile time. My assumption was that this is what Base.promote_op was used for. Do those methods use something else?

Also, let me know if I can do anything to help with debugging Core.Compiler.return_type.

aviatesk commented 3 months ago

Ideally I would like for DispatchDoctor to use whatever list comprehensions and map(f, x) use for type inference, as those give types known at compile time. My assumption was that this is what Base.promote_op was used for.

Well, this is exactly why Core.Compiler.return_type is buggy and its use should be avoided if possible. Maybe reading through the documentation of promote_op or the discussion at https://github.com/JuliaLang/julia/pull/44340 may clarify it. It is possible to debug Core.Compiler.return_type, but this is a very subtle issue, and debugging it is quite a laborious task. I think the problem might be due to the difference between the world that Core.Compiler.return_type sees and the actual runtime world, but I am not fully sure. This issue might only occur during precompilation.

aviatesk commented 3 months ago

List comprehensions and map reluctantly use Core.Compiler.return_type for cases where the iterator they return has 0 elements, but probably that use should be replaced with Base.infer_return_type (although this might not be possible from a performance perspective). Having said that, they only change the type of the container they return based on type inference, so the impact of Core.Compiler.return_type's unreliability is minimal. Unfortunately, in the case of DD, behavior changes significantly based on the results of type inference, such as printing or raising exceptions, leading to the worst-case scenario of a segfault. When Core.Compiler.return_type returns different results, code may hit branches that should not be executed, or vice versa.
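
For example, with an empty input the element type of the result comes purely from inference, and nothing else about the program's behavior depends on it:

julia> x = map(i -> i + 0.5, 1:0)
Float64[]

julia> eltype(x)
Float64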

MilesCranmer commented 3 months ago

Here's an example of what I mean with map:

julia> function fake_promote_op(f::F, args...) where {F}
           x = map(_ -> f(args...), 1:0)
           return eltype(x)
       end

Note that this never actually calls f(args...). All it does is type inference, because the 1:0 is empty.

However, this is completely type stable, known at compile time, and is correct:

julia> fake_promote_op(+, 1.0)
Float64

julia> @code_typed fake_promote_op(+, 1.0)
CodeInfo(
1 ─     return Float64
) => Type{Float64}

julia> @code_llvm fake_promote_op(+, 1.0)
;  @ REPL[11]:1 within `fake_promote_op`
define nonnull {}* @julia_fake_promote_op_1386(double %0) #0 {
top:
;  @ REPL[11]:3 within `fake_promote_op`
  ret {}* inttoptr (i64 4823715024 to {}*)
}

Should I use this fake_promote_op instead of Base.promote_op?

MilesCranmer commented 3 months ago

List comprehensions and map reluctantly use Core.Compiler.return_type for cases where the iterator they return has 0 elements, but probably that use should be replaced with Base.infer_return_type

If Core.Compiler.return_type is not safe, but map and list comprehensions use it for empty collections, then perhaps we should try to fix it within Julia?

in the case of DD, the behavior changes based on the results of type inference significantly, such as printing or raising exceptions, leading to the worst-case scenario of a segfault.

I think this is also true more broadly, not just for DD. I would assume that empty collections generated by map are a quite common pattern, no? If creating empty collections can cause segfaults, it seems we should fix this in Base rather than just in the DD library.

MilesCranmer commented 3 months ago

It's even more prevalent because the length of a collection can depend on a value rather than a type, meaning that Core.Compiler.return_type always appears in a compiled branch of map and list comprehensions over a Vector, right?
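
For instance (g is just a hypothetical function):

g(v::Vector{Int}) = map(x -> x + 0.5, v)  # whether v is empty is only known at runtime
g(Int[])     # takes the empty path, whose element type comes from inference
g([1, 2])    # takes the non-empty path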

aviatesk commented 3 months ago

Please recognize that the risks inherent in list comprehensions and map are of an entirely different dimension from the risks associated with DD. List comprehensions and map do not themselves cause dangerous events like segfaults. Such events can occur when a package makes bad assumptions about the types returned from Core.Compiler.return_type (including the types of values returned by these basic functions) and builds its codebase on those assumptions. DD is an extreme case where such assumptions are made, and it is causing actual segfaults, so it needs to be fixed. But that doesn't mean we should stop using map and list comprehensions entirely, since we can use them safely as long as we don't make such assumptions.
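
To illustrate the distinction (safe and risky are just illustrative names):

# Fine: inference only determines the type parameter of the (possibly empty) result.
safe(xs) = map(x -> x + 1, xs)

# The kind of assumption I mean: for an empty xs, the observable result now
# depends on what inference returned.
risky(xs) = eltype(map(x -> x + 1, xs)) === Int ? "stable" : "unstable"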

Just to be sure, I confirmed that the issue is not with the any function itself by applying the following patch to DD. The same segfault occurs with the patched DD, indicating that the problem lies in DD's use of Core.Compiler.return_type and that the relevant Julia base commit only brought it to light:

diff --git a/src/stabilization.jl b/src/stabilization.jl
index fdfb60f..310fe7d 100644
--- a/src/stabilization.jl
+++ b/src/stabilization.jl
@@ -115,7 +115,7 @@ function _stabilize_all(ex::Expr, downward_metadata::DownwardMetadata; kws...)
                 @assert length(upward_metadata.unused_macros) == length(upward_metadata.macro_keys)

                 new_ex = Expr(:macrocall, upward_metadata.unused_macros[end]..., inner_ex)
-                new_upward_metadata = UpwardMetadata(; 
+                new_upward_metadata = UpwardMetadata(;
                     matching_function = upward_metadata.matching_function,
                     unused_macros = upward_metadata.unused_macros[1:end-1],
                     macro_keys = upward_metadata.macro_keys[1:end-1],
diff --git a/src/utils.jl b/src/utils.jl
index 197d0be..19d2b7b 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -97,7 +97,35 @@ specializing_typeof(::Type{T}) where {T} = Type{T}
 specializing_typeof(::Val{T}) where {T} = Val{T}
 map_specializing_typeof(args...) = map(specializing_typeof, args)

-_promote_op(f, S::Type...) = Base.promote_op(f, S...)
+# `promote_op` without any usage of `any`
+let ex = Expr(:block)
+    function gen_case_n(n)
+        blk = Expr(:block)
+        ret = :(Tuple{})
+        for i = 1:n
+            push!(blk.args, :(args[$i] === Union{} && return Union{}))
+            push!(ret.args, :(args[$i]))
+        end
+        return :(if length(args) == $n
+            $blk
+            return $ret
+        end)
+    end
+    for i = 0:15
+        push!(ex.args, gen_case_n(i))
+    end
+    @eval function TupleOrBottom_without_any(args...)
+        $ex
+    end
+end
+
+function promote_op_without_any(f, S::Type...)
+    argT = TupleOrBottom_without_any(S...)
+    argT === Union{} && return Union{}
+    return Core.Compiler.return_type(f, argT)
+end
+
+_promote_op(f, S::Type...) = promote_op_without_any(f, S...)
 _promote_op(f, S::Tuple) = _promote_op(f, S...)
 @static if isdefined(Core, :kwcall)
     function _promote_op(
@@ -120,7 +148,7 @@ return false for `Union{}`, so that errors can propagate.
 # so we implement a workaround.
 @inline type_instability(::Type{Type{T}}) where {T} = type_instability(T)

-@generated function type_instability_limit_unions(
+function type_instability_limit_unions(
     ::Type{T}, ::Val{union_limit}
 ) where {T,union_limit}
     if T isa UnionAll

MilesCranmer commented 3 months ago

But that doesn't mean we should stop using map and list comprehensions entirely, since we can use them safely as long as we don't make such assumptions.

What assumptions do you mean? Is it that DD dispatches on the return value of Base.promote_op? Isn't dispatching on the return type of a map or list comprehension the exact same thing?

For example, will the following code potentially lead to a segfault?

g() = [i for i in 1:0]
h() = g() isa Vector{Int}
@generated function foo()
    if h()
        return :(1)
    else
        return :(2)
    end
end

aviatesk commented 3 months ago

I don't know the exact mechanism causing the segfault in DD. Your case is one example of making a bad assumption. This code itself may not cause a segfault, but since the value returned by foo is not well defined (it depends on what inference concluded for the empty comprehension), it could potentially lead to dangerous behavior depending on how it is used.

MilesCranmer commented 3 months ago

Consider that I could just rewrite DispatchDoctor to use

x = []
T = eltype(map(_ -> f(args...), x))
# equivalent to: T = Base.promote_op(f, typeof.(args)...)

I'm not doing anything fancy here; this type of code could be found in anybody's library. No Julia internals. Yet this code is implicitly using Base.promote_op.
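
For instance, this kind of generic code (a sketch, not the actual stdlib code linked below) has to consult inference for the empty case:

function apply_all(f, xs::AbstractVector)
    # For an empty input, the output element type can only come from inference.
    isempty(xs) && return similar(xs, Base.promote_op(f, eltype(xs)))
    return map(f, xs)
end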

This type of code is even used within the Julia standard library:

https://github.com/JuliaLang/julia/blob/df1976d500914bebd5651bf56c13153f49e153f9/stdlib/LinearAlgebra/src/dense.jl#L344

or

https://github.com/JuliaSparse/SparseArrays.jl/blob/e61663ad0a79a48906b0b12d53506e731a614ab8/src/sparsematrix.jl#L4030

or

https://github.com/JuliaSparse/SparseArrays.jl/blob/e61663ad0a79a48906b0b12d53506e731a614ab8/src/sparsematrix.jl#L4205

If it is a bad assumption to use the result of a map or list comprehension in any code referenced by @generated functions, I don't think that is well known. Basically, I think Core.Compiler.return_type should be fixed.

Not to mention there are 1,400+ files on GitHub that make explicit use of that symbol: https://github.com/search?q=/Core.Compiler.return_type/+language:julia&type=code. I've already added a workaround for the main issue on 1.11.0, but it seems like this could also be causing other similar issues such as https://github.com/JuliaLang/julia/issues/53847, https://github.com/JuliaLang/julia/issues/53843, https://github.com/JuliaLang/julia/issues/53848, https://github.com/JuliaLang/julia/issues/53761, ..., which might be due to a similar phenomenon. This is certainly the one whose cause we've drilled into, but maybe the others are caused by the same Julia bug.

MilesCranmer commented 3 months ago

I guess this goes back to your earlier comment:

List comprehensions and map reluctantly use Core.Compiler.return_type for cases where the iterator they return has 0 elements, but probably that use should be replaced with Base.infer_return_type (although this might not be possible from a performance perspective). Having said that, they only change the type of the container they return based on type inference, so the impact of Core.Compiler.return_type's unreliability is minimal. Unfortunately, in the case of DD, behavior changes significantly based on the results of type inference, such as printing or raising exceptions, leading to the worst-case scenario of a segfault. When Core.Compiler.return_type returns different results, code may hit branches that should not be executed, or vice versa.

I think "unreliability is minimal" is pulling a lot of weight here. I think any unreliability should be fixed at the source, rather than shooting the messenger. Regarding this comment:

behavior changes significantly based on the results of type inference, such as printing or raising exceptions, leading to the worst-case scenario of a segfault

DD's usage actually feels like a very minimal change in behavior, no? DD just turns logging on or off based on the result of return-type inference. The usage of Base.promote_op won't actually change the output of a calculation, for example. All bugs flagged by DD are loud and easily detected. If anything, DD is probably minimally affected by bugs in Core.Compiler.return_type – it just calls it frequently enough that major issues like segfaults pop up more often.

It's almost like the phenomenon caused by DD here is an "rr chaos mode" for the type inference system. If a bug shows up in rr chaos mode, it's still a bug. Similarly, if a segfault shows up from DD's usage of promote_op, it's still a segfault that should be patched in Base.

vtjnash commented 3 months ago

Looking closer into this, it appears this segfaults because the DynamicExpressions module defined the return type of its any simulator to depend on its own Core.Compiler.return_type result, which is non-computable by inference. Inference attempts to infer it anyway, and we have too many try/catch statements inside the Compiler, so it quickly loses track of what went wrong after this hits a StackOverflow and instead starts to infer nonsense for promote_op:

Tuple{DynamicExpressions.NodeModule.var"###any_simulator#430", DynamicExpressions.NodeModule.var"#75#77"{DynamicExpressions.NodeUtilsModule.var"#13#15"}, DynamicExpressions.NodeModule.Node{Float32}}

I haven't been able to make an MWE that simulates this, however.

MilesCranmer commented 3 months ago

Very interesting, thanks for the note.

Maybe a MWE of this inference stack overflow would be the following?

foo() = [foo() for _ in 1:0]

Or would this not replicate the issue? (On mobile so can’t check)