Closed: MilesCranmer closed this issue 3 months ago
@KristofferC I mean, of course I agree with the general sentiment, but some of the code that crashes while precompiling involves `in(::Any, ::Tuple)` calls, so it's at least plausible.
@aviatesk I still see the bug. Can you provide details on your system? And to confirm, you saw the `Illegal instruction`, rather than a different bug? Was it the same stacktrace?
Yes, I am seeing exactly the same segfault; the patch I applied yesterday only appeared to eliminate it because of the order of precompilation (specifically, changes in Preferences significantly alter DD's behavior, but preference changes do not necessarily trigger re-precompilation). If I clear the precompilation cache manually, the same error occurs even with both patches applied.
Initially, I thought it was caused by the impurity of the `@generated` function, but that seems not to be the case, because the issue persists even if we make DD functionality unreachable from any of the generators.
The problem is probably very subtle and due to the unreliability of `Core.Compiler.return_type` itself. Specifically, my PR enables concrete evaluation of `promote_op`, giving inference more opportunities to use and cache `Core.Compiler.return_type` results, which I believe is causing the issue (though I'm not entirely sure how it ends up surfacing as a segfault). I have confirmed that applying `@invokelatest` to `promote_op` to stop inference, or using `Base.infer_return_type`, avoids this issue.
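For concreteness, the two workarounds mentioned above can be sketched roughly as follows (the `_promote_op_*` names are made up for illustration; `Base.infer_return_type` only exists on sufficiently recent Julia):

```julia
# Option 1: hide the call behind `@invokelatest` so that inference does not
# constant-fold and cache the `Core.Compiler.return_type` result.
_promote_op_latest(f, S::Type...) = @invokelatest Base.promote_op(f, S...)

# Option 2 (recent Julia only): use the reflection entry point, which runs
# inference at the current world age instead of being folded statically.
@static if isdefined(Base, :infer_return_type)
    _promote_op_reflect(f, S::Type...) = Base.infer_return_type(f, Tuple{S...})
end
```

Either variant keeps the same observable answer while preventing inference from baking the `return_type` result into compiled code.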
The use of `Base.promote_op` (and, internally, `Core.Compiler.return_type`) should be avoided in the first place (as stated in the documentation of `Base.promote_op`, xref), especially in packages like DD that change their behavior based on type-inference results. Therefore, the best solution might be for DD to stop using `Base.promote_op` and switch to `Base.infer_return_type`.
This particular issue may not be related to `@generated` functions (though it is certainly a good idea to avoid calling type-inference reflection from generators). The problem is likely the use of `Base.promote_op` itself, and applying the following patch suppresses the segfault while maintaining DD's behavior:
```diff
diff --git a/src/stabilization.jl b/src/stabilization.jl
index fdfb60f..310fe7d 100644
--- a/src/stabilization.jl
+++ b/src/stabilization.jl
@@ -115,7 +115,7 @@ function _stabilize_all(ex::Expr, downward_metadata::DownwardMetadata; kws...)
     @assert length(upward_metadata.unused_macros) == length(upward_metadata.macro_keys)
     new_ex = Expr(:macrocall, upward_metadata.unused_macros[end]..., inner_ex)
-    new_upward_metadata = UpwardMetadata(;
+    new_upward_metadata = UpwardMetadata(;
         matching_function = upward_metadata.matching_function,
         unused_macros = upward_metadata.unused_macros[1:end-1],
         macro_keys = upward_metadata.macro_keys[1:end-1],
diff --git a/src/utils.jl b/src/utils.jl
index 197d0be..66fb235 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -97,7 +97,7 @@ specializing_typeof(::Type{T}) where {T} = Type{T}
 specializing_typeof(::Val{T}) where {T} = Val{T}
 map_specializing_typeof(args...) = map(specializing_typeof, args)
-_promote_op(f, S::Type...) = Base.promote_op(f, S...)
+_promote_op(f, S::Type...) = Base.infer_return_type(f, S)
 _promote_op(f, S::Tuple) = _promote_op(f, S...)
 @static if isdefined(Core, :kwcall)
     function _promote_op(
@@ -120,7 +120,7 @@ return false for `Union{}`, so that errors can propagate.
 # so we implement a workaround.
 @inline type_instability(::Type{Type{T}}) where {T} = type_instability(T)
-@generated function type_instability_limit_unions(
+function type_instability_limit_unions(
     ::Type{T}, ::Val{union_limit}
 ) where {T,union_limit}
     if T isa UnionAll
```
While we ideally need to improve the reliability of `Core.Compiler.return_type`, that is a rather complex and difficult problem (so it remains buggy for now). In the meantime, it would be better to avoid its use on the DD side to circumvent the issue.
Thanks for all of this, I appreciate it.
So, looking into this alternative, one issue with `Base.infer_return_type` is that its result is not known at compile time, meaning that LLVM can no longer strip the instability check for stable functions:
```julia
julia> using DispatchDoctor # With the patch in https://github.com/JuliaLang/julia/issues/55147#issuecomment-2262472199

julia> @stable f(x) = x
f (generic function with 1 method)

julia> @code_llvm f(1)
; Function Signature: f(Int64)
; @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:301 within `f`
define i64 @julia_f_6223(i64 signext %"x::Int64") #0 {
top:
  %jlcallframe1 = alloca [6 x ptr], align 8
  %gcframe2 = alloca [8 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 64, i1 true)
  %0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
  %"new::OptimizationParams" = alloca { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, align 8
  %1 = alloca { i64, { ptr, [1 x i64] }, ptr, { i64, i64, i64, i64, i64, i8, i8, i8, i8, i8 }, { i8, i64, i64, i64, i64, i64, i8, i8, i8 } }, align 8
  %pgcstack = call ptr inttoptr (i64 6490275500 to ptr)(i64 261) #5
  store i64 24, ptr %gcframe2, align 16
  %task.gcstack = load ptr, ptr %pgcstack, align 8
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %pgcstack, align 8
; ┌ @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/utils.jl:101 within `_promote_op` @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/utils.jl:100
; │┌ @ reflection.jl:1872 within `infer_return_type`
; ││┌ @ reflection.jl:2586 within `get_world_counter`
  %2 = call i64 @jlplt_ijl_get_world_counter_6226_got.jit()
; ││└
; ││┌ @ compiler/types.jl:373 within `NativeInterpreter`
; │││┌ @ compiler/types.jl:321 within `OptimizationParams`
; ││││┌ @ compiler/utilities.jl:506 within `inlining_enabled`
; │││││┌ @ options.jl:68 within `JLOptions`
; ││││││┌ @ pointer.jl:153 within `unsafe_load` @ pointer.jl:153
  %pointerref.sroa.1.0.copyload = load i8, ptr getelementptr inbounds (i8, ptr @jl_options.found.jit, i64 110), align 2
; │││││└└
; │││││┌ @ promotion.jl:483 within `==` @ promotion.jl:639
  %3 = icmp eq i8 %pointerref.sroa.1.0.copyload, 1
; ││││└└
; ││││ @ compiler/types.jl:321 within `OptimizationParams` @ compiler/types.jl:321
; ││││┌ @ compiler/types.jl:341 within `#OptimizationParams#314`
; │││││┌ @ compiler/types.jl:309 within `OptimizationParams`
  %4 = zext i1 %3 to i8
  store i8 %4, ptr %"new::OptimizationParams", align 8
  %5 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 1
  store <2 x i64> <i64 100, i64 1000>, ptr %5, align 8
  %6 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 3
  store <2 x i64> <i64 250, i64 20>, ptr %6, align 8
  %7 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 5
  store i64 32, ptr %7, align 8
  %8 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 6
  store i8 1, ptr %8, align 8
  %9 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 7
  store i8 0, ptr %9, align 1
  %10 = getelementptr inbounds { i8, i64, i64, i64, i64, i64, i8, i8, i8 }, ptr %"new::OptimizationParams", i64 0, i32 8
  store i8 0, ptr %10, align 2
; │││└└└
  call void @"j_#NativeInterpreter#315_6236"(ptr noalias nocapture noundef nonnull sret({ i64, { ptr, [1 x i64] }, ptr, { i64, i64, i64, i64, i64, i8, i8, i8, i8, i8 }, { i8, i64, i64, i64, i64, i64, i8, i8, i8 } }) %1, ptr noalias nocapture noundef nonnull %0, ptr nocapture nonnull readonly @"_j_const#10", ptr nocapture nonnull readonly %"new::OptimizationParams", i64 zeroext %2)
; ││└
  %11 = call nonnull ptr @"j_#infer_return_type#34_6238"(i64 zeroext %2, ptr nocapture nonnull readonly %1, ptr nonnull readonly @"jl_global#6239.jit", ptr nonnull readonly @"jl_global#6240.jit")
; └└
; ... (truncated)
```
Whereas with the current unpatched use of `Base.promote_op`, the instability check can be removed by the compiler:
```julia
julia> @code_llvm f(1)
; @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:301 within `f`
define i64 @julia_f_1198(i64 signext %0) #0 {
top:
; @ /Users/mcranmer/PermaDocuments/DispatchDoctor.jl/src/stabilization.jl:306 within `f`
  ret i64 %0
}
```
Ideally I would like DispatchDoctor to use whatever list comprehensions and `map(f, x)` use for type inference, as those give types known at compile time. My assumption was that this is what `Base.promote_op` was used for. Do those methods use something else?

Also, let me know if I can do anything to help with debugging `Core.Compiler.return_type`.
> Ideally I would like for DispatchDoctor to use whatever list comprehensions and `map(f, x)` use for type inference, as those give types known at compile time. My assumption was that this is what `Base.promote_op` was used for.
Well, this is exactly the sense in which `Core.Compiler.return_type` is buggy and why its use should be avoided if possible. Reading through the documentation of `promote_op` or the discussion at https://github.com/JuliaLang/julia/pull/44340 may clarify it.
It is possible to debug `Core.Compiler.return_type`, but this is a very subtle issue, and debugging it is quite a laborious task. I think the problem might be due to a difference between the world that `Core.Compiler.return_type` sees and the actual runtime world, but I am not fully sure. This issue might only occur during precompilation.
List comprehensions and `map` reluctantly use `Core.Compiler.return_type` for the case where the iterator they return has 0 elements, and that use should probably be replaced with `Base.infer_return_type` (although this might not be possible from a performance perspective). Having said that, they only change the element type of the iterator they return based on type inference, so the impact of `Core.Compiler.return_type`'s unreliability is minimal. Unfortunately, in the case of DD, behavior changes significantly based on the results of type inference, such as printing or raising exceptions, leading to the worst-case scenario of a segfault. When `Core.Compiler.return_type` returns different results, it may hit branches that should not be executed, or vice versa.
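To illustrate the "minimal impact" point (a small sketch, not from the thread): for an empty collection, the inferred result only determines the element-type label of the container, and no control flow depends on the answer:

```julia
# The only thing inference decides here is the element type of the
# (empty) result; nothing user-visible branches on it.
xs = [sqrt(x) for x in Int[]]   # `sqrt` is never actually called
typeof(xs)                      # Vector{Float64}, from inference alone
isempty(xs)                     # true
```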
Here's an example of what I mean with `map`:
```julia
julia> function fake_promote_op(f::F, args...) where {F}
           x = map(_ -> f(args...), 1:0)
           return eltype(x)
       end
```
Note that this never actually calls `f(args...)`. All it does is type inference, because `1:0` is empty. However, this is completely type stable, known at compile time, and correct:
```julia
julia> fake_promote_op(+, 1.0)
Float64

julia> @code_typed fake_promote_op(+, 1.0)
CodeInfo(
1 ─ return Float64
) => Type{Float64}

julia> @code_llvm fake_promote_op(+, 1.0)
; @ REPL[11]:1 within `fake_promote_op`
define nonnull {}* @julia_fake_promote_op_1386(double %0) #0 {
top:
; @ REPL[11]:3 within `fake_promote_op`
  ret {}* inttoptr (i64 4823715024 to {}*)
}
```
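As a sanity check (a small experiment, not from the thread) that `f` really is never executed, one can count calls with a `Ref`:

```julia
julia> calls = Ref(0);   # counter mutated on every call to `counted`

julia> counted(x) = (calls[] += 1; x + 1.0);

julia> fake_promote_op(counted, 1)   # inferred return type of counted(::Int)
Float64

julia> calls[]   # still zero: only inference ran, not the function
0
```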
Should I use this `fake_promote_op` instead of `Base.promote_op`?
> List comprehensions and `map` reluctantly use `Core.Compiler.return_type` for cases where the iterator they return has 0 elements, but probably that use should be replaced with `Base.infer_return_type`
If `Core.Compiler.return_type` is not safe, but `map` and list comprehensions use it for empty collections, then perhaps we should try to fix it within Julia?
> in the case of DD, the behavior changes based on the results of type inference significantly, such as printing or raising exceptions, leading to the worst-case scenario of a segfault.
I think this is also true more broadly, rather than just for DD. I would assume that empty collections generated by `map` are quite a common pattern, no? If creating empty collections can cause segfaults, it seems we should fix this in Base rather than just in the DD library.
It's even more prevalent because the length of a collection can depend on a value rather than a type, meaning that `Core.Compiler.return_type` always appears in a compiled branch of `map` and list comprehensions over a `Vector`, right?
Please recognize that the risks inherent in list comprehensions and `map` are of an entirely different dimension from the risks associated with DD.
List comprehensions and `map` themselves do not cause dangerous events like segfaults. Such events occur when a package makes bad assumptions about the types returned from `Core.Compiler.return_type` (including the types of the values returned by these basic functions) and builds its codebase on those assumptions. DD is an extreme case where such assumptions are made, and it is causing actual segfaults, so it needs to be fixed. But that doesn't mean we should stop using `map` and list comprehensions entirely, since we can use them safely if we don't make such assumptions.
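To make the distinction concrete, here is a hedged sketch (the names are illustrative) of a safe use versus the kind of assumption being warned about:

```julia
# Safe: the inferred element type merely labels an empty container;
# the program's behavior does not depend on what inference answered.
safe() = map(x -> x + 1, Int[])

# Risky (the DD-style pattern): branching on the inferred type turns an
# inference-internal detail into a semantic decision.
function risky(f, args...)
    T = Base.promote_op(f, map(typeof, args)...)
    T === Any && error("type-unstable call!")  # behavior depends on inference
    return f(args...)
end
```

If `Core.Compiler.return_type` ever answers differently across worlds, `safe` still returns an empty vector either way, whereas `risky` may take the wrong branch.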
Just to be sure, I confirmed that the issue is not with `any` by applying the following patch to DD. The same segfault occurs with DD using the patch below, indicating that the problem lies in DD's use of `Core.Compiler.return_type`, and that the relevant Julia base commit only brought it to light:
```diff
diff --git a/src/stabilization.jl b/src/stabilization.jl
index fdfb60f..310fe7d 100644
--- a/src/stabilization.jl
+++ b/src/stabilization.jl
@@ -115,7 +115,7 @@ function _stabilize_all(ex::Expr, downward_metadata::DownwardMetadata; kws...)
     @assert length(upward_metadata.unused_macros) == length(upward_metadata.macro_keys)
     new_ex = Expr(:macrocall, upward_metadata.unused_macros[end]..., inner_ex)
-    new_upward_metadata = UpwardMetadata(;
+    new_upward_metadata = UpwardMetadata(;
         matching_function = upward_metadata.matching_function,
         unused_macros = upward_metadata.unused_macros[1:end-1],
         macro_keys = upward_metadata.macro_keys[1:end-1],
diff --git a/src/utils.jl b/src/utils.jl
index 197d0be..19d2b7b 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -97,7 +97,35 @@ specializing_typeof(::Type{T}) where {T} = Type{T}
 specializing_typeof(::Val{T}) where {T} = Val{T}
 map_specializing_typeof(args...) = map(specializing_typeof, args)
-_promote_op(f, S::Type...) = Base.promote_op(f, S...)
+# `promote_op` without any usage of `any`
+let ex = Expr(:block)
+    function gen_case_n(n)
+        blk = Expr(:block)
+        ret = :(Tuple{})
+        for i = 1:n
+            push!(blk.args, :(args[$i] === Union{} && return Union{}))
+            push!(ret.args, :(args[$i]))
+        end
+        return :(if length(args) == $n
+            $blk
+            return $ret
+        end)
+    end
+    for i = 0:15
+        push!(ex.args, gen_case_n(i))
+    end
+    @eval function TupleOrBottom_without_any(args...)
+        $ex
+    end
+end
+
+function promote_op_without_any(f, S::Type...)
+    argT = TupleOrBottom_without_any(S...)
+    argT === Union{} && return Union{}
+    return Core.Compiler.return_type(f, argT)
+end
+
+_promote_op(f, S::Type...) = promote_op_without_any(f, S...)
 _promote_op(f, S::Tuple) = _promote_op(f, S...)
 @static if isdefined(Core, :kwcall)
     function _promote_op(
@@ -120,7 +148,7 @@ return false for `Union{}`, so that errors can propagate.
 # so we implement a workaround.
 @inline type_instability(::Type{Type{T}}) where {T} = type_instability(T)
-@generated function type_instability_limit_unions(
+function type_instability_limit_unions(
     ::Type{T}, ::Val{union_limit}
 ) where {T,union_limit}
     if T isa UnionAll
```
> But it doesn't mean we should stop using `map` and list comprehension entirely since we can use them safely if we don't make such assumptions.
What assumptions do you mean? Is it that DD dispatches on the return value of `Base.promote_op`? Isn't dispatching on the return type of a `map` or list comprehension exactly the same thing?
For example, will the following code potentially lead to a segfault?
```julia
g() = [i for i in 1:0]
h() = g() isa Vector{Int}
@generated function foo()
    if h()
        return :(1)
    else
        return :(2)
    end
end
```
I don't know the exact mechanism causing the segfault in DD. Your case is one example of making a bad assumption. This code itself may not cause a segfault, but since the value returned by `foo` is not well defined, it could potentially lead to dangerous behavior depending on its use case.
Consider that I could just rewrite DispatchDoctor to use

```julia
x = []
T = eltype(map(_ -> f(args...), x))
# equivalent to: T = Base.promote_op(f, typeof.(args)...)
```

I'm not doing anything fancy here; this type of code could be found in anybody's library. No Julia internals. Yet this code is implicitly using `Base.promote_op`.
This type of code is even used within the Julia standard library in several places.
If the rule is that `map` or list comprehensions must not be used by any code referenced from `@generated` functions, I don't think that is well known? Basically, I think `Core.Compiler.return_type` should be fixed.

Not to mention there are 1,400+ files that make explicit use of that symbol on GitHub: https://github.com/search?q=/Core.Compiler.return_type/+language:julia&type=code. I've already added a workaround for the main issue on 1.11.0, but it seems like this could be causing other similar issues, like https://github.com/JuliaLang/julia/issues/53847, https://github.com/JuliaLang/julia/issues/53843, https://github.com/JuliaLang/julia/issues/53848, https://github.com/JuliaLang/julia/issues/53761, ... which might be due to a similar phenomenon. This is certainly the one whose cause we've drilled into, but maybe the others are caused by the same Julia bug.
I guess this goes back to your earlier comment:
> List comprehensions and `map` reluctantly use `Core.Compiler.return_type` for cases where the iterator they return has 0 elements, but probably that use should be replaced with `Base.infer_return_type` (although this might not be possible from a performance perspective). Having said that, they only change the type of the iterator they return based on type inference, so the impact of `Core.Compiler.return_type` unreliability is minimal. Unfortunately, in the case of DD, the behavior changes based on the results of type inference significantly, such as printing or raising exceptions, leading to the worst-case scenario of a segfault. When `Core.Compiler.return_type` returns different results, it may hit branches that should not be executed or vice versa.
I think "unreliability is minimal" is pulling a lot of weight here. Any unreliability should be fixed at the source, rather than shooting the messenger. As for the comment:

> behavior changes based on the results of type inference significantly, such as printing or raising exceptions, leading to the worst-case scenario of a segfault
DD's usage actually feels like a very minimal change in behavior, no? DD just turns logging on or off based on the result of return-type inference. The usage of `Base.promote_op` won't actually change the output of a calculation, for example. All bugs flagged by DD are loud and easily detected. If anything, DD is probably minimally affected by bugs in `Core.Compiler.return_type` – it just calls it frequently enough that major issues like segfaults show up more often.
It's almost like the phenomenon caused by DD here is an "rr chaos mode" for the type-inference system. If a bug shows up in rr chaos mode, it's still a bug. Similarly, if a segfault shows up from DD's usage of `promote_op`, it's still a segfault that should be patched in Base.
Looking closer into this, it appears this segfaults because the DynamicExpressions module defined the return type of the simulator of `any` to depend on its own `Core.Compiler.return_type` result, which is non-computable by inference. Inference attempts to infer it anyway, and we have too many try/catch statements inside the compiler, so it quickly loses track of what went wrong after this hits a StackOverflow, and it starts to instead infer nonsense for `promote_op` of

```julia
Tuple{DynamicExpressions.NodeModule.var"###any_simulator#430", DynamicExpressions.NodeModule.var"#75#77"{DynamicExpressions.NodeUtilsModule.var"#13#15"}, DynamicExpressions.NodeModule.Node{Float32}}
```

I haven't been able to make an MWE that simulates this, however.
Very interesting, thanks for the note.
Maybe an MWE of this inference stack overflow would be the following?

```julia
foo() = [foo() for _ in 1:0]
```

Or would this not replicate the issue? (On mobile, so I can't check.)
I'm seeing some precompilation crashes on Julia 1.11-rc1 when precompiling DynamicExpressions.jl with DispatchDoctor in use on the package. (DispatchDoctor.jl is basically a package that calls `promote_op` on each function and uses that to flag type instabilities.) Here is the traceback:
versioninfo:
I installed Julia with juliaup. To reproduce this issue, you can run the following code:
I can prevent this error with the following PR on DispatchDoctor.jl: https://github.com/MilesCranmer/DispatchDoctor.jl/compare/094b1651eeef3fb2017be46a48f0da13724e1123~...b223a4d033dd3d17901879871c508ae33cfd550a. The PR basically amounts to changing some functions into `@generated` form. However, it doesn't seem like DispatchDoctor.jl or DynamicExpressions.jl is doing anything wrong; both before and after seem to be valid Julia code. Also, the downside of that PR is that it introduces a type instability in Zygote autodiff, and there doesn't seem to be a way around it that both prevents the segfault and eliminates the type instability.
I don't understand the conditions for reproducing this, so this is so far my only example. When I make various tweaks to `_promote_op` within DispatchDoctor.jl, I seem to end up with different segfaults, one of which is the `Unreachable reached` bug.

cc @avik-pal