Closed GunnarFarneback closed 2 years ago
Minimized from
using Colors, BenchmarkTools
x1 = fill(XYZ{Float32}(0,0,0), 1000, 1000);
x2 = fill(XYZ{Float32}(1,0,0), 1000, 1000);
@btime convert.(Lab, $x1);
@btime convert.(Lab, $x2);
As mentioned in https://github.com/JuliaGraphics/Colors.jl/issues/425, this problem is more serious than the impression given by the MWE.
In the case of MWE,
f()
will be eventually recompiled becausef()
is in a "shallow" place. However, in the Colors.jl,cbrt
is in "deep" places from the public API (e.g.colordiff
anddistinguishable_colors
), so it is difficult to avoid the problem.
This problem seems to have nothing to do with type instability. The problem occurs even if the concrete type is specified explicitly.
As far as I can tell, not only cbrt
but also exp
has the same problem. ~Their common point is Union{Float32, Float64}
. The problem doesn't occur with log
, which has separate log(::Float32)
and log(::Float64)
.~
The difference in the likelihood of problems depending on the type of function may be due to the difference in the cost of inlining. If so, this could be a variation of https://github.com/JuliaLang/julia/issues/34055.
IIRC, in the typical cases, this could be avoided by not running precompile()
.
However, at least in Julia v1.6.1, there seems to be a case where __precompile__(false)
is required.
There is a difference in whether the trigger is a precompilation or a JIT compilation, but the phenomenon is similar to #35537.
FYI, here is an MWE that shows that this is related to inlining.
module Issue35972
lg20(x) = x > 0.0 ? -f20(1.0 - x) : 0.0
lg21(x) = x > 0.0 ? -f21(1.0 - x) : 0.0
# inline-worthy
f20(x) = @evalpoly(x,
0.0, 1/1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9,
1/10, 1/11, 1/12, 1/13, 1/14, 1/15, 1/16, 1/17, 1/18, 1/19,
1/20)
# not inline-worthy
f21(x) = @evalpoly(x,
0.0, 1/1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9,
1/10, 1/11, 1/12, 1/13, 1/14, 1/15, 1/16, 1/17, 1/18, 1/19,
1/20, 1/21)
precompile(lg20, (Float64,))
precompile(lg21, (Float64,))
end # module
julia> using Issue35972
julia> x = rand(1_000_000);
julia> Issue35972.lg20.(x);
julia> Issue35972.lg21.(x);
julia> @time Issue35972.lg20.(x);
0.004678 seconds (4 allocations: 7.629 MiB)
julia> @time Issue35972.lg21.(x);
0.022517 seconds (2.00 M allocations: 38.147 MiB)
julia> versioninfo()
Julia Version 1.8.0-DEV.333
Commit c12e63fba3 (2021-08-11 02:09 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
In the demo in the OP (and every other one I've tried), running the second @btime
under @profile
and then viewing the results with Profile(; C=true)
contains quite a few sections that contain [unknown stackframe]
:
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 33 @Base/broadcast.jl:670; _broadcast_getindex_evalf
5╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 33 @M/src/M.jl:3; f
2╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 2 /home/tim/src/julia-master/cli/loader_exe.c:14; jl_get_pgcstack_static
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gc.c:1205; jl_gc_pool_alloc
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gc.c:880; maybe_collect
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gc.c:1229; jl_gc_pool_alloc
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gc.c:1246; jl_gc_pool_alloc
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gf.c:2250; jl_invoke
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gf.c:2251; jl_invoke
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gf.c:2252; jl_invoke
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 /home/tim/src/julia-master/src/gf.c:2224; _jl_invoke
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 20 [unknown stackframe]
2╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 2 @Base/special/cbrt.jl:0; cbrt(x::Float64)
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 2 @Base/special/cbrt.jl:146; cbrt(x::Float64)
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 2 @Base/special/cbrt.jl:68; _approx_cbrt
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 @Base/int.jl:87; +
1╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 1 @Base/int.jl:280; div
Also worth noting that the line number for the statement after the [unknown_stackframe]
is 0.
This is also interesting:
julia> using M, BenchmarkTools, Profile
julia> @code_llvm M.f(3.2)
; @ /tmp/pkgs/M/src/M.jl:3 within `f`
define double @julia_f_273(double %0) #0 {
top:
%1 = alloca {}*, align 8
%gcframe7 = alloca [3 x {}*], align 16
%gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
%2 = bitcast [3 x {}*]* %gcframe7 to i8*
call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
%thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #6
%ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
%ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
%pgcstack = load {}***, {}**** %ppgcstack, align 8
; ┌ @ operators.jl:378 within `>`
; │┌ @ float.jl:490 within `<` @ float.jl:444
%3 = bitcast [3 x {}*]* %gcframe7 to i64*
store i64 4, i64* %3, align 16
%4 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
%5 = bitcast {}** %4 to {}***
%6 = load {}**, {}*** %pgcstack, align 8
store {}** %6, {}*** %5, align 8
%7 = bitcast {}*** %pgcstack to {}***
store {}** %gcframe7.sub, {}*** %7, align 8
%8 = fcmp ule double %0, 0.000000e+00
; └└
br i1 %8, label %L8, label %L6
L6: ; preds = %top
%ptls_field3 = getelementptr inbounds {}**, {}*** %pgcstack, i64 2305843009213693954
%9 = bitcast {}*** %ptls_field3 to i8**
%ptls_load45 = load i8*, i8** %9, align 8
%10 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
%11 = bitcast {}* %10 to i64*
%12 = getelementptr inbounds i64, i64* %11, i64 -1
store atomic i64 140704265234768, i64* %12 unordered, align 8
%13 = bitcast {}* %10 to double*
store double %0, double* %13, align 8
%14 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
store {}* %10, {}** %14, align 16
store {}* %10, {}** %1, align 8
%15 = call nonnull {}* @jl_invoke({}* inttoptr (i64 140704310911760 to {}*), {}** nonnull %1, i32 1, {}* inttoptr (i64 140704435100720 to {}*))
%16 = bitcast {}* %15 to double*
%17 = load double, double* %16, align 8
%18 = load {}*, {}** %4, align 8
%19 = bitcast {}*** %pgcstack to {}**
store {}* %18, {}** %19, align 8
ret double %17
L8: ; preds = %top
%20 = load {}*, {}** %4, align 8
%21 = bitcast {}*** %pgcstack to {}**
store {}* %20, {}** %21, align 8
ret double 0.000000e+00
}
julia> x2 = fill(1.0, 1000000);
julia> @btime M.f.($x2);
24.999 ms (2000002 allocations: 38.15 MiB)
julia> @code_llvm M.f(3.2)
; @ /tmp/pkgs/M/src/M.jl:3 within `f`
define double @julia_f_457(double %0) #0 {
top:
; ┌ @ operators.jl:378 within `>`
; │┌ @ float.jl:490 within `<` @ float.jl:444
%1 = fcmp ule double %0, 0.000000e+00
; └└
br i1 %1, label %L8, label %L6
L6: ; preds = %top
%2 = call double @j_cbrt_459(double %0) #0
ret double %2
L8: ; preds = %top
ret double 0.000000e+00
}
Changing that first @code_llvm
to @code_typed
gives the good performance/codegen, regardless of whether one sets optimize=true
or optimize=false
.
Partial progress: the difference seems to come down to the fact that we get to https://github.com/JuliaLang/julia/blob/eb83c4d25b9c4a3696c08d2b5c7debfa9cc1a5ae/src/codegen.cpp#L3696-L3699 with handled = false
when it's loaded from the cache, but with handled = true
when we define the module in-session. This is because the CodeInstance isn't cached (it's not internal
) and so https://github.com/JuliaLang/julia/blob/eb83c4d25b9c4a3696c08d2b5c7debfa9cc1a5ae/src/codegen.cpp#L3654 returns jl_nothing
.
And more progress. With this diff that puts debugging statements in both codegen and type-inference:
diff --git a/base/compiler/typeinfer.jl b/base/compiler/typeinfer.jl
index a3895c380f..052c0a4baa 100644
--- a/base/compiler/typeinfer.jl
+++ b/base/compiler/typeinfer.jl
@@ -850,9 +850,12 @@ end
# compute (and cache) an inferred AST and return type
function typeinf_ext(interp::AbstractInterpreter, mi::MethodInstance)
method = mi.def::Method
+ println(mi)
for i = 1:2 # test-and-lock-and-test
i == 2 && ccall(:jl_typeinf_begin, Cvoid, ())
code = get(code_cache(interp), mi, nothing)
+ println(i)
+ println(code)
if code isa CodeInstance
# see if this code already exists in the cache
inf = code.inferred
@@ -895,14 +898,22 @@ function typeinf_ext(interp::AbstractInterpreter, mi::MethodInstance)
end
end
if ccall(:jl_get_module_infer, Cint, (Any,), method.module) == 0
+ println("retrieving ", mi)
return retrieve_code_info(mi)
end
+ println("inferring ", mi)
lock_mi_inference(interp, mi)
frame = InferenceState(InferenceResult(mi), #=cached=#true, interp)
frame === nothing && return nothing
typeinf(interp, frame)
ccall(:jl_typeinf_end, Cvoid, ())
frame.src.inferred || return nothing
+ print("for ", mi, ", the cache is ")
+ if isdefined(mi, :cache)
+ println(typeof(mi.cache.inferred))
+ else
+ println("not cached")
+ end
return frame.src
end
diff --git a/src/codegen.cpp b/src/codegen.cpp
index d05080ee42..022d1958bb 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -3622,6 +3622,7 @@ static jl_cgval_t emit_invoke(jl_codectx_t &ctx, jl_expr_t *ex, jl_value_t *rt)
bool handled = false;
jl_cgval_t result;
+ jl_(ex);
if (lival.constant) {
jl_method_instance_t *mi = (jl_method_instance_t*)lival.constant;
assert(jl_is_method_instance(mi));
@@ -3643,6 +3644,7 @@ static jl_cgval_t emit_invoke(jl_codectx_t &ctx, jl_expr_t *ex, jl_value_t *rt)
else {
jl_value_t *ci = ctx.params->lookup(mi, ctx.world, ctx.world); // TODO: need to use the right pair world here
jl_code_instance_t *codeinst = (jl_code_instance_t*)ci;
+ jl_printf(JL_STDOUT, "Is the CodeInstance lookup nothing? %d\n", ci == jl_nothing);
if (ci != jl_nothing && codeinst->invoke != jl_fptr_sparam) { // check if we know we definitely can't handle this specptr
if (codeinst->invoke == jl_fptr_const_return) {
result = mark_julia_const(codeinst->rettype_const);
with the debugging to Core.Compiler not built (hit Ctrl-C after make
ing the C code), doing the following:
using Issue35972, MethodAnalysis
cd("/home/tim/src/julia-master/base")
include("compiler/compiler.jl")
x = rand(1_000_000);
map(Issue35972.lg21, x); # map is less compile-intensive than broadcast
yields the following final section:
...
inferring collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any)
for collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any), the cache is Array{UInt8, 1}
Expr(:invoke, _collect(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}, Base.EltypeUnknown, Base.HasShape{1}) from _collect(Any, Any, Base.EltypeUnknown, Union{Base.HasLength, Base.HasShape{N} where N}), Base._collect, Core.Argument(n=2), Core.Argument(n=3), quote Base.EltypeUnknown() end, quote Base.HasShape{1}() end)
Is the CodeInstance lookup nothing? 0
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(75), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 1
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(24), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 1
evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple)
1
nothing
2
nothing
inferring evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple)
for evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), the cache is Array{UInt8, 1}
but then next:
mi = methodinstance(evalpoly, (Float64, NTuple{22,Float64}))
julia> mi.cache
Core.CodeInstance(MethodInstance for evalpoly(::Float64, ::NTuple{22, Float64}), #undef, 0x000000000000360b, 0xffffffffffffffff, Float64, #undef, nothing, true, true, Ptr{Nothing} @0x00007f0400263df0, Ptr{Nothing} @0x00007f0400263d50)
The inferred
field is not cached! And I'm a little surprised that codegen tries twice before executing inference, and then doesn't try again. Whereas if you define the module in-session, the final section looks like this:
inferring collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any)
for collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any), the cache is Array{UInt8, 1}
Expr(:invoke, _collect(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}, Base.EltypeUnknown, Base.HasShape{1}) from _collect(Any, Any, Base.EltypeUnknown, Union{Base.HasLength, Base.HasShape{N} where N}), Base._collect, Core.Argument(n=2), Core.Argument(n=3), quote Base.EltypeUnknown() end, quote Base.HasShape{1}() end)
Is the CodeInstance lookup nothing? 0
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(75), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 0
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(24), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 0
@snoopi_deep
turns out to be pretty good for debugging. From a section when you define the module in the current session:
├─ InferenceTimingNode: 0.000380/0.005951 on iterate(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}) with 6 direct children
│ ├─ InferenceTimingNode: 0.000037/0.000037 on getproperty(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}, ::Symbol) with 0 direct children
│ ├─ InferenceTimingNode: 0.000024/0.000024 on getproperty(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}, iter::Symbol) with 0 direct children
│ ├─ InferenceTimingNode: 0.000024/0.000024 on getproperty(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}, f::Symbol) with 0 direct children
│ ├─ InferenceTimingNode: 0.000027/0.000027 on getindex(::Tuple{Float64, Int64}, 1::Int64) with 0 direct children
│ ├─ InferenceTimingNode: 0.000135/0.005432 on Main.Issue35972.lg21(::Float64) with 1 direct children
│ │ └─ InferenceTimingNode: 0.001037/0.005297 on Main.Issue35972.f21(::Float64) with 22 direct children
│ │ ⋮
│ │
│ └─ InferenceTimingNode: 0.000028/0.000028 on getindex(::Tuple{Float64, Int64}, 2::Int64) with 0 direct children
vs when it's loaded from the precompile cache:
│ ├─ InferenceTimingNode: 0.001050/0.001475 on iterate(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}) with 5 direct children
│ │ ├─ InferenceTimingNode: 0.000109/0.000109 on getproperty(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}, ::Symbol) with 0 direct children
│ │ ├─ InferenceTimingNode: 0.000075/0.000075 on getproperty(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}, iter::Symbol) with 0 direct children
│ │ ├─ InferenceTimingNode: 0.000080/0.000080 on getproperty(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}, f::Symbol) with 0 direct children
│ │ ├─ InferenceTimingNode: 0.000083/0.000083 on getindex(::Tuple{Float64, Int64}, 1::Int64) with 0 direct children
│ │ └─ InferenceTimingNode: 0.000079/0.000079 on getindex(::Tuple{Float64, Int64}, 2::Int64) with 0 direct children
Guess: since it already has the MethodInstance it's (inappropriately) not inferring lg21
.
(A briefer version of this was also posted to slack internals)
To help decide how to fix this (either (1) run inference before codegen, or (2) cache the needed Base CodeInstance
s), I tallied the number of CodeInstances in each internal
category using this diff:
diff --git a/src/dump.c b/src/dump.c
index 8c9d68f518..ce42448159 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -505,6 +505,8 @@ enum METHOD_SERIALIZATION_MODE {
METHOD_EXTERNAL_MT = 2,
};
+static int codeinst_count[3]; // number of external, thunk, and workslist code instances, respectively
+
static void jl_serialize_value_(jl_serializer_state *s, jl_value_t *v, int as_literal) JL_GC_DISABLED
{
if (jl_serialize_generic(s, v)) {
@@ -691,6 +693,7 @@ static void jl_serialize_value_(jl_serializer_state *s, jl_value_t *v, int as_li
internal = 1;
else if (module_in_worklist(mi->def.method->module))
internal = 2;
+ codeinst_count[internal] += 1;
write_uint8(s->s, internal);
if (!internal) {
// also flag this in the backref table as special
@@ -2240,6 +2243,7 @@ JL_DLLEXPORT int jl_save_incremental(const char *fname, jl_array_t *worklist)
return 1;
}
JL_GC_PUSH2(&mod_array, &udeps);
+ codeinst_count[0] = codeinst_count[1] = codeinst_count[2] = 0;
mod_array = jl_get_loaded_modules();
serializer_worklist = worklist;
@@ -2349,6 +2353,7 @@ JL_DLLEXPORT int jl_save_incremental(const char *fname, jl_array_t *worklist)
}
write_int32(&f, 0); // mark the end of the source text
ios_close(&f);
+ jl_printf(JL_STDOUT, "external: %d; module-thunk: %d; worklist: %d\n", codeinst_count[0], codeinst_count[1], codeinst_count[2]);
JL_GC_POP();
return 0;
Results: https://gist.github.com/timholy/502a5a545a69f492b6180050a0b745e6 (the packages were chosen as ones that now have substantial precompilation)
I don't fully understand the precompilation, but empirically it seems to me that the inference is context dependent. If external methods are cached by individual packages, will the behavior change depending on the order of loading?
I agree with the idea of caching more CodeInstance
s to improve the latency problem. However, I think that some kind of restriction, such as "strict" type stability, will be required.
So, IMO, it would be better to achieve (1) first.
xref #42016
This still seems to happen with FunctionWrappers.
Here's my test package:
module FWAllocs
using PrecompileTools
using FunctionWrappers
function test_precomp(N)
fw = FunctionWrappers.FunctionWrapper{Float64, Tuple{Float64}}(cos)
res = 0.0
for i in 1:N
res += fw(i*0.1)
end
return res
end
function test_no_precomp(N)
fw = FunctionWrappers.FunctionWrapper{Float64, Tuple{Float64}}(cos)
res = 0.0
for i in 1:N
res += fw(i*0.1)
end
return res
end
@setup_workload begin
@compile_workload begin
test_precomp(10)
end
end
end
and here's what happens on Julia 1.10.4:
julia> using FWAllocs
julia> @time FWAllocs.test_precomp(10)
0.053401 seconds (96.14 k allocations: 5.794 MiB, 99.86% compilation time)
8.177847573818267
julia> @time FWAllocs.test_precomp(10)
0.000004 seconds (22 allocations: 376 bytes)
8.177847573818267
julia> @time FWAllocs.test_precomp(20)
0.000011 seconds (42 allocations: 696 bytes)
8.377322108212503
julia> @time FWAllocs.test_no_precomp(20)
0.000007 seconds (2 allocations: 56 bytes)
8.377322108212503
julia>
Also, is it odd that test_precomp
needs to compile on first use, but test_no_precomp
does not?
I confirmed that the FanctionWrappers
issue was reproduced on nightly too.
julia> versioninfo()
Julia Version 1.12.0-DEV.738
Commit 67c9989e83 (2024-06-15 01:40 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LLVM: libLLVM-17.0.6 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Since the original problem with this issue seems to have been resolved, I think there is something else causing it. This issue is somewhat old, so it might be better to file it as a new issue.
Of course, since FunctionWrappers
deals with pointers, the precompiled cache should not be used as is. So, it is reasonable in itself that the recompilation takes place on the first use of test_precomp
.
If I make a package
M
with this content ofsrc/M.jl
:and run
the result is
Without
precompile(f, (Float64,))
the excessive allocations go away.