JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.64k stars 5.48k forks source link

Allocates when precompiled #35972

Closed GunnarFarneback closed 2 years ago

GunnarFarneback commented 4 years ago

If I make a package M with this content of src/M.jl:

module M
f(x) = x > 0 ? cbrt(x) : 0.0
precompile(f, (Float64,))
end

and run

using M, BenchmarkTools
x1 = fill(0.0, 1000000);
x2 = fill(1.0, 1000000);
@btime M.f.($x1);
@btime M.f.($x2);

the result is

  842.301 μs (2 allocations: 7.63 MiB)
  20.586 ms (2000002 allocations: 38.15 MiB)

Without precompile(f, (Float64,)) the excessive allocations go away.

GunnarFarneback commented 4 years ago

Minimized from

using Colors, BenchmarkTools
x1 = fill(XYZ{Float32}(0,0,0), 1000, 1000);
x2 = fill(XYZ{Float32}(1,0,0), 1000, 1000);
@btime convert.(Lab, $x1);
@btime convert.(Lab, $x2);
kimikage commented 4 years ago

As mentioned in https://github.com/JuliaGraphics/Colors.jl/issues/425, this problem is more serious than the impression given by the MWE.

In the case of MWE, f() will be eventually recompiled because f() is in a "shallow" place. However, in the Colors.jl, cbrt is in "deep" places from the public API (e.g. colordiff and distinguishable_colors), so it is difficult to avoid the problem.

This problem seems to have nothing to do with type instability. The problem occurs even if the concrete type is specified explicitly.

As far as I can tell, not only cbrt but also exp has the same problem. ~Their common point is Union{Float32, Float64}. The problem doesn't occur with log, which has separate log(::Float32) and log(::Float64).~

kimikage commented 3 years ago

The difference in the likelihood of problems depending on the type of function may be due to the difference in the cost of inlining. If so, this could be a variation of https://github.com/JuliaLang/julia/issues/34055.

kimikage commented 3 years ago

IIRC, in the typical cases, this could be avoided by not running precompile().

However, at least in Julia v1.6.1, there seems to be a case where __precompile__(false) is required.

kimikage commented 3 years ago

There is a difference in whether the trigger is a precompilation or a JIT compilation, but the phenomenon is similar to #35537.

kimikage commented 3 years ago

FYI, here is an MWE that shows that this is related to inlining.

module Issue35972

lg20(x) = x > 0.0 ? -f20(1.0 - x) : 0.0
lg21(x) = x > 0.0 ? -f21(1.0 - x) : 0.0

# inline-worthy
f20(x) = @evalpoly(x,
    0.0, 1/1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9,
    1/10, 1/11, 1/12, 1/13, 1/14, 1/15, 1/16, 1/17, 1/18, 1/19,
    1/20)

# not inline-worthy
f21(x) = @evalpoly(x,
    0.0, 1/1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9,
    1/10, 1/11, 1/12, 1/13, 1/14, 1/15, 1/16, 1/17, 1/18, 1/19,
    1/20, 1/21)

precompile(lg20, (Float64,))
precompile(lg21, (Float64,))

end # module
julia> using Issue35972

julia> x = rand(1_000_000);

julia> Issue35972.lg20.(x);

julia> Issue35972.lg21.(x);

julia> @time Issue35972.lg20.(x);
  0.004678 seconds (4 allocations: 7.629 MiB)

julia> @time Issue35972.lg21.(x);
  0.022517 seconds (2.00 M allocations: 38.147 MiB)

julia> versioninfo()
Julia Version 1.8.0-DEV.333
Commit c12e63fba3 (2021-08-11 02:09 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
timholy commented 3 years ago

In the demo in the OP (and every other one I've tried), running the second @btime under @profile and then viewing the results with Profile(; C=true) contains quite a few sections that contain [unknown stackframe]:

   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 33   @Base/broadcast.jl:670; _broadcast_getindex_evalf
  5╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  33   @M/src/M.jl:3; f
  2╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   2    /home/tim/src/julia-master/cli/loader_exe.c:14; jl_get_pgcstack_static
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1    /home/tim/src/julia-master/src/gc.c:1205; jl_gc_pool_alloc
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    1    /home/tim/src/julia-master/src/gc.c:880; maybe_collect
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1    /home/tim/src/julia-master/src/gc.c:1229; jl_gc_pool_alloc
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1    /home/tim/src/julia-master/src/gc.c:1246; jl_gc_pool_alloc
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1    /home/tim/src/julia-master/src/gf.c:2250; jl_invoke
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1    /home/tim/src/julia-master/src/gf.c:2251; jl_invoke
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1    /home/tim/src/julia-master/src/gf.c:2252; jl_invoke
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    1    /home/tim/src/julia-master/src/gf.c:2224; _jl_invoke
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   20   [unknown stackframe]
  2╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    2    @Base/special/cbrt.jl:0; cbrt(x::Float64)
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    2    @Base/special/cbrt.jl:146; cbrt(x::Float64)
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     2    @Base/special/cbrt.jl:68; _approx_cbrt
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1    @Base/int.jl:87; +
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1    @Base/int.jl:280; div

Also worth noting that the line number for the statement after the [unknown_stackframe] is 0.

timholy commented 3 years ago

This is also interesting:

julia> using M, BenchmarkTools, Profile

julia> @code_llvm M.f(3.2)
;  @ /tmp/pkgs/M/src/M.jl:3 within `f`
define double @julia_f_273(double %0) #0 {
top:
  %1 = alloca {}*, align 8
  %gcframe7 = alloca [3 x {}*], align 16
  %gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe7 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #6
  %ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
  %ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
  %pgcstack = load {}***, {}**** %ppgcstack, align 8
; ┌ @ operators.jl:378 within `>`
; │┌ @ float.jl:490 within `<` @ float.jl:444
    %3 = bitcast [3 x {}*]* %gcframe7 to i64*
    store i64 4, i64* %3, align 16
    %4 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
    %5 = bitcast {}** %4 to {}***
    %6 = load {}**, {}*** %pgcstack, align 8
    store {}** %6, {}*** %5, align 8
    %7 = bitcast {}*** %pgcstack to {}***
    store {}** %gcframe7.sub, {}*** %7, align 8
    %8 = fcmp ule double %0, 0.000000e+00
; └└
  br i1 %8, label %L8, label %L6

L6:                                               ; preds = %top
  %ptls_field3 = getelementptr inbounds {}**, {}*** %pgcstack, i64 2305843009213693954
  %9 = bitcast {}*** %ptls_field3 to i8**
  %ptls_load45 = load i8*, i8** %9, align 8
  %10 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
  %11 = bitcast {}* %10 to i64*
  %12 = getelementptr inbounds i64, i64* %11, i64 -1
  store atomic i64 140704265234768, i64* %12 unordered, align 8
  %13 = bitcast {}* %10 to double*
  store double %0, double* %13, align 8
  %14 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
  store {}* %10, {}** %14, align 16
  store {}* %10, {}** %1, align 8
  %15 = call nonnull {}* @jl_invoke({}* inttoptr (i64 140704310911760 to {}*), {}** nonnull %1, i32 1, {}* inttoptr (i64 140704435100720 to {}*))
  %16 = bitcast {}* %15 to double*
  %17 = load double, double* %16, align 8
  %18 = load {}*, {}** %4, align 8
  %19 = bitcast {}*** %pgcstack to {}**
  store {}* %18, {}** %19, align 8
  ret double %17

L8:                                               ; preds = %top
  %20 = load {}*, {}** %4, align 8
  %21 = bitcast {}*** %pgcstack to {}**
  store {}* %20, {}** %21, align 8
  ret double 0.000000e+00
}

julia> x2 = fill(1.0, 1000000);

julia> @btime M.f.($x2);
  24.999 ms (2000002 allocations: 38.15 MiB)

julia> @code_llvm M.f(3.2)
;  @ /tmp/pkgs/M/src/M.jl:3 within `f`
define double @julia_f_457(double %0) #0 {
top:
; ┌ @ operators.jl:378 within `>`
; │┌ @ float.jl:490 within `<` @ float.jl:444
    %1 = fcmp ule double %0, 0.000000e+00
; └└
  br i1 %1, label %L8, label %L6

L6:                                               ; preds = %top
  %2 = call double @j_cbrt_459(double %0) #0
  ret double %2

L8:                                               ; preds = %top
  ret double 0.000000e+00
}

Changing that first @code_llvm to @code_typed gives the good performance/codegen, regardless of whether one sets optimize=true or optimize=false.

timholy commented 3 years ago

Partial progress: the difference seems to come down to the fact that we get to https://github.com/JuliaLang/julia/blob/eb83c4d25b9c4a3696c08d2b5c7debfa9cc1a5ae/src/codegen.cpp#L3696-L3699 with handled = false when it's loaded from the cache, but with handled = true when we define the module in-session. This is because the CodeInstance isn't cached (it's not internal) and so https://github.com/JuliaLang/julia/blob/eb83c4d25b9c4a3696c08d2b5c7debfa9cc1a5ae/src/codegen.cpp#L3654 returns jl_nothing.

timholy commented 3 years ago

And more progress. With this diff that puts debugging statements in both codegen and type-inference:

diff --git a/base/compiler/typeinfer.jl b/base/compiler/typeinfer.jl
index a3895c380f..052c0a4baa 100644
--- a/base/compiler/typeinfer.jl
+++ b/base/compiler/typeinfer.jl
@@ -850,9 +850,12 @@ end
 # compute (and cache) an inferred AST and return type
 function typeinf_ext(interp::AbstractInterpreter, mi::MethodInstance)
     method = mi.def::Method
+    println(mi)
     for i = 1:2 # test-and-lock-and-test
         i == 2 && ccall(:jl_typeinf_begin, Cvoid, ())
         code = get(code_cache(interp), mi, nothing)
+        println(i)
+        println(code)
         if code isa CodeInstance
             # see if this code already exists in the cache
             inf = code.inferred
@@ -895,14 +898,22 @@ function typeinf_ext(interp::AbstractInterpreter, mi::MethodInstance)
         end
     end
     if ccall(:jl_get_module_infer, Cint, (Any,), method.module) == 0
+        println("retrieving ", mi)
         return retrieve_code_info(mi)
     end
+    println("inferring ", mi)
     lock_mi_inference(interp, mi)
     frame = InferenceState(InferenceResult(mi), #=cached=#true, interp)
     frame === nothing && return nothing
     typeinf(interp, frame)
     ccall(:jl_typeinf_end, Cvoid, ())
     frame.src.inferred || return nothing
+    print("for ", mi, ", the cache is ")
+    if isdefined(mi, :cache)
+        println(typeof(mi.cache.inferred))
+    else
+        println("not cached")
+    end
     return frame.src
 end

diff --git a/src/codegen.cpp b/src/codegen.cpp
index d05080ee42..022d1958bb 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -3622,6 +3622,7 @@ static jl_cgval_t emit_invoke(jl_codectx_t &ctx, jl_expr_t *ex, jl_value_t *rt)

     bool handled = false;
     jl_cgval_t result;
+    jl_(ex);
     if (lival.constant) {
         jl_method_instance_t *mi = (jl_method_instance_t*)lival.constant;
         assert(jl_is_method_instance(mi));
@@ -3643,6 +3644,7 @@ static jl_cgval_t emit_invoke(jl_codectx_t &ctx, jl_expr_t *ex, jl_value_t *rt)
         else {
             jl_value_t *ci = ctx.params->lookup(mi, ctx.world, ctx.world); // TODO: need to use the right pair world here
             jl_code_instance_t *codeinst = (jl_code_instance_t*)ci;
+            jl_printf(JL_STDOUT, "Is the CodeInstance lookup nothing? %d\n", ci == jl_nothing);
             if (ci != jl_nothing && codeinst->invoke != jl_fptr_sparam) { // check if we know we definitely can't handle this specptr
                 if (codeinst->invoke == jl_fptr_const_return) {
                     result = mark_julia_const(codeinst->rettype_const);

with the debugging to Core.Compiler not built (hit Ctrl-C after makeing the C code), doing the following:

using Issue35972, MethodAnalysis
cd("/home/tim/src/julia-master/base")
include("compiler/compiler.jl")
x = rand(1_000_000);
map(Issue35972.lg21, x);    # map is less compile-intensive than broadcast

yields the following final section:

...
inferring collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any)
for collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any), the cache is Array{UInt8, 1}
Expr(:invoke, _collect(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}, Base.EltypeUnknown, Base.HasShape{1}) from _collect(Any, Any, Base.EltypeUnknown, Union{Base.HasLength, Base.HasShape{N} where N}), Base._collect, Core.Argument(n=2), Core.Argument(n=3), quote Base.EltypeUnknown() end, quote Base.HasShape{1}() end)
Is the CodeInstance lookup nothing? 0
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(75), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 1
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(24), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 1
evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple)
1
nothing
2
nothing
inferring evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple)
for evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), the cache is Array{UInt8, 1}

but then next:

mi = methodinstance(evalpoly, (Float64, NTuple{22,Float64}))
julia> mi.cache
Core.CodeInstance(MethodInstance for evalpoly(::Float64, ::NTuple{22, Float64}), #undef, 0x000000000000360b, 0xffffffffffffffff, Float64, #undef, nothing, true, true, Ptr{Nothing} @0x00007f0400263df0, Ptr{Nothing} @0x00007f0400263d50)

The inferred field is not cached! And I'm a little surprised that codegen tries twice before executing inference, and then doesn't try again. Whereas if you define the module in-session, the final section looks like this:

inferring collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any)
for collect_similar(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}) from collect_similar(Any, Any), the cache is Array{UInt8, 1}
Expr(:invoke, _collect(Array{Float64, 1}, Base.Generator{Array{Float64, 1}, typeof(Issue35972.lg21)}, Base.EltypeUnknown, Base.HasShape{1}) from _collect(Any, Any, Base.EltypeUnknown, Union{Base.HasLength, Base.HasShape{N} where N}), Base._collect, Core.Argument(n=2), Core.Argument(n=3), quote Base.EltypeUnknown() end, quote Base.HasShape{1}() end)
Is the CodeInstance lookup nothing? 0
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(75), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 0
Expr(:invoke, evalpoly(Float64, Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}) from evalpoly(Any, Tuple), Base.Math.evalpoly, SSAValue(24), (0, 1, 0.5, 0.333333, 0.25, 0.2, 0.166667, 0.142857, 0.125, 0.111111, 0.1, 0.0909091, 0.0833333, 0.0769231, 0.0714286, 0.0666667, 0.0625, 0.0588235, 0.0555556, 0.0526316, 0.05, 0.047619))
Is the CodeInstance lookup nothing? 0
timholy commented 3 years ago

@snoopi_deep turns out to be pretty good for debugging. From a section when you define the module in the current session:

      ├─ InferenceTimingNode: 0.000380/0.005951 on iterate(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}) with 6 direct children
      │  ├─ InferenceTimingNode: 0.000037/0.000037 on getproperty(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}, ::Symbol) with 0 direct children
      │  ├─ InferenceTimingNode: 0.000024/0.000024 on getproperty(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}, iter::Symbol) with 0 direct children
      │  ├─ InferenceTimingNode: 0.000024/0.000024 on getproperty(::Base.Generator{Vector{Float64}, typeof(Main.Issue35972.lg21)}, f::Symbol) with 0 direct children
      │  ├─ InferenceTimingNode: 0.000027/0.000027 on getindex(::Tuple{Float64, Int64}, 1::Int64) with 0 direct children
      │  ├─ InferenceTimingNode: 0.000135/0.005432 on Main.Issue35972.lg21(::Float64) with 1 direct children
      │  │  └─ InferenceTimingNode: 0.001037/0.005297 on Main.Issue35972.f21(::Float64) with 22 direct children
      │  │     ⋮
      │  │     
      │  └─ InferenceTimingNode: 0.000028/0.000028 on getindex(::Tuple{Float64, Int64}, 2::Int64) with 0 direct children

vs when it's loaded from the precompile cache:

│     ├─ InferenceTimingNode: 0.001050/0.001475 on iterate(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}) with 5 direct children
│     │  ├─ InferenceTimingNode: 0.000109/0.000109 on getproperty(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}, ::Symbol) with 0 direct children
│     │  ├─ InferenceTimingNode: 0.000075/0.000075 on getproperty(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}, iter::Symbol) with 0 direct children
│     │  ├─ InferenceTimingNode: 0.000080/0.000080 on getproperty(::Base.Generator{Vector{Float64}, typeof(Issue35972.lg21)}, f::Symbol) with 0 direct children
│     │  ├─ InferenceTimingNode: 0.000083/0.000083 on getindex(::Tuple{Float64, Int64}, 1::Int64) with 0 direct children
│     │  └─ InferenceTimingNode: 0.000079/0.000079 on getindex(::Tuple{Float64, Int64}, 2::Int64) with 0 direct children

Guess: since it already has the MethodInstance it's (inappropriately) not inferring lg21.

timholy commented 3 years ago

(A briefer version of this was also posted to slack internals)

To help decide how to fix this (either (1) run inference before codegen, or (2) cache the needed Base CodeInstances), I tallied the number of CodeInstances in each internal category using this diff:

diff --git a/src/dump.c b/src/dump.c
index 8c9d68f518..ce42448159 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -505,6 +505,8 @@ enum METHOD_SERIALIZATION_MODE {
     METHOD_EXTERNAL_MT = 2,
 };

+static int codeinst_count[3];  // number of external, thunk, and workslist code instances, respectively
+
 static void jl_serialize_value_(jl_serializer_state *s, jl_value_t *v, int as_literal) JL_GC_DISABLED
 {
     if (jl_serialize_generic(s, v)) {
@@ -691,6 +693,7 @@ static void jl_serialize_value_(jl_serializer_state *s, jl_value_t *v, int as_li
             internal = 1;
         else if (module_in_worklist(mi->def.method->module))
             internal = 2;
+        codeinst_count[internal] += 1;
         write_uint8(s->s, internal);
         if (!internal) {
             // also flag this in the backref table as special
@@ -2240,6 +2243,7 @@ JL_DLLEXPORT int jl_save_incremental(const char *fname, jl_array_t *worklist)
         return 1;
     }
     JL_GC_PUSH2(&mod_array, &udeps);
+    codeinst_count[0] = codeinst_count[1] = codeinst_count[2] = 0;
     mod_array = jl_get_loaded_modules();

     serializer_worklist = worklist;
@@ -2349,6 +2353,7 @@ JL_DLLEXPORT int jl_save_incremental(const char *fname, jl_array_t *worklist)
     }
     write_int32(&f, 0); // mark the end of the source text
     ios_close(&f);
+    jl_printf(JL_STDOUT, "external: %d; module-thunk: %d; worklist: %d\n", codeinst_count[0], codeinst_count[1], codeinst_count[2]);
     JL_GC_POP();

     return 0;

Results: https://gist.github.com/timholy/502a5a545a69f492b6180050a0b745e6 (the packages were chosen as ones that now have substantial precompilation)

kimikage commented 3 years ago

I don't fully understand the precompilation, but empirically it seems to me that the inference is context dependent. If external methods are cached by individual packages, will the behavior change depending on the order of loading?

I agree with the idea of caching more CodeInstances to improve the latency problem. However, I think that some kind of restriction, such as "strict" type stability, will be required.

So, IMO, it would be better to achieve (1) first.

timholy commented 3 years ago

xref #42016

amilsted commented 4 months ago

This still seems to happen with FunctionWrappers.

Here's my test package:

module FWAllocs

using PrecompileTools
using FunctionWrappers

function test_precomp(N)
    fw = FunctionWrappers.FunctionWrapper{Float64, Tuple{Float64}}(cos)
    res = 0.0
    for i in 1:N
        res += fw(i*0.1)
    end
    return res
end

function test_no_precomp(N)
    fw = FunctionWrappers.FunctionWrapper{Float64, Tuple{Float64}}(cos)
    res = 0.0
    for i in 1:N
        res += fw(i*0.1)
    end
    return res
end

@setup_workload begin
    @compile_workload begin
        test_precomp(10)
    end
end

end

and here's what happens on Julia 1.10.4:

julia> using FWAllocs

julia> @time FWAllocs.test_precomp(10)
  0.053401 seconds (96.14 k allocations: 5.794 MiB, 99.86% compilation time)
8.177847573818267

julia> @time FWAllocs.test_precomp(10)
  0.000004 seconds (22 allocations: 376 bytes)
8.177847573818267

julia> @time FWAllocs.test_precomp(20)
  0.000011 seconds (42 allocations: 696 bytes)
8.377322108212503

julia> @time FWAllocs.test_no_precomp(20)
  0.000007 seconds (2 allocations: 56 bytes)
8.377322108212503

julia> 

Also, is it odd that test_precomp needs to compile on first use, but test_no_precomp does not?

kimikage commented 4 months ago

I confirmed that the FanctionWrappers issue was reproduced on nightly too.

julia> versioninfo()
Julia Version 1.12.0-DEV.738
Commit 67c9989e83 (2024-06-15 01:40 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Since the original problem with this issue seems to have been resolved, I think there is something else causing it. This issue is somewhat old, so it might be better to file it as a new issue.

kimikage commented 4 months ago

Of course, since FunctionWrappers deals with pointers, the precompiled cache should not be used as is. So, it is reasonable in itself that the recompilation takes place on the first use of test_precomp.