Along the lines of this issue and #38080, the new SnoopCompile 2.2.0 has a not-yet-documented "profile-guided deoptimization" toolset. Here's a demo running on Franklin:

Each dot is a Method. You can see the inference time (including for the MethodInstance's callees) vs the amount of "self" time (not including callees) it spent at runtime. Some of the ones with more inference time are very cheap to infer once but are inferred a whole bunch of times for different constants (e.g., `getproperty`, the yellow dot in the plot). But even the top-most left dot has been inferred for 38 different type+const combinations, even though there are only 6 MethodInstances. I haven't broken out how much constprop accounts for that, but it's pretty interesting to see.
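For anyone wanting to reproduce this kind of plot, a rough sketch of the data collection follows; `workload` is just a placeholder, and the final pairing/plotting step is left to SnoopCompile's not-yet-documented PGDS tooling rather than spelled out here:

```julia
using SnoopCompileCore, Profile

workload() = sum(rand(100))   # placeholder; substitute the code you want to analyze

# Inference side: a tree of InferenceTimingNodes, one per (re-)inference.
tinf = @snoopi_deep workload()

# Runtime side: an ordinary sampling profile of the same workload.
Profile.clear()
@profile for _ in 1:1000
    workload()
end

using SnoopCompile
# SnoopCompile's profile-guided deoptimization tooling pairs `tinf` with the
# collected Profile data to produce the inference-time vs. run-time scatter above.
```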
Here's a concrete example of the impact of this issue: the kwfunc for this method in DataFrames ends up taking a whopping 23s of inference time while running the DataFrames tests (the total test suite is ~700s, so inference on just that one method is ~3% of the total time). It gets inferred 1778 separate times. Of the first 4 arguments of the kwfunc,

(::DataFrames.var"#_combine_prepare##kw")(::Any, ::typeof(DataFrames._combine_prepare), gd::GroupedDataFrame, cs)

the number of distinct slottypes (including `Const`, `PartialStruct`, etc.) is:
705
1
1
573
So the Boolean kwargs account for a ton of re-inference, and despite the `@nospecialize` annotation on `cs` it ends up triggering a heck of a lot of inference. It's worth pointing out that
julia> m.nospecialize
8
where `m` is this method, so the `@nospecialize` does propagate to argument 4 of the kwfunc, but the constant-prop on the NamedTuple seems to force re-inference.
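(Spelling out the bitmask arithmetic behind that statement, assuming the convention that bit `i` corresponds to argument `i + 1`:)

```julia
julia> 8 == 1 << 3   # bit 3 set, i.e. argument 4 (`cs`) is marked @nospecialize
true
```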
A small sample of the diversity of the NamedTuple is
julia> arg1
Base.IdSet{Any} with 705 elements:
Core.PartialStruct(NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Core.Const(true), Bool, Core.Const(true), Core.Const(false), Core.Const(true)])
Core.PartialStruct(NamedTuple{(:copycols, :keepkeys, :ungroup, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Bool, Core.Const(true), Core.Const(true), Core.Const(true), Bool])
Core.PartialStruct(NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Bool, Bool, Core.Const(true), Core.Const(false), Bool])
Core.PartialStruct(NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Core.Const(true), Bool, Core.Const(true), Core.Const(false), Core.Const(true)])
⋮
and that of the `cs` argument:
julia> arg4
Base.IdSet{Any} with 573 elements:
Tuple{Pair{Vector{Symbol}, Pair{Main.TestGrouping.var"#367#376"{Int64, Int64, UnitRange{Int64}}, Symbol}}}
Tuple{Pair{Vector{Int64}, Pair{Main.TestGrouping.var"#198#275", Symbol}}}
Core.PartialStruct(Tuple{Colon, Vararg{Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}, N} where N}, Any[Core…
Tuple{Pair{Symbol, Symbol}, Pair{Symbol, ByRow{typeof(sin)}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#510#522", Symbol}}, Pair{Symbol, Pair{Main.TestGrouping.var"#511#523", Symbol}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#573#587"{typeof(sum)}, Symbol}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#581#595"{ComposedFunction{typeof(prod), typeof(skipmissing)}}, Symbol}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#278#293"{typeof(maximum)}, Symbol}}}
Core.Const((Main.TestGrouping.var"#633#673"(),))
⋮
Here's the sorted set (essentially the cumulative distribution) of inference times for each of these runs; you can see there are quite a few that take non-negligible time.
Now that #38080 is merged, I'd like to propose that we turn off const-prop on keyword arguments by default; anyone who wants it can annotate with `@aggressive_constprop`. Julia packages are chock full of methods like `f(args...; allow_failure=false)` where it's completely useless to const-specialize on `allow_failure`. (See the DataFrames example above for a real-world cost.)
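To make the shape of the problem easy to reproduce, here's a tiny hypothetical example (not taken from DataFrames), measured with the same `@snoopi_deep`/`flatten` tools used elsewhere in this thread; each distinct literal combination of the Bool kwarg can show up as its own inference entry:

```julia
using SnoopCompileCore

# Hypothetical method with a Bool kwarg that callers nearly always pass as a literal.
g(args...; allow_failure::Bool=false) = allow_failure ? 0 : length(args)

tinf = @snoopi_deep begin
    g(1)                         # default kwarg
    g(1; allow_failure=true)     # literal `true`
    g(1; allow_failure=false)    # literal `false`
end

using SnoopCompile
itimes = flatten(tinf)   # one InferenceTiming entry per (re-)inference, Const-specializations included
length(itimes)
```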
Just investigated a case where precompilation was not as effective as I hoped it would be and reduced it until I realized it was this issue. Here is the MWE:
module Foo
bar(x::Int) = hcat(rand())
foo(x) = bar(1)
#foo(x) = bar(x) # this precompiles just fine
@assert precompile(foo, (Int,))
end # module
Then, in a fresh session:
julia> first(methods(hcat, Tuple{Number})).specializations # none yet
svec()
julia> using Foo
julia> first(methods(Foo.foo)).specializations
svec(MethodInstance for Foo.foo(::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> first(methods(Foo.bar)).specializations
svec(MethodInstance for Foo.bar(::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> first(methods(hcat, Tuple{Number})).specializations # contains hcat(::Float64) from Foo
svec(MethodInstance for hcat(::Float64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> using SnoopCompileCore
julia> tinf = @snoopi_deep Foo.foo(1);
julia> using SnoopCompile, AbstractTrees
julia> print_tree(tinf, maxdepth=3)
InferenceTimingNode: 0.018412/0.044012 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000297/0.025600 on Foo.foo(::Int64) with 1 direct children
└─ InferenceTimingNode: 0.000358/0.025303 on Foo.bar(1::Int64) with 2 direct children
├─ InferenceTimingNode: 0.000404/0.003161 on rand() with 1 direct children
│ ⋮
│
└─ InferenceTimingNode: 0.000777/0.021783 on hcat(::Float64) with 13 direct children
⋮
So precompilation has generated the expected `MethodInstance`s, but they get inferred again. Now I understand why `bar(1)` might be inferred if only `bar(::Int)` is stored in the precompile file. (Is it?) But note that also `hcat(::Float64)` gets inferred again, although it was never part of the const-propagation itself. That looks completely unnecessary.
> if only `bar(::Int)` is stored in the precompile file. (Is it?)
Funny you should ask. I got tired of not really understanding this well myself, so I am partway through writing PrecompileCacheUtils.jl, with the aim of providing a non-instantiating mirror of the content in the .ji file (essentially `src/dump.c`'s `jl_deserialize` pipeline, but without instantiating anything). But it's not working yet.
We could add a macro `@no_constprop` (sort of the opposite of #38080), but I am hesitant to suggest that too strongly, since every annotation we add makes Julia more complicated for the developer. I'd much rather find a systematic solution to this problem, if possible.

Meanwhile, to get a sense for how common this issue is, I'm contemplating special coloration in the flamegraphs to highlight constprop (maybe orange, just like we use for gc in performance profiling). Thoughts?
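For reference, the workflow such coloration would extend looks roughly like this (assuming ProfileView as the viewer, and reusing the `Foo` MWE above purely for illustration):

```julia
using SnoopCompileCore, SnoopCompile, ProfileView

tinf = @snoopi_deep Foo.foo(1)   # as in the MWE above (fresh session)
fg = flamegraph(tinf)            # FlameGraphs.jl graph of the inference tree
ProfileView.view(fg)             # frames dominated by constprop could get special coloration here
```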
I'm still trying to get a handle on what the problem is, exactly. To systematically go through different cases, I slightly changed the `Foo` module above to contain
foo() = bar(1)
foo(x) = bar(x)
and then added `precompile`s for either of them or both, and also ran either of them or both under `@snoopi_deep`. I was looking at whether `hcat(::Float64)` had to be (re-)inferred. (To summarize again, `foo()` involves constant propagation, `foo(::Int)` does not, but either way, `hcat` is never called with a constant.)
Here is what I got (:x: means `hcat(::Float64)` was inferred):

| | run `foo()` | run `foo(1)` | run `foo(); foo(1)` | run `foo(1); foo()` |
|---|---|---|---|---|
| no precompile | :x: | :x: | :x: :heavy_check_mark: | :x: :heavy_check_mark: |
| precompile `foo()` | :x: | :heavy_check_mark: | :x: :heavy_check_mark: | :heavy_check_mark: :x: |
| precompile `foo(::Int)` | :x: | :heavy_check_mark: | :x: :heavy_check_mark: | :heavy_check_mark: :x: |
| precompile `foo()`, `foo(::Int)` | :x: | :heavy_check_mark: | :x: :heavy_check_mark: | :heavy_check_mark: :x: |
| precompile `foo(::Int)`, `foo()` | :x: | :heavy_check_mark: | :x: :heavy_check_mark: | :heavy_check_mark: :x: |
(In the last two columns, the first result is of course redundant with the respective column before.) There are some aspects that make sense:

- Without precompilation, the first run of `foo` always causes inference of `hcat(::Float64)`. The second run doesn't, it uses the result obtained in the first run.
- It doesn't matter which `foo` is precompiled. Reasonable, as they both involve inference of `hcat(::Float64)`.
- If either `foo` method is precompiled, `foo(1)` does not need to infer `hcat(::Float64)`, but can apparently use the result from precompilation.

However, calling `foo()` always requires inference of `hcat(::Float64)`, unless it was already inferred by a call to `foo(1)` in the running session. (The only :heavy_check_mark: for `foo()` is in the top-right table cell.)
So if I may venture a generalization, it looks like this: when inference happens within a call chain that involves constant propagation, it cannot use inference results from precompile files, only those produced in the running session (and probably those from the sysimg?). Does that match your observations?
Very interesting! One thing to check my understanding: when you say "was inferred" you mean "had previously been inferred so that it didn't need to infer it again when I ran it," right? Moreover, in all cases you're loading the module from a cache file (`using Foo`) and not just defining it in the REPL. If I define `Foo` in the REPL, any `precompile` directive is enough to ensure that `hcat(::Float64)` does not need re-inference.
Your conclusion makes sense, and in particular suggests that this is not a limitation of our cache files but instead a constraint we might need on the current running session.
One other thing we don't do is cache `hcat(::Float64)`'s inference results. This appears to be a consequence of inlining, but presumably those results are available prior to the inlining decision, and I wonder if it would be useful to cache them in case they are handy on their own.
I also wonder if #41328 with `bar(x::Int) = @noinline hcat(rand())` would "fix" this. This isn't the way you want to fix it, but it would be a proof-of-principle.
> One thing to check my understanding, when you say "was inferred" you mean "had previously been inferred so that it didn't need to infer it again when I ran it," right?
You mean in "(:x: means `hcat(::Float64)` was inferred)"? Then no, it means that inference of `hcat(::Float64)` happened when running (as reported by `@snoopi_deep`). It looks (to me) as if it is always inferred during precompilation, but the result is not always re-used, so :x: means the result was not re-used (though it should be available).
> Moreover, in all cases you're loading the module from a cache file (`using Foo`) and not just defining it in the REPL.
Correct.
> If I define `Foo` in the REPL, any `precompile` directive is enough to ensure that `hcat(::Float64)` does not need re-inference.
So definition in the REPL with precompilation works just as well as running without precompilation.
> Your conclusion makes sense, and in particular suggests that this is not a limitation of our cache files but instead a constraint we might need on the current running session.
So to avoid any misunderstanding, this is what I see, if I interpret my results correctly:
| | `hcat(::Float64)` was inferred in running session | `hcat(::Float64)` inference results come from cache file |
|---|---|---|
| inference of new method without any const-propagation | :heavy_check_mark: inference result re-used | :heavy_check_mark: inference result re-used |
| inference of new method with some const-propagation | :heavy_check_mark: inference result re-used | :x: `hcat(::Float64)` is re-inferred |
Where I assume that if `hcat(::Float64)` shows up in the `@snoopi_deep` output, it was re-inferred; otherwise the cached inference result was used.
> One other thing we don't do is cache `hcat(::Float64)`'s inference results.
Ah. So it's actually the reuse of `bar`'s inference results that makes the difference...? Then it makes much more sense that the presence of const-propagation at that point causes a difference.
> This appears to be a consequence of inlining, but presumably those results are available prior to the inlining decision, and I wonder if it would be useful to cache them in case they are handy on their own.
If our analysis is correct up to here, then ... yes, that sounds useful.
> I also wonder if #41328 with `bar(x::Int) = @noinline hcat(rand())` would "fix" this. This isn't the way you want to fix it, but it would be a proof-of-principle.
Right, but I should be able to conduct the same experiment with
@noinline bar() = hcat(rand())
bar(x::Int) = bar()
right? Will do and report back...
> Right, but I should be able to conduct the same experiment with ...
Hm, no, inlining there doesn't matter, but the extra indirection helps. My new module:
module Foo
# as before
bar(x::Int) = hcat(rand())
# with extra indirection
@inline bar() = hcat(rand())
bar(x::Float64) = bar()
foo1() = bar(1) # previously known as foo
foo2() = bar(1.0) # with extra indirection
@assert precompile(foo1, ())
@assert precompile(foo2, ())
end # module
(Again defined in a package and loaded from a precompile file.) Then:
julia> print_tree(@snoopi_deep(Foo.foo1()), maxdepth=3)
InferenceTimingNode: 0.018147/0.045257 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000343/0.027110 on Foo.foo1() with 1 direct children
└─ InferenceTimingNode: 0.000385/0.026768 on Foo.bar(1::Int64) with 2 direct children
├─ InferenceTimingNode: 0.000289/0.002451 on rand() with 1 direct children
│ ⋮
│
└─ InferenceTimingNode: 0.000925/0.023932 on hcat(::Float64) with 13 direct children
⋮
This was to be expected given everything discussed above. But now in a fresh session:
julia> print_tree(@snoopi_deep(Foo.foo2()), maxdepth=3)
InferenceTimingNode: 0.019031/0.020039 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000323/0.001008 on Foo.foo2() with 2 direct children
├─ InferenceTimingNode: 0.000412/0.000412 on Foo.bar(::Float64) with 0 direct children
└─ InferenceTimingNode: 0.000273/0.000273 on Foo.bar(1.0::Float64) with 0 direct children
(Replacing the `@inline` with `@noinline` or leaving it off completely makes no difference.) Not sure what to make of it.
Trying to drill a little deeper, I've extended the `Foo` module above with two more functions
foo3() = bar()
foo4() = hcat(rand())
and their respective `precompile`s. Those give
julia> print_tree(@snoopi_deep(Foo.foo3()), maxdepth=3)
InferenceTimingNode: 0.019327/0.019715 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000388/0.000388 on Foo.foo3() with 0 direct children
and
julia> print_tree(@snoopi_deep(Foo.foo4()), maxdepth=3)
InferenceTimingNode: 0.018091/0.018091 on Core.Compiler.Timings.ROOT() with 0 direct children
(each in a fresh session). So `foo4` actually is the only one that does not need any re-inference. Trying to understand why, I did (in a fresh session again, of course):
julia> visit(Foo) do item
if item isa Core.CodeInstance
println(item.def, ": ", repr(item.max_world))
end
return item isa Union{Module,Method,Function,Core.MethodInstance,Core.CodeInstance}
end
MethodInstance for Foo.bar(): 0xffffffffffffffff
MethodInstance for Foo.bar(::Int64): 0xffffffffffffffff
MethodInstance for Foo.bar(::Float64): 0x0000000000000000
MethodInstance for Foo.foo1(): 0x0000000000000000
MethodInstance for Foo.foo2(): 0x0000000000000000
MethodInstance for Foo.foo3(): 0x0000000000000000
MethodInstance for Foo.foo4(): 0xffffffffffffffff
Comparing with the inference trees, this makes sense: inference can stop going deeper once it reaches a matching method/specialization for which there is a `CodeInstance` with `max_world` not zeroed out.
But why are some of the `max_world`s set to zero?
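(For reference, a small variant of the `visit` call above that just collects the zeroed-out entries, under the same assumptions as that snippet:)

```julia
invalidated = Core.MethodInstance[]
visit(Foo) do item
    if item isa Core.CodeInstance && item.max_world == 0
        push!(invalidated, item.def)   # the MethodInstance whose CodeInstance was never validated
    end
    return item isa Union{Module,Method,Function,Core.MethodInstance,Core.CodeInstance}
end
invalidated   # here: bar(::Float64), foo1(), foo2(), foo3()
```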
Ah, I think I've found something. It doesn't have anything to do with const-propagation directly, so sorry for hijacking this issue, but it should probably be sorted out before worrying about the interaction of precompilation and const-propagation. Here is what I believe is the cause of all the invalidated `CodeInstance`s (those with `max_world == 0`): when deserializing the module, we first initialize `max_world = 0` everywhere; then, at the end, we go through the edges to external methods, check their validity, and if they are still valid, insert the needed backedges and mark the caller as valid by setting `max_world` to `0xff...f`. But we only do this for methods which have edges to external methods. For those which have only module-internal callees, we completely skip this step, which is fine as far as inserting missing backedges goes (there aren't any), but we fail to mark them valid.
So I've tried reversing the logic, starting with all `CodeInstance`s valid and then marking the invalid ones as such. That made all the woes I had observed above go away; nothing needs to be re-inferred. For reference, this is the patch I've used:
diff --git a/src/dump.c b/src/dump.c
index 49fa6efa43..167acbf70a 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -1616,8 +1616,10 @@ static jl_value_t *jl_deserialize_value_code_instance(jl_serializer_state *s, jl
codeinst->precompile = 1;
codeinst->next = (jl_code_instance_t*)jl_deserialize_value(s, (jl_value_t**)&codeinst->next);
jl_gc_wb(codeinst, codeinst->next);
- if (validate)
+ if (validate) {
codeinst->min_world = jl_world_counter;
+ codeinst->max_world = ~(size_t)0;
+ }
return (jl_value_t*)codeinst;
}
@@ -2042,15 +2044,15 @@ static void jl_insert_backedges(jl_array_t *list, jl_array_t *targets)
jl_method_table_add_backedge(mt, callee, (jl_value_t*)caller);
}
}
- // then enable it
+ }
+ else {
+ // invalid, so disable it
jl_code_instance_t *codeinst = caller->cache;
while (codeinst) {
if (codeinst->min_world > 0)
- codeinst->max_world = ~(size_t)0;
+ codeinst->max_world = 0;
codeinst = jl_atomic_load_relaxed(&codeinst->next);
}
- }
- else {
if (_jl_debug_method_invalidation) {
jl_array_ptr_1d_push(_jl_debug_method_invalidation, (jl_value_t*)caller);
loctag = jl_cstr_to_string("insert_backedges");
Don't know whether this is 100% sound, it might be too optimistic, but what we have at the moment is definitely too pessimistic. @vtjnash?
> Don't know whether this is 100% sound
Of course it's not, as it is missing transitive invalidations. By which I mean: if `internal_func1()` only calls `internal_func2()`, which in turn calls `external_func()`, then at the moment we always mark `internal_func1()` invalid, no matter whether `external_func()` was actually changed, thereby invalidating `internal_func2()`. That's too pessimistic. With the above patch, however, we never mark `internal_func1()` invalid, even if `external_func()` was changed, which is clearly too optimistic.
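A tiny hypothetical layout of that transitive case, for concreteness (names invented; `something` from Base stands in for `external_func`):

```julia
# Imagine this module being precompiled into a .ji file.
module TransitiveDemo
internal_func2() = something(nothing, 42)    # has a direct edge to an external (Base) method
internal_func1() = internal_func2() + 1      # only module-internal callees, no external edges
precompile(internal_func1, ())
end
```

With the patch, `internal_func1`'s cached code would always be kept valid, even if the external method behind `internal_func2` had changed in the meantime.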
But even though we thus invalidate too little with this patch, it seems to have close to zero effect latency-wise on real-world packages. Probably a large fraction of functions without external dependencies are trivial wrappers (e.g. providing default args) which don't add any relevant amount of time to inference, so invalidating them unnecessarily isn't too bad. But for the kind of analysis we are doing here, it would still be nice to get rid of these, I guess. (Also, the fact that `@snoopr` doesn't report them further hinders analysis.)
It seems we could solve this by maintaining an `htable` mapping `MethodInstance => handled::Bool`, and after the invalidation-linked pass go through the entire table and look for not-yet-handled instances. Then we mark them all as valid.
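A minimal sketch of that bookkeeping, in plain Julia on toy data structures rather than the real `dump.c` ones, just to illustrate the two passes:

```julia
struct FakeCodeInstance
    name::Symbol
    external_edges::Vector{Symbol}   # names of external callees; empty if purely module-internal
end

function validate(cis::Vector{FakeCodeInstance}, still_valid)
    handled = Dict{Symbol,Bool}()                                  # stands in for the proposed htable
    world   = Dict{Symbol,UInt}(ci.name => UInt(0) for ci in cis)  # every max_world starts at 0
    for ci in cis                       # pass 1: entries with external edges
        isempty(ci.external_edges) && continue
        handled[ci.name] = true
        if all(still_valid, ci.external_edges)
            world[ci.name] = typemax(UInt)                         # validate
        end                                                        # else: leave at 0 (invalidated)
    end
    for ci in cis                       # pass 2: everything pass 1 never touched
        get(handled, ci.name, false) && continue
        world[ci.name] = typemax(UInt)  # purely internal, so nothing external can have invalidated it
    end
    return world
end

validate([FakeCodeInstance(:internal_only, Symbol[]),
          FakeCodeInstance(:has_external_edge, [:changed_callee])],
         callee -> false)   # pretend every external callee changed
# => :has_external_edge stays at 0, :internal_only gets typemax(UInt)
```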
Let's take this to #41872 and then come back to interaction with const-specialization here eventually.
(I am not 100% certain this is the right title for this issue, but it seems likely to be the cause.)
TL;DR: In reference to https://github.com/JuliaLang/julia/issues/30488#issuecomment-750783623, this illustrates at least one (and probably more than one) problem with precompilation. I think the most fundamental issue is that our backedges only record MethodInstances and not optimizations due to constant-propagation/specialization; when you go to use the code, Julia seems to decide that it needs to re-optimize the code and thus runs a fresh round of inference on it. A potential solution is to incorporate `Const`-data into our backedges, and cache all specializations. (There are potential cache-blowup issues to worry about, of course.) An alternative is to just use the partially-specialized code and not run inference again.

It's worth noting that the tiny function in this demo accounts for the majority of remaining (aka, unfixed) inference time in https://github.com/JuliaImages/ImageFiltering.jl/pull/201, despite being outweighed by at least 100x in terms of the total amount of code that needs to be inferred to run `imfilter`. Another example I've encountered occurs in `JuliaInterpreter.step_expr!`, where I suspect `istoplevel::Bool` suffers from the same problem due to well-intentioned but very unhelpful `Const`-specialization. (I'm finding this issue to be a bigger problem with LoopVectorization, where it crops up for dozens of methods and is the main source of currently-unfixable "inference triggers.") Overall, this is a sign of just how well `precompile` works in many circumstances. The failures illustrated here are serious but specific.

Demo:

Session (requires the master branch of SnoopCompile):
Generally I would have expected no inference at all, given that we already had the MethodInstance. Note that in the tree output, you can see the `Const`-specialization from sequences like `foo(1::Int64)`, which indicate that `foo` was `Const`-specialized with an argument of value `1::Int64`.

It's worth starting a fresh session and doing the same analysis for `gethiddengaussian`, to see that it has different entrance-points into inference. But if you directly precompile `hiddengaussian` (uncommenting that line), then you get back to something almost identical to `getgaussian`.