
new call site inference algorithm #34742

JeffBezanson commented 4 years ago

This issue elaborates on parts of #33326.

One of the most crucial choices in our type inference algorithm is what to do when a call site has multiple possible targets (i.e. the inferred argument types aren't specific enough to know exactly which method would be called). Currently, we simply infer up to 4 possible targets, and if more than 4 match we give up and say Any. So for example you can get the following behavior:

julia> f(::Int) = 0;
julia> f(::String) = 0;
julia> f(::Dict) = 0;
julia> f(::Array) = 0;

julia> g() = f(Any[1][1]);

julia> @code_typed g()
...
) => Int64

julia> f(::Tuple) = 0;

julia> @code_typed g()
...
) => Any

This is clearly non-ideal, since the inferred type changes abruptly when a fifth unrelated method is added. To some extent this is unavoidable --- any inference of call sites with inexact argument types might change if a method is added or its signature changes. But, a new algorithm could still do better in three ways: (1) be less arbitrary and more intuitive, (2) make inference generally faster, (3) reduce the likelihood of inference "blowups" as in issues like #34098, #34681.

I have tried a couple alternate algorithms; data incoming!

JeffBezanson commented 4 years ago

This table compares three algorithms on julia 1.4-rc1. The first 4 rows are the standard algorithm. Algorithm A bails out as soon as both the inferred argument types and the return type are abstract. Algorithm B replaces abstract argument types with the signature in the definition (for example, you define f(x::Number) and we know x is an Integer, but we infer f(::Number); the idea is to infer far fewer novel abstract signatures). In all cases there is still a max_methods parameter, m_m in the table.

| algorithm  | Base  | Stdlib | pre | sys.so | #34098 | #34681 | plot  | fails |
|------------|-------|--------|-----|--------|--------|--------|-------|-------|
| m_m = 1    | 31.5  | 49.8   | 121 | 127    | 0.08   | 2.14   | 13.74 | 9     |
| m_m = 2    | 35.6  | 50.8   | 131 | 134    | 0.22   | 2.61   | 14.38 | 4     |
| m_m = 4    | 35.8  | 58.9   | 143 | 146    | 42.36  | 100.49 | 16.53 | 0     |
| m_m = 10   |       |        |     |        |        |        |       |       |
| A, m_m=4   | 34.2  | 53.1   | 135 | 138    | 0.21   | 2.63   | 15.27 | 9     |
| A, m_m=10  | 37.7  | 54.4   | 144 | 146    | 0.21   | 2.84   | 16.26 | 9     |
| B, m_m=4   | 40.8  | 64.2   | 126 | 138    | 0.15   | 2.45   | 15.23 | 3     |
| B, m_m=10  | 238.5 |        |     |        |        |        |       |       |
| C          | 33.6  | 51.6   | 127 | 131    | 0.13   | 3      |       |       |

Column meanings:
- Base - Base time during build
- Stdlib - Stdlib time during build
- pre - time to generate precompile statements
- sys.so - size in MiB
- #34098, #34681 - the slow times from those issues
- plot - @time display(plot(rand(10))) first time
- fails - number of failures in compiler/inference tests

Note: I excluded the patch that fixes #34098.

JeffBezanson commented 4 years ago

Patch for algorithm A:

--- a/base/compiler/abstractinterpretation.jl
+++ b/base/compiler/abstractinterpretation.jl
@@ -98,6 +98,10 @@ function abstract_call_gf_by_type(@nospecialize(f), argtypes::Vector{Any}, @nosp
                 this_rt === Any && break
             end
         else
+            if max_methods != -1 && i > 1 && (!isdispatchtuple(sig) && !isdispatchelem(rettype))
+                rettype = Any
+                break
+            end
             this_rt, edgecycle1, edge = abstract_call_method(method, sig, match[2]::SimpleVector, multiple_matches, sv)
             edgecycle |= edgecycle1::Bool
             if edge !== nothing

Patch for algorithm B:

--- a/base/compiler/abstractinterpretation.jl
+++ b/base/compiler/abstractinterpretation.jl
@@ -98,7 +98,17 @@ function abstract_call_gf_by_type(@nospecialize(f), argtypes::Vector{Any}, @nosp
                 this_rt === Any && break
             end
         else
-            this_rt, edgecycle1, edge = abstract_call_method(method, sig, match[2]::SimpleVector, multiple_matches, sv)
+            sparams = match[2]::SimpleVector
+            if max_methods != -1 && multiple_matches && !isdispatchtuple(sig)
+                _sig = sig = method.sig
+                _sp = []
+                while _sig isa UnionAll
+                    push!(_sp, _sig.var)
+                    _sig = _sig.body
+                end
+                sparams = svec(_sp...)::SimpleVector
+            end
+            this_rt, edgecycle1, edge = abstract_call_method(method, sig, sparams, multiple_matches, sv)
             edgecycle |= edgecycle1::Bool
             if edge !== nothing
                 push!(edges, edge)

Other suggestions welcome!

staticfloat commented 4 years ago

Is there a way to quantify the performance impact of incompletely-inferred chunks of code? For instance, if I have a large chunk of code that calls functions with many methods, I imagine I will run afoul of the max_methods threshold quite often, and as such will be unable to compile a large, statically-inferred chunk of code. This means that my runtime will have more instances of the compiler needing to stop, look around, and choose a call target given the concretely-known (at runtime) types, correct? Even if that is fast, it seems like it would be good to quantify exactly what the performance impact of such a thing is, especially if it's within a tight loop or something.

The columns you've listed so far seem like they would capture that time difference, but I'm not sure how much of the variation is due to inference taking longer versus runtime being slower. Perhaps what we can do is hook up the timing profiler to a known workload, then find all backtraces that point to the dynamic dispatch locations to get a measurement of the % of time spent in dispatch overhead versus actually running user workload?

JeffBezanson commented 4 years ago

Yes, good point, these numbers don't really say much about run-time performance impact. The cases measured here are basically all compile time. However, these changes only affect code where we were already unable to infer concrete types, so dynamic dispatch was already happening. "Fast"/fully-statically-typed code should not be affected.

The bad case would be where inferring a more accurate abstract type somewhere allows much more accurate results elsewhere. That does happen, but it's fairly rare. I doubt we have many (or any) benchmarks that capture that kind of case, since we only tend to benchmark things that we feel are or should be really highly optimized.

staticfloat commented 4 years ago

> However, these changes only affect code where we were already unable to infer concrete types, so dynamic dispatch was already happening. "Fast"/fully-statically-typed code should not be affected.

I thought that the rows in your table where m_m != 4 would change the runtime characteristics. Wouldn't m_m == 1 potentially cause many more cases of dynamic dispatch than would otherwise be happening? From your response, I'm guessing you are mentally excluding all the m_m != 4 rows from consideration?

JeffBezanson commented 4 years ago

It could cause more dynamic dispatch, but why many more cases? Maybe give an example of a change in how code might be optimized, and we can discuss which rows of the table could cause it?

If a call has multiple targets (at inference time), it would already ~always have been a dynamic dispatch. So performance will only change if (1) there were 4 or fewer possible targets, (2) inferring all of those and unioning their return types gave useful type information, and (3) subsequent calls using the result could be narrowed down to 1 target based on that information. While that is definitely possible I don't think it's common. Put differently, if this change slows down your code, then a package somewhere adding a 5th matching method would also have slowed down your code, so you're in a bit of a fragile position anyway. Is this the sort of situation you're thinking of?
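
To make that concrete, here is a toy sketch (all names fresh, not from the example above) of the fragile-but-possible case:

fa(::Int) = 1
fa(::String) = 2
ga(y::Int) = y + 1           # the only method of ga
ha(x) = ga(fa(x))            # both matching methods of fa return Int, so even
                             # with x unknown the ga call has a single target
# Once more methods of fa match than max_methods allows (or max_methods is
# lowered to 1), fa(x) widens to Any and the ga call becomes dynamic dispatch.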

staticfloat commented 4 years ago

> Put differently, if this change slows down your code, then a package somewhere adding a 5th matching method would also have slowed down your code, so you're in a bit of a fragile position anyway. Is this the sort of situation you're thinking of?

Yes, that's precisely what I'm thinking. I don't have an intuition as to where this is or is not the case; perhaps most functions have either a single method or far more than 4, I have no idea. I don't want to drag this issue off-topic, since I think it is more focused on compile-time than runtime performance, but I did my best to use the toy example from above to construct a pathological case. Here we have a function that does essentially nothing, so that we can attempt to see the effect of dynamic dispatch on runtime performance:

using InteractiveUtils, BenchmarkTools, Profile, Printf

f(::Int) = 0
f(::String) = 0
f(::Dict) = 0
f(::Array) = 0

function f_loop(N)
    x = Any[1][1]

    accum = 0
    for idx in 1:N
        accum += f(x)
    end
    return accum
end

Benchmarking this with the following harness:


function do_benchmark(N)
    ci, rettype = @code_typed f_loop(N)
    println(" -> Inferred as returning: $(rettype)")

    # Benchmark it
    perf = @benchmark f_loop($N)   # interpolate N so the benchmark uses the local value
    min_time = minimum(perf).time
    println(@sprintf(" -> benchmark minimum:  %.2f ms", min_time/1e6))

    # Also do a Profile-based estimation of the amount of time we spend in `jl_apply_generic`:
    Profile.clear()
    @profile f_loop(N)
    prof_data, lidict = Profile.retrieve()
    lilist, inclusive, self, num_snapshots = Profile.parse_flat(Profile.StackFrame, prof_data, lidict, true)
    # The following line is probably not 100% correct; but it gives us a reasonable number, so I'm keeping it.
    num_gf_snapshots = sum([self[idx] for idx in 1:length(lilist) if basename(string(lilist[idx].file)) == "gf.c"])
    gf_proportion = num_gf_snapshots/num_snapshots
    println(@sprintf(" -> estimated gf overhead percentage:  %.2f%%", 100*gf_proportion))

    return min_time, gf_proportion
end

Invoking it with N = 10000000, first for the four-method case and then for the five-method case, lets us estimate the runtime impact of these algorithms:

N = 10000000
println("Four methods run of f_loop():")
fast_time, fast_gf_proportion = do_benchmark(N)

f(::Tuple) = 0
println("Five methods run of f_loop():")
slow_time, slow_gf_proportion = do_benchmark(N)

println(@sprintf("Estimated dynamic dispatch overhead per call: %.2f ns", (slow_time - fast_time)/N))

This results in the following timings:

Four methods run of f_loop():
 -> Inferred as returning: Int64
 -> benchmark minimum:  5.13 ms
 -> estimated gf overhead percentage:  0.00%
Five methods run of f_loop():
 -> Inferred as returning: Any
 -> benchmark minimum:  220.30 ms
 -> estimated gf overhead percentage:  78.90%
Estimated dynamic dispatch overhead per call: 21.52 ns

Now, I don't think 20 ns per dynamic dispatch is very bad! (That estimate is actually fairly stable for me across a sweep of N.) On my 3.3 GHz processor that's roughly 60-70 clock cycles, i.e. less than two L3 cache hits. Not that I think there's a lot of optimization to be done here; I just think it's interesting to note the potential overhead of a function call when we miss an inference opportunity.

perrutquist commented 4 years ago

Would it make sense to keep a table that tracks some kind of "widest possible return type" for each function, and return that (instead of Any) when inference gives up?

In the example from the top post: When the fifth method: f(::Tuple) is added, then its widest possible return type is computed, and the result (Int64) is merged with the previously widest return type of f (which was also Int64, so the entry does not need to be updated in this case). When type inference gives up, due to there being > 4 targets, it can now return Int64 instead of Any. (A further optimization would be to skip inference entirely when the "widest return type" is already a concrete type.)
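
A minimal sketch of the proposed table, with hypothetical names (nothing like this exists in the compiler today), might look like:

const WIDEST_RT = IdDict{Any,Any}()

# Merge a newly inferred method return type into the function's entry;
# the entry only ever widens.
function widen_entry!(f, rt)
    old = get(WIDEST_RT, f, Union{})
    WIDEST_RT[f] = Union{old, rt}
end

# When call-site inference hits the max_methods limit, it would return
# get(WIDEST_RT, f, Any) instead of Any.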

A lot of functions probably already have well-defined output types. For example, it seems like there are 81 methods of length in Base, all of which return Int64. Hence a statement like length(Any[1][1]) could be inferred to return an Int64, as long as the user has not added any methods to length that return something else. (The "performance tips" section of the manual could mention this.)

Edit: bad example, sorry. There are indeed some methods of length that have unknown return types. (There was a bug in the for-loop that I wrote to check.) There are probably a fair number of functions that have a widest return type of Union{Missing, Bool}, though (or that could have, with some minor tweaking).

The drawback would be that adding a method to a function potentially invalidates the widest possible type not only for that function, but also, recursively, for any function that calls it. However, each function's entry would be updated at most 2-3 times (e.g. concrete type -> small type union -> Any), so the total amount of work would still be linear in the number of functions.

StefanKarpinski commented 4 years ago

Perhaps I'm misunderstanding the suggestion, but the trouble is that you have to analyze all of the methods that could possibly be called in order to know the widest possible return type. It cannot return Int64 without analyzing the other methods and proving that they all return Int64.

JeffBezanson commented 4 years ago

That idea does make sense. Actually it's similar to my "algorithm B" above, which infers methods on their defined signatures. Those are all cached, so once they are inferred once each one is just a lookup. Of course that is still O(n) in the number of methods, so additionally caching the overall type might be an improvement.

StefanKarpinski commented 4 years ago

Let me see if I'm understanding algorithm B. Currently, inference is run on each method with the tightest types known for the possible arguments, and the return type and everything else are inferred from that. A call site's argument types must have non-empty intersection with a method's signature for the method to be considered—since otherwise it's not applicable—but the inference result can't simply be reused, because it is, in general, bespoke to a given call site. The idea behind algorithm B is to run inference on the types of the method's declared signature, which are more general than the call-site types intersected with that signature—and therefore potentially less informative—with the upside that the result can be cached and reused per method instead of per call site.
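
As a hedged illustration of the difference:

f(x::Number, y::Number) = x + y
# At a call site where the arguments are known to be (Integer, Float64):
#  - current algorithm: infer the call-site-specific intersection
#      Tuple{typeof(f), Integer, Float64}
#  - algorithm B: infer the declared signature
#      Tuple{typeof(f), Number, Number}
# The latter is less precise, but the result is cached once per method and
# reused by every call site that dispatches to it.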

JeffBezanson commented 4 years ago

Yes exactly.

perrutquist commented 4 years ago

I guess that what I suggested is very similar to algorithm B, but doing the extra work of computing the widest type per function (as if m_m were ∞). In a lot of cases, the "widest type" will reach Any fairly quickly, at which point it never has to be updated again. (However, in those cases there was also no benefit in computing the widest type.)

In the useful cases, having a ceiling on the return type would speed up inference. Whether the net result is positive or negative is hard to guess.

Widest types could also be computed eagerly during module precompilation and/or in a background thread. (As can algorithm-B return types.)

JeffBezanson commented 4 years ago

We're doing some early experiments trying to compute and store the widest possible return type for each function. So far it seems to be quite expensive --- inferring all methods (and often on abstract argument types) does more work in many cases than we do now. Currently, we're usually able to avoid inferring many methods entirely, and when we do infer them it's often on concrete types which is more efficient. So this might not be viable.

rfourquet commented 4 years ago

Currently, one of the rules for programming in Julia is that type annotations in method signatures are only for dispatch, and beginners are regularly corrected in their belief that they have any impact on performance. Wouldn't algorithm B invalidate this rule, and "force" us (when maximal performance is sought) to write concretely-typed signatures for some functions where abstractly-typed signatures would lead to non-concretely inferred code? (Sorry if I've misunderstood this topic entirely.)

StefanKarpinski commented 4 years ago

What about computing the widest return type for each concrete method signature and doing so lazily? (Perhaps that’s not any different from what is currently being done.)

JeffBezanson commented 4 years ago

> Wouldn't algorithm B invalidate this rule, and "force" us (when maximal performance is sought) to write concretely-typed signatures for some functions

That can already happen. Method signatures can affect inference precision. Say you have a call f(x) where the type of x is unknown. If the single method for f is declared as f(y::Int), then we know x is an Int. If it's declared f(y::Any) we don't get any extra information.
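
The same scenario as code (a fresh toy example):

f(y::Int) = y + 1   # the only method of f
g(x) = f(x)         # x is Any here, yet inference knows f(x)::Int, because
                    # every method of f takes an Int; declaring f(y::Any)
                    # would surrender that information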

But in that scenario, the performance issue is the call site f(x) where x's type is unknown. That's what would need to be fixed for performance, not the signature of f (although changing the signature of f would help somewhat). That's what we mean by "signatures don't affect performance". That guideline also reflects the fact that signatures don't (usually) affect specialization.

Anyway, that rule of thumb only exists to counter the first impression people sometimes get that "my code will be really slow unless I declare types on all arguments"; it's not a formal property.

> (Perhaps that’s not any different from what is currently being done.)

Yes; when we're lucky enough that a method signature is concrete we will infer it as soon as we see a possible call to it, and cache the result. Unfortunately most functions have some methods with abstract signatures.
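
For example:

h(x::Int, v::Vector{Float64}) = x + length(v)
# Tuple{typeof(h), Int, Vector{Float64}} is concrete, so this method is
# inferred and cached as soon as any call that might reach it is seen.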

StefanKarpinski commented 4 years ago

> What about computing the widest return type for each concrete method signature and doing so lazily?

To elaborate on my suggestion, since it's kind of vague: the idea here is that instead of intersecting type information from the call site and the method that might get called, you broaden the type information from the call site to the signature of the method, then infer and cache that, the premise being that different call sites might project onto the same type information for the same method. In writing that, I realize this can be generalized: any way of projecting call-site information onto a function or method that is guaranteed to contain the actual return type could be used. The question is which projection works best, in the sense of being often reusable while still giving usable type information.

JeffBezanson commented 4 years ago

I think that's basically algorithm B.

perrutquist commented 4 years ago

A few things that could make the widest-type computation faster. (I don't know if any of these have already been tried.)

  1. For tail-recursive methods, such as f(x::WrapperType) = f(x.data), no inference needs to be done. Such methods cannot make the widest type any wider. (The same applies to branches that end in a tail call to the same function.) See the sketch after this list.
  2. Performing inference on the methods with the widest input types first would increase the likelihood of encountering an output type of Any early on, thus reducing the number of methods that need to be inferred for that function.
  3. The threshold at which a Union is widened to Any could be lower when searching for widest-type than when doing normal inference. (This would also reduce the number of methods that inference needs to look at.)
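
As a sketch of the first point (the wrapper type is assumed for illustration):

struct WrapperType
    data::Int
end
f(x::WrapperType) = f(x.data)   # forwards; its return type equals f(::Int)'s,
                                # so this method cannot widen the overall type
f(x::Int) = x                   # only this method needs to be inferred
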
vtjnash commented 4 years ago

Closes #23210

JeffBezanson commented 4 years ago

I added algorithm C to the table above. This sets max_methods to 1, unless the function has fewer than 5 methods, in which case we use max_methods == 4. The intuition is that it's more valuable to infer multiple methods if that means inferring all methods. (My implementation of this isn't perfect but should be enough to estimate performance.) This could be a cheap way to lessen the impact of lost inference precision. Otherwise I would almost recommend just setting max_methods to 1; it's definitely the fastest and is easy to explain and non-arbitrary.
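
In rough pseudocode, with a hypothetical helper name, the heuristic is:

# Use the full max_methods budget only when it covers the whole method table.
effective_max_methods(f) = length(methods(f)) < 5 ? 4 : 1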

I also explored trying to do even less inference than max_methods==1, by recursively inferring only if the known argument types are concrete. That made compilation significantly slower, which contributes to my sense that max_methods==1 is fairly optimal.

JeffBezanson commented 4 years ago

Here are some new numbers and a new candidate algorithm. I had to take a new baseline because unfortunately most of the times have regressed over the past couple months.

Algorithm D uses max_methods==4 like we do now, except it bails out immediately if more than one intersected signature is abstract. The idea is that it's basically always useful and OK to infer concrete signatures, since you will probably need them eventually, but the number of abstract signatures we infer per call site should be limited.

| algorithm | Base | Stdlib | pre | sys.so | TTFP | TTSP | fails |
|-----------|------|--------|-----|--------|------|------|-------|
| baseline  | 39.0 | 56.0   | 151 | 142    | 13.6 | 10.9 | 0     |
| D         | 34.8 | 51.6   | 134 | 130    | 12.8 | 6.6  | 7     |

TTFP = time to first plot. TTSP = time to second plot, after loading SIMD.jl.

oscardssmith commented 4 years ago

what's pre in the chart?

JeffBezanson commented 4 years ago

Time to generate precompile statements (see above)

timholy commented 4 years ago

I'd love it if we could get away from a max # of methods altogether, since that would make the compiler completely predictable. To me it seems that the most important thing about inferring abstract calls is the predictability that the result type gives for future calls, i.e., in

h(c) = g(f(c[1]))

the benefit of successfully deducing that all methods of f return an Int is that you can look up the call of g ahead of time.

So here's a more radical proposal: what if we reserve union-splitting, as it currently exists, for cases where inference returns an explicit Union (with all concrete types) for c[1], and we do all abstract dispatch at runtime? But we also extend the current callsite-specific dispatch to support "multiple lookup." The compiled version of h(::Vector{Any}) might look something like this:

function h(c)
    x = c[1]
    d = dispatch_info(h, (typeof(x),))
    y, d = dispatch(d, f, x)
    z, _ = dispatch(d, g, y)
    return z
end

where that dispatch_info call would, after having been educated from previous runs, return a "choose your own adventure" nested list something like this:

gcall1 = svec(Int) => ginstance1      # ginstance1::MethodInstance
gcall2 = svec(Tuple{Float32, String}) => ginstance2

fcall1 = (svec(Int) => finstance1) => gcall1             # if you call f(::Int), then you'll call g(::Int)
fcall2 = (svec(String) => finstance2) => gcall1          # if you call f(::String), you'll also call g(::Int)
⋮
fcalln = (svec(Tuple{Int,Int}) => finstancen) => gcall2  # but f(::Tuple) returns (Float32, String), so you'll call g(::Tuple{Float32, String})

d = [fcall1, fcall2, ..., fcalln]

Then, as soon as you match the signature triggering the fcall1 adventure, you already have a pre-populated dispatch table for the remaining calls. In favorable cases all of the remaining calls will be simple type-checks that pass, and you can directly invoke the supplied MethodInstance.

Not saying this is easy to implement; when you have multiple objects that end up determining how calls dispatch, I wonder if this too might blow up. But such strategies seem to deserve at least a moment's reflection.

A feature of this proposal is that we'd never run inference with "intermediate" abstract types: all argument types would either be concrete or would act like Any, and as soon as a lookup involves an Any you switch to dynamic mode. In my investigations of invalidation I've noticed that we may have a problem stemming from a growing number of @nospecialize methods, which trigger inference of callees with abstract input types, even if the callees are not so annotated. Currently these are a sizable source of invalidations.
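
For reference, a toy instance of that pattern:

outer(@nospecialize(x)) = inner(x)   # x is treated as Any inside outer
inner(x) = x + 1                     # so inner gets inferred for x::Any,
                                     # even though inner isn't annotated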

timholy commented 4 years ago

An alternative implementation of this general idea might be to support "method fragments." Using the example above, the compiled version of h(::Vector{Any}) would look like

function h(c::Vector{Any})
    x = c[1]
    _h_fragment(x)
end

_h_fragment(x) = g(f(x))

You dynamically look up which instance of _h_fragment to call, but this gives you a chance to use inference through two calls.

If h instead were

function h(c)
    s = 0
    for x in c
        y = f(x)
        z = g(y)
        s += k(y, z)
    end
    return sin(s) + 2*length(c)
end

then the compiled version would look like

function h(c::Vector{Any})
    s = 0
    for x in c
        s = _h_fragment1(s, x)
    end
    return _h_fragment2(s, 2*length(c))    # 2*length(c) could either stay in the "parent" or move to fragment (its result is inferrable)
end

function _h_fragment1(s, x)
    y = f(x)
    z = g(y)
    return s + k(y, z)
end

_h_fragment2(s, m) = sin(s) + m

On balance I suspect method fragments would be more performant than "multiple lookup," while perhaps also being easier to implement. You basically terminate inference as soon as you get to a non-concrete type and split out the remainder of the lowered code, respecting scope and block boundaries. Inference still needs to handle abstract types to compute return types, but most type-inferred method bodies with non-concrete ssavalues become pretty short.

The concern is that this might generate too many methods, but since they would be shorter, I wonder if that would be a smaller problem than it is now?

perrutquist commented 4 years ago

This wouldn't have to affect the size of the global method table, since each function could have its own "fragment table", right? (Fragments cannot be called from anywhere else.)

If this is implemented, then the user would no longer have to worry about inserting function barriers, as the compiler would do it automatically!
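
For comparison, here is the manual function-barrier idiom (from the manual's performance tips) that fragments would, in effect, apply automatically:

function process(c::Vector{Any})
    s = 0
    for x in c
        s += _kernel(x)   # one dynamic dispatch per element; _kernel's body
    end                   # is compiled specialized for each concrete type of x
    return s
end
_kernel(x) = 2x + 1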