WebAssembly / gc

Branch of the spec repo scoped to discussion of GC integration in WebAssembly
https://webassembly.github.io/gc/

Guiding Speculation #251

Closed RossTate closed 2 years ago

RossTate commented 3 years ago

If we're exploring speculation now, it would be good to provide the generator with a means to guide it. This would both help make better use of speculation and help make its effects more predictable.

At present, it sounds like speculation is being done at the call-site level. But other VMs do speculation at the "class" level. This has the effect of grouping a bunch of call sites (e.g. calls to methods on the same receiver) and propagating speculation into inlined calls (that is, if one inlines a method implementation that calls a method on this, then that method can be devirtualized rather than having to be respeculated).

One way we could support this is by adding an instruction like guarded_inline $local $field num instr* end. This instruction informs the engine to make a guarded inline cache with num cases. If the value of struct.get $field (local.get $local) matches one of these cases, then the corresponding optimized code is jumped to. That optimized code is instr* compiled with the assumption that struct.get $field (local.get $local) returns that value (where $field must be immutable, and assuming $local doesn't change within instr*).

For example, if $field is a v-table, then this code will be optimized for that particular v-table. In particular, the results of any loads (say of funcrefs) from immutable fields of this v-table can be inlined, as can any calls to those loaded funcrefs. Similarly, if i-tables are made immutable, then the result of the entire search through the i-table can be inlined, as can the call on the retrieved funcref. This recreates the "class"-level speculation used by other VMs.
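
To make this concrete, here is a rough sketch of what a v-table-guarded call could look like with the proposed instruction. It is only a sketch: the type, field, and signature names ($Object, $VTable, $vtable, $foo, $foo_sig) are invented for illustration, the folded call_ref form is an assumption about syntax, and guarded_inline itself is just the instruction proposed above, not something any engine implements.

    (func $invoke_foo (param $obj (ref $Object)) (result i32)
      ;; Guarded inline cache with 2 cases, keyed on the value of the
      ;; immutable $vtable field of $obj.
      guarded_inline $obj $vtable 2
        ;; Within each case the engine may assume it knows the exact v-table,
        ;; so the funcref loaded from the immutable $foo slot is a known
        ;; constant and this call_ref can be devirtualized (and inlined).
        (call_ref $foo_sig
          (local.get $obj)
          (struct.get $VTable $foo
            (struct.get $Object $vtable (local.get $obj))))
      end
    )

An engine that ignores the hint can simply compile the body as ordinary unguarded code, which is why this has no effect on semantics.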

(If we go with the nominal approach and furthermore add direct support for adding fields—such as v-table funcrefs—to the object descriptor of a given nominal type, then this becomes even easier to guide.)

jakobkummerow commented 3 years ago

If we're exploring speculation now

I think this premise alone deserves quite a bit of debate. I see speculation first and foremost as an engine-internal implementation detail. Engines are free to do it, or not do it, or do it sometimes, or whatever. I think standardizing a programmable (i.e. controllable by the module) speculation or inline-caching system (which are two distinct things, btw) in engines is waaay out of scope for WasmGC MVP; and it remains to be seen/discussed whether we'll ever want to do that.

RossTate commented 3 years ago

What I described has no effect on semantics, so engines are free to do it or not do it. It's just like branch hinting. And like branch hinting, this seemed like a space where generator guidance could be very useful, so I thought an explicit instruction (which could be in a custom section instead) might help.

kripken commented 3 years ago

@RossTate Do you have a sense of how important class-level speculation is? If there are impls that get close to JVM performance using lower-level primitives, that would be interesting.

RossTate commented 3 years ago

Our ahead-of-time and separate compiler (though with post-linking optimizations, such as inlining of direct calls) seems to get close to JVM's best performance except in cases where class-level speculation enables substantial eliminations of memory allocations. (I say "best" because the JVM performs very inconsistently across runs, with runs even after warmup often being far slower than our system, whereas our compiler performs fairly consistently across runs.) But even in those cases, our compiler still outperforms Node.js. So my sense is that speculation does not need to be important for WebAssembly to outperform Node.js. As for low-level primitives, the only one we use for our Java-like subset (our compiler supports two interoperating languages) that is not currently supported in WebAssembly is call tags (which we use for interface-method dispatch and for lambdas). But we also don't have the overheads in #249, which when artificially encoded do make our performance worse than Node.js.

Another thing to consider is that one of the main applications of speculation in the JVM is to address its separate compilation and dynamic linking model. For example, the JVM speculates that a class has no subclasses or that an interface has only one or two implementing classes. But if you're doing whole-program compilation, then you can already catch a lot of these cases through a simple analysis (i.e. what classes have instances that can reach this point in the program). So if you're already doing whole-program compilation and optimization, including devirtualization, there will be far fewer opportunities for JVM-style speculation and dynamic recompilation to have an impact.

kripken commented 3 years ago

Interesting, thanks @RossTate !

If I want to read more on this, is there a paper perhaps? I'd like to understand details of the optimizations your compiler does.

titzer commented 3 years ago

The issue of static versus dynamic compilation of languages is an age-old one. Ever since Java's inception, there has been a spate of work on static, whole-program compilation of Java programs. There are probably hundreds of papers on the subject and dozens of different attempts at it. I think it's an unsettled question. What is clear though, is that Java and other languages have dynamic code generation facilities (i.e. class loading) that make most forms of static whole program optimization impossible. Reflection and metaprogramming happen a lot in practice in Java, not the least of which is due to new kinds of dependency injection frameworks and the like. So we cannot bet everything on static optimization.

I think we might be getting ahead of ourselves by concluding too much from a single benchmark. Until we have significant applications and benchmarks running on Wasm (e.g. why not DaCapo or the Renaissance suite), it'd be a mistake to draw conclusions, and proposing mechanisms here is just a thought exercise and record of thinking more than anything.

kripken commented 3 years ago

Makes sense @titzer , thanks. Do you perhaps have a reference to a summary of the state of the art in this area? Basically what I have not been able to get is a clear answer to how well we can expect an AOT compiler for Java to actually do. In particular I'd be curious about production AOT compilers, like Graal.

For comparison, I am aware of (edit: production) AOT compilers for C# that at least at one point were considered fairly fast compared to dynamic VMs (il2cpp vs Mono).

I definitely agree a single benchmark is not enough to focus on - I'd be curious to hear more about the benchmarking you were doing @RossTate ?

RossTate commented 3 years ago

You can find our paper here. It has a section on how it uses call tags, though it's small and focuses mostly on how we use call tags to support efficient cross-paradigm interop (which is the focus of the paper). You might find this old paper more interesting, as it introduces the IMT implementation strategy that call tags generalize, and it has a bunch of analysis on performance specifically in the context of (old) Java. (This was the precursor to the JikesRVM.)

Something to be aware of regarding both that paper and the DaCapo benchmark suite is that they're fairly dated. They both use benchmarks predominantly from the Java 1.3/1.4 era, and the Java language/SDK/community has changed significantly since then. For example, you will find a lot more interfaces, more generics (and autoboxing), more invokedynamic, and more lambdas in Java programs nowadays. So nowadays these benchmarks are considered easy.

That said, when we were trying to warn Andreas about the overhead that would likely be caused by things like casts 3.5 years ago, we modified the JikesRVM to insert casts at the start of class method implementations overriding methods from superclasses or implementing interface methods, and we measured the overheads on the DaCapo benchmark suite. The overheads we observed then are in line with the overheads people are observing now, so in this regard the JikesRVM and the DaCapo suite have aged quite well.

As for our benchmarks, you can find the various translations of our benchmarks across the 9 languages we compared here. One thing to note, though, is that for purposes of comparing fairly with other gradual-typing systems (rather than industry systems), we used generics when supported by the language. When comparing with industry systems, this introduces a bunch of complications that are not particularly relevant to the issues at hand, so for #249 I monomorphized the benchmark to use (unboxed) 64-bit integers. The specific benchmark I used in #249 is sort (in GitHub) / intersort (in the paper) because it is the one that specifically stresses interface-method dispatch, which I understand is the feature currently causing the most trouble for us. That said, I agree we should not deduce too much from one benchmark, but we have to start somewhere, and my goal is to get group collaboration going on our current serious issue with performance.

What is clear though, is that Java and other languages have dynamic code generation facilities (i.e. class loading) that make most forms of static whole program optimization impossible. Reflection and metaprogramming happen a lot in practice in Java, not the least of which is due to new kinds of dependency injection frameworks and the like. So we cannot bet everything on static optimization.

I am not trying to say that speculation is useless. But the examples you just gave of where speculation really needs to come into play are all things we're not aiming to support in the MVP to begin with. Furthermore, the old paper I cited above (along with more recent things) gives data on why (even for "easy" benchmarks) we cannot rely too heavily on speculation—many Java programs have lots of megamorphic method invocations. So what I am trying to say (in #249) is that we should first try to get WebAssembly's performance in line with other ahead-of-time compilers, which already outperform JavaScript, before jumping to speculation. What I was trying to say in this thread is that, if we are going to embrace speculation, then we should do so in a user-guided manner, even if engines are free to ignore user hints. Either way, my sense is that we should get to speculation when we get to supporting the Post-MVP features that speculation is particularly useful for.

RossTate commented 3 years ago

For comparison, I am aware of (edit: production) AOT compilers for C# that at least at one point were considered fairly fast compared to dynamic VMs (il2cpp vs Mono).

My understanding is that even C# VMs do not make much use of speculation or closed-world assumptions. The jitting feature is primarily for faster loading, supporting dynamic loading, and supporting dynamic code generation. So the compilation model is not far from WebAssembly's. I believe the major difference is that compilation is done during instantiation, which means it doesn't have to deal with instance objects, and means that the compiler can incorporate the specific values used at instantiation (e.g. inlining imported functions or specializing to particular constants/types).

kripken commented 3 years ago

Thanks @RossTate ! Very interesting, I'll be taking a close look at all that.

Meanwhile I've been looking for data on production AOT compilers for Java, and the best I can find is this 2019 talk on Graal

https://qconsf.com/system/files/presentation-slides/qconsf2019-alina-yurenko-jit-vs-aot-performance-with-graalvm.pdf

Slide 22 has an Apache benchmark which makes the argument that PGO is needed for AOT to match JIT (or else it loses some 20% or so). Overall the talk suggests AOT is best for startup and JIT for peak performance.

RossTate commented 3 years ago

Oh, nice find! I wish I had suggestions to help you on your search.

One more thing I found is that their AOT compiler significantly outperforms HotSpot's JIT for Scala programs, even without profiling. The article it refers to seems to suggest that this is due to HotSpot's JIT being specifically tuned for common patterns in Java that Scala programs do not exhibit as often (despite compiling to the same VM), possibly even hurting performance. Their "enterprise" AOT compiler (without profiling information) outperforms HotSpot's JIT by more than 40%, despite seemingly not being designed specifically for Scala.

My sense from that is that AOT can provide overall good performance in a general-purpose manner, which JIT can improve upon when properly guided. It sounds like that room for improvement is plausibly around 20% (which coincidentally is how much speedup we observed when we fully devirtualized the benchmark discussed above). (This is in a closed-world setting. In an open-world setting, which would be nice to support efficiently in a Post-MVP, there are likely many more opportunities for JIT to improve performance even with less guiding.)

kripken commented 3 years ago

I agree with the summary in the last paragraph. However, even if we agree that the performance loss of AOT is that 20%, I worry that is when comparing well-tuned AOTs. And J2Wasm+Binaryen may have a way to go before they can match the AOT optimizations that something like Graal AOT does.

My specific problem now in Binaryen is inlining. I am having trouble finding good heuristics for what to inline, as everything appears equally worthwhile (as far as I can tell, after many hours on this), and so what really matters is what is actually called at runtime. Advice would be very welcome. Various things I've been trying: code size heuristics, presence of loops, presence of calls, whether something is a constructor or not, etc. I have a long list of other ideas, but no clear direction yet...

If we can't solve inlining in the toolchain then only the VM side could save us with speculation. So I think VM speculation is very important to investigate.

RossTate commented 3 years ago

I am having trouble finding good heuristics for what to inline

Yeah, I do not envy you.

So I think VM speculation is very important to investigate.

Totally fine. This thread was meant to offer a way to facilitate that.

If we can't solve inlining in the toolchain then only the VM side could save us with speculation.

I think this captures my primary concern. Both AOT and JIT have limitations. In particular, neither of them can help if you have a truly megamorphic call site. Figure 7 of the old paper I referenced before found that megamorphic call sites were not uncommon, and in particular were more common in the larger programs. And that's for very old Java programs. With things like the Stream API and the introduction of lambdas, interfaces like Iterator and Function have a bunch of implementing classes in many programs. For example, if a program uses Stream.map (on the same class implementing Stream) with three different lambdas, then that class's implementation of Stream.map now has a megamorphic call site to Function, and one that will be used heavily inside a loop using the corresponding Iterator.

So, regardless of how good the AOT or JIT optimizations are, many Java programs will not perform well if unoptimized interface-method dispatch does not perform well. For our implementation, interface-method dispatch seems to be under 20% slower than direct function calls. I'd be interested to know what that looks like at present for J2wasm. I think the experiment I suggested in https://github.com/WebAssembly/gc/issues/249#issuecomment-942649890 would give a sense of that.

kripken commented 3 years ago

For our implementation, interface-method dispatch seems to be under 20% slower than direct function calls. I'd be interested to know what that looks like at present for J2wasm.

It looks something like:
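
(The sequence below is an illustrative itable-style sketch rather than J2wasm's actual output; every type, field, and signature name, and the interface index, is an assumption made up for the example.)

    (func $call_iface_method (param $obj (ref $Object)) (result i32)
      (call_ref $iface_method_sig
        (local.get $obj)
        (struct.get $IfaceVtable $method          ;; load the method's funcref
          (ref.cast (ref $IfaceVtable)            ;; cast the untyped entry
            (array.get $ItableArray               ;; pick this interface's entry
              (struct.get $Object $itables (local.get $obj))
              (i32.const 3))))))                  ;; made-up interface index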

I don't know how fast each of those is in the VM, but my impression is that this is significantly slower than a direct call, and far slower still than an inlined direct call (as each time one of these gets devirtualized things get noticeably faster).

tlively commented 2 years ago

Closing this because adding speculative optimization hints is out of scope for the MVP, although PRs adding ideas for hints to the post-MVP doc would be welcome.