WebAssembly / function-references

Proposal for Typed Function References
https://webassembly.github.io/function-references/
Other
101 stars 15 forks source link

Benefits of non-nullable references? #40

Closed kripken closed 2 years ago

kripken commented 4 years ago

I understand the benefit of non-nullable references in a source language, but I was curious what we expect the benefits to be in wasm? Is it for speed, or correctness, or something else?

The one minor benefit I can think of is that a wasm=>wasm optimizer could optimize out some ref.is_null checks based on it.

The overview mentions the overall benefits of the proposal, but none of those seem to apply to non-nullable references. I'm probably missing something obvious, though.

titzer commented 4 years ago

Two benefits I can see.

  1. There are some engines where an implicit nullcheck (e.g. in a struct.get) is not free, and costs a branch. Knowing the input is non-null means it can omit the branch.
  2. For those same instructions that have an implicit nullcheck that traps, knowing the input is non-null means that the instruction cannot trap. If we allow catching traps in the future, this means fewer control-flow edges that a compiler has to worry about, generally resulting in better code.
kripken commented 4 years ago

@titzer

About 1, which engines do you mean? It seems like every time you create a struct, you would zero-initialize the i32 zero fields just like we initialize locals and memory - and so then you could also initialize a nullable function reference to a (hidden, internal) "null function" that starts a trap. Then whenever you have a reference, nullable or not, you know you can always call it. Does the idea not work, or not work in some engines?

2 is an interesting point I hadn't considered! Yes, if we allow catching traps in the future it seems like it could help.

skuzmich commented 4 years ago
  1. For GC struct references, engines can protect a few memory pages after nullref address, and avoid generating null-check branches for memory accesses with reasonably small offsets. Should work for function references too.
MaxGraey commented 4 years ago

For GC struct references, engines can protect a few memory pages after nullref address, and avoid generating null-check branches for memory accesses with reasonably small offsets. Should work for function references too.

This usually only happens for a per-page allocation during memory.grow and when using mmap under the hood. Also it's cannot be done on 32-bit platforms. I don't think every nullref address will alloc by mmap and wrap by guarded pages. It would be a terrible waste of resources.

kripken commented 4 years ago

@skuzmich Thanks, yes, I think you're right. So like with linear memory, null checks can be "free" with the signal handler trick, and non-nullable references can't speed up GC references either.

@MaxGraey You'd only need to guard the region around 0 once, and then it helps you with every single check later.

(You're right it can't be done on 32-bit platforms, and places where the signal handler trick can't be used for other reasons - but in wasm we've already agreed basically that the signal handler trick is something we can depend on in most important cases, which is why wasm didn't add a different form of sandboxing like masking.)

skuzmich commented 4 years ago

Non-nullable types could be more efficient for ref.cast from GC proposal, and similar instructions that access struct RTT/header without trapping on null reference.

In addition to the point 2. Even without being able to catch traps in Wasm, Wasm state can be observed after trap in a hosts like JS. Thus, unless they inver nullability, Wasm compilers wound not be able to reorder nullable struct.gets without having to speculate and deoptimize. This can prevent various reordering optimizations, like loop-invarinat code motion.

skuzmich commented 4 years ago

I was wrong about ref.cast above. Adding trapping version of instructions would solve the problem.

rossberg commented 4 years ago

@kripken:

in wasm we've already agreed basically that the signal handler trick is something we can depend on in most important cases, which is why wasm didn't add a different form of sandboxing like masking.

I'm not sure I subscribe to that. Even with bounds checking you cannot rely on signal handling everywhere. 32 bit platforms are still real, so are techniques like pointer compression, or execution environments with no access to signals.

The knowledge that something isn't null also allows certain optimisations, e.g., in instructions following an explicit null check it can elide various cases where null is treated specially but without trapping, or when inlining a function performing an explicit null-check into a call site where the parameter is known to have non-null type. In simple cases, the engine could use flow analysis to derive similar information, but that does not work across function boundaries, except when inlining.

kripken commented 4 years ago

@rossberg

Even with bounds checking you cannot rely on signal handling everywhere. 32 bit platforms are still real, so are techniques like pointer compression, or execution environments with no access to signals.

I agree to all those. I think you might be responding to a point I didn't make, though, so apologies if I wasn't clear enough.

My point is that we knew 32-bit platforms etc. exist, and we considered adding a form of linear memory sandboxing like masking that does work on them and that is cheaper than explicit bounds checks. But we decided that the signal handler trick was possible in enough places, and the overhead of explicit bounds checks small enough, that it wasn't worth it. It seems like the same thinking could apply to GC, unless we expect 32-bit platforms to be more important there, or the overhead larger, for some reason?

The knowledge that something isn't null also allows certain optimisations

I agree with this too. It's related to @skuzmich 's point that

unless they inver nullability, Wasm compilers wound not be able to reorder nullable struct.gets

I think this is definitely worth measuring to see how important it is. As before, something similar happens with linear memory checks, where potentially trapping things like memory accesses and integer divides etc. cannot be reordered. I'm not aware of data showing that that is a significant issue, but it may be, and it may be different in GC.

Note btw that the binaryen optimizer has an ignoreImplicitTraps flag which does let it reorder memory operations and integer divides and other things that might trap. We don't have data showing that is useful for speed, but it can help code size in some cases by a small amount. I expect we will experiment with expanding that flag to GC somehow.

RossTate commented 4 years ago

To add another dimension of concern, I'm worried about how much non-null reference types will be generated. Many languages do not reason about nullness. Many languages that do still use null for other reasons. One simple example is using null for nullary constructors of algebraic data types (e.g. nil or None). But a more widespread example is initialization. For example, Kotlin is a mostly null-safe language but punts on nullness in a few spots. One in particular is initialization. When designing the language, we discussed ways to ensure null safety even during initialization and we found that it would be extremely cumbersome in order to support common patterns (particularly due to encapsulation and separate compilation) and with little performance benefit. So Kotlin's non-null types can actually be null during initialization. People trying to generate WebAssembly will likely have similar problems. Without advanced features like typestate, the common pattern of allocating a blank instance and then letting each function/module initialize the appropriate component won't type-check. This is even problematic for initializing runtime data structures. For example, a v-table is often initialized by passing the v-table to each function/module defining relevant superclasses, starting with the base class and moving down the hierarchy. Type-checking initialization such that the burden can be split across functions/modules is not easy.

skuzmich commented 4 years ago

@kripken we have more freedom optimizing non-nullable struct.get compared to knowing that loads in the program never trap, assuming that is what ignoreImplicitTraps assumes. For example, we can unconditionally execute conditional struct.get.

jakobkummerow commented 4 years ago

V8 has no plans to use signal handler tricks for null checks on any platform. (Reason: our null reference is a sentinel object somewhere on the heap, not a zero-valued pointer like C++'s nullptr, and I don't see that changing.) So any non-null-permitting operation (struct.get, ref.cast, ...) on a possibly-null value will emit an explicit null check (comparison + conditional branch, depending on circumstances possibly also materialization of the null sentinel constant to compare with).

I haven't measured what the performance impact of these checks is.

To me, the interesting (and hard-to-answer) question is how much guaranteed non-null-ness optimizing compilers will be able to infer by themselves (it's easy to come up with scenarios where it's trivial, as well as others where an engine can't possibly prove it, and in between there's the possible-but-difficult area), and how long it will take for engines to actually implement such optimizations. It is certainly much easier when the engine only has to look at the static type of the value in order to decide whether a null check is needed (which is what V8 currently does).

kripken commented 4 years ago

@jakobkummerow

our null reference is a sentinel object somewhere on the heap, not a zero-valued pointer like C++'s nullptr

Could that sentinel object be in its own OS page, and that page protected in the same way that linear memory is?

skuzmich commented 4 years ago

@RossTate I wouldn't worry about Kotlin. I think most of your concerns regarding initialization nullability is based on JVM implementation, which implies separate compilation and a bytecode format without sound type nullability. Many language features (like when-expressions, primary contractor properties, etc.) encourage immediate variable initialization: ~95% of Kotlin/JS standard library local variables are immediately initialized. Since separate compilation is not very viable in Wasm for many reasons, we can generate non-nullable Wasm types for most immediately-initialized non-nullable fields in a single whole-program compilation. Even if I'm wrong and we wouldn't be able to generate a lot of non-nullable types on average, I would argue, it is still important to give programmers a way to micro-optimize their perf-critical code by choosing the "efficient" initialization scheme. I will share concrete numbers once non-nullable type generation is done.

@jakobkummerow

V8 has no plans to use signal handler tricks for null checks on any platform. (Reason: our null reference is a sentinel object somewhere on the heap, not a zero-valued pointer like C++'s nullptr, and I don't see that changing.)

Is it because of interop with nulls from other subsystem, like JS?

To me, the interesting (and hard-to-answer) question is how much guaranteed non-null-ness optimizing compilers will be able to infer by themselves

A good solution would probably look like an inter-procedural analysis (and maybe a higher-level language type information), which would be more viable in AOT optimizers rather than Wasm JITs.

There are some optimizations, that are not representable in Wasm types, like multiple accessing the same object, where one access dominate others. Generating local as_non_null cast after first access can make things worse for non-optimizing wasm engine with signal handler trick, but it might be better for current V8 in some cases.

jakobkummerow commented 4 years ago

Could that sentinel object be in its own OS page, and that page protected in the same way that linear memory is?

I considered that idea before saying "I don't see that happening" :-) With actual 64-bit pointers it might be feasible. The main concern there is that we wouldn't want the thing to require special-casing all over the place (GC marking phase, write barrier, ...), so it must "look like a regular heap object" including a shape descriptor and a proper page header and whatnot. With pointer compression, however, I'm pretty sure it's just outright impossible.

Also, V8 implementation aside, IIUC memory protection tricks can only handle .set/.get, not ref.cast (which according to previous discussions is going to be a very frequent operation).

Is it because of interop with nulls from other subsystem, like JS?

"It's the null we already have" is certainly one reason; but due to the considerations above I'm also pretty sure we won't introduce a special "Wasm null" with a different implementation. (We may or may not introduce a special "Wasm null", but it would be implemented the same way as the existing JS null.)

Generating local as_non_null cast after first access can make things worse for non-optimizing wasm engine with signal handler trick, but it might be better for current V8 in some cases.

Agreed, our current prototype would achieve slightly better performance in some cases if you did that, but I would generally recommend that we first build things (in both compilers and engines) as simple as we can, then profile real-world use cases, and then decide where we need more complexity in order to improve things. Chances are that the situations where an X-to-Wasm compiler can easily emit such local optimizations are the same cases where a Wasm engine can fairly easily make the inference; so if/when we decide that this is worth improving, we should coordinate on the "how".

RossTate commented 4 years ago

Thanks for the useful information, @jakobkummerow! It's very interesting because it permits a bunch of flexibility. In particular, you can get from and set to fields (up to some offset) of a null reference, so long as you check that the reference is null before you take any observable actions based on what you got. Plus branch hinting/prediction is pretty effective. I imagine y'all have thoroughly explored the tradeoffs for JS, and given that JS is not a null-safe language or particularly easy to reason about, I would expect that it would have been very well positioned to benefit from trapping versus testing. So maybe that's a sign that testing is in fact not particularly expensive.

@skuzmich To clarify my perspective, I think there is potential value for non-null references in WebAssembly and that we should leave the door open for such an extension, but at the moment we have bigger concerns to address. For example, I am very concerned about your statement that WebAssembly is not currently viable for separate compilation. It will be easier to make the GC MVP support separate compilation if the MVP can—for now—assume all value types are defaultable. Non-null reference types are currently the only value type in any proposal that would break that simplifying assumption, and their advantages don't seem to merit the complexity—for now. So, for me, this issue is a matter of prioritization.

titzer commented 4 years ago

I think it is very likely that other Wasm engines will implement null checks with trap handling because they are so common. Wizard isn't at the stage yet where that optimization matters (not JIT yet), but it already does effectively do this, using an implementation-language null to represent a Wasm null, though that is packed into an ADT for the interpreter.

skuzmich commented 4 years ago

@RossTate thank you, I agree that simplicity is better than a small performance benefit for now. But some people here have concerns that performance benefits might actually be significant. Because at the end of the day we are trying to compete with JS as a compilation target, and performance is the biggest competitive advantage we can have. We need more measurements :)

Also, this probably deserves a separate discussion, but separate compilation is a very hard issue to solve:

  1. "Symbolic" struct field linking close to source language -- the thing we discussed the most -- easy to solve.
  2. Compiler optimizations -- for many languages separate compilation relies on JIT that can optimize across compilation units. Currently Wasm engines expect preoptimized code. Adding a bunch of good speculative optimization to all engines to be competitive with JS or JVM is a very hard task.
  3. Code size -- separate compilation has more unused stuff, that can be eliminated by DCE. Even in JS where 1. and 2. are solved issues, gold standard is to still use some bundler like webpack. I don't see how we can solve this issue with Wasm.
skuzmich commented 3 years ago

The main concern there is that we wouldn't want the thing to require special-casing all over the place (GC marking phase, write barrier, ...), so it must "look like a regular heap object" including a shape descriptor and a proper page header and whatnot.

@jakobkummerow I'm not familiar with the implementation, but would it be possible to allocate "null" object at the end of the page and protect the next page, so that it would have a proper accessible header, but user-level field offsets would be protected, enabling trap trick?

kripken commented 3 years ago

Thanks for all the comments so far, everyone!

Overall it seems like there may be a performance benefit here. We will just need to measure it to be sure. And it doesn't sound like there are any non-performance benefits.

If we don't get clear numbers on this being a useful speedup, then as @RossTate said, it may make sense to defer non-nullability for later (as long as we don't do anything to prevent it from happening later). I think that may have value since non-nullability does add some complexity here (which surprised me), in particular the new let is a new structural construct with some new features. But we can see what the performance numbers are first.

manoskouk commented 3 years ago

Hi all, To put some concrete numbers on the potential performance benefits of non-nullable types, I ran some experiments on our V8 prototype. Our prototype already optimizes away null checks if the static type of the object is known to be non-nullable.

I created a loop which repeatedly applies an operation requiring a null check on a local value. I measured the running time of the loop when the value was nullable vs. non-nullable. To isolate the effect of the operation from the loop jump and counter increment, I also ran an empty loop (which just increments a counter).

Note that, since the operation was always run on the same non-null object, the branch prediction should have hit every time. The results might be different otherwise.

Unfortunately, the benchmark is quite flaky, but these are the trends which I observed: For ref_test and ref_cast the normalized speedup (subtracting the running time of the empty loop) was about 13-19%. For struct_get and array_len it was about 33%. For struct_set about 50%. For array_get (with a constant index) about 67%. Of course, speedups were lower when the total time of the loop used instead, and ranged between 12-50%.

These numbers are of course specific to the situation of a tight loop with a specific payload, and are not likely to be measured in an actual program. However, I think they do show that non-nullability is an effective feature.

RossTate commented 3 years ago

This is not to detract from the discussions here and about let, but I am still concerned for the languages that were designed with the premise that null accesses would fault rather than have to be branched on, so I wanted to prompt some more discussion on that issue.

@skuzmich made a clever suggestion in https://github.com/WebAssembly/function-references/issues/40#issuecomment-733868920. I reread the blog on pointer compression, and I think it should be compatible with that. I suspect the GC already sets up some space for permanent objects created during initialization, so my guess is that you could implement this by putting the null value at the very end of that space and have a protected page immediately after it (or otherwise at the end of "old" object space). @jakobkummerow, what are your thoughts?

jakobkummerow commented 3 years ago

It might be possible to create such a null, yes. I'm still somewhat skeptical about it:

Also, in general, I wouldn't want the specification to rely on engines implementing one particular optimization technique. It's good to know that certain patterns can be optimized through advanced hackery, but it's also good to assume/allow implementation flexibility: what's viable in one engine today might not be viable in another engine, or in a future with different constraints and/or fundamental design choices, or on other platforms (e.g. 32-bit).

RossTate commented 3 years ago

Sounds good! I think that the current spec, because it allows struct.get and such on null references, does not force any one implementation strategy. And I think that generators of languages with frequent null references would have no need to change how they generate wasm code depending on whether they're targeting a faulting or branching implementation of null checks (provided streaming compilers for branching engines do basic null-pointer reasoning so that successive field accesses do not incur repeated checks, which I hear is the case). And having non-nullable types for languages with infrequent null references means they are not unnecessarily forced to rely on having a faulting implementation.

So, altogether, my sense is that the spec should not be changed assuming a faulting implementation of null, but that at the same time it would be good to explore such an implementation because languages like Java seem to stand to benefit from it regardless of the spec.

lars-t-hansen commented 3 years ago

The idea is fine (and not new I think; ISTR Lisp implementations have used this for nil in the past) but it is not performance-neutral. Accesses will be faster, but tests will be slower. As the bit pattern for the null value is now something specific rather than all-zeroes, the null value has to be loaded (from Tls, say) for anything that compares with null. Frequently the Tls pointer is in a register, so this is often just a single load, and it can be commoned if the value is used frequently, but then it occupies a register. In contrast, compare-with-zero-and-branch is fast and compact on most systems. Additionally, as Jakob says, the JS null will be a different value; this means significant complexity at the JS/wasm boundary. This worries me more, actually.

(Finally, the wasm null must be the same everywhere which means managing a shared resource somehow, but this seems like small potatoes.)

rossberg commented 2 years ago

Consensus is to include non-null types, so closing this.