WebAssembly / gc

Branch of the spec repo scoped to discussion of GC integration in WebAssembly
https://webassembly.github.io/gc/

Can the rt2 <: rt1 constraint on br_on_cast* instructions be relaxed? #381

Closed tlively closed 1 year ago

tlively commented 1 year ago

Implementing the latest versions of the br_on_cast instructions in Binaryen, I realized that the rt2 <: rt1 constraint is slightly painful for us to accommodate. Binaryen discards the input type annotation after parsing and reconstructs it from the type of the operand when printing or emitting a module. Before the module is emitted, the optimizer may have been able to refine the type of the operand, which would result in a more refined input type annotation being emitted. However, the operand type may have been refined such that it is no longer a supertype of the cast target type. Obviously this means the cast must fail and can be optimized out, but it would be unfortunate if Binaryen were forced to do that optimization just to emit a valid module.
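
To make the failure mode concrete, here is a minimal sketch, assuming a hypothetical hierarchy with sibling struct types $B and $C under a common supertype $A (none of these names come from the proposal itself):

  (type $A (sub (struct)))
  (type $B (sub $A (struct (field i32))))
  (type $C (sub $A (struct (field f64))))

  ;; Original code casts an $A down to $B:
  ;;   br_on_cast $l (ref null $A) (ref $B)   ;; valid: (ref $B) <: (ref null $A)
  ;;
  ;; After optimization the operand is known to be a (ref $C). Reconstructing the
  ;; input annotation from the operand type would yield
  ;;   br_on_cast $l (ref $C) (ref $B)        ;; invalid: (ref $B) is not <: (ref $C)
  ;; even though the instruction is harmless (the cast simply never succeeds).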

We could work around this by treating the input type of a br_on_cast instruction as the LUB of the actual input type and the cast target type, but that would lose precision in the output type, which could adversely affect other optimizations.

It would be easier, more efficient, and better for optimizations to simply drop the requirement that rt2 <: rt1 instead. AFAICT, the only downside of doing this is that we would need to replace it with a weaker constraint that rt1 and rt2 are in the same hierarchy (i.e. have a common supertype), which is slightly more complex to express.

@rossberg, WDYT?

Liedtke commented 1 year ago

To provide a second data point: For the V8 implementation the subtype restriction was easy to implement (it's still an additional check) but for our fuzzer which has to generate correct type immediates, it was surprisingly painful.

rossberg commented 1 year ago

It's possible. The main technical downside is that it is likely to complicate the computation of type differences in the future, e.g., if we are to add union types in some form or other refinements to the type hierarchy. Then that would turn into a lub/glb-like operation in the engine, which worries me slightly.

(An additional, social downside might be that it can cause more confusion about the role of casts -- some people already seemed to believe that they need to insert upcasts, too, which would then in fact be legal but pointless. But I don't think we ought to worry about that.)

tlively commented 1 year ago

Can you say more about how this might complicate things in the future? An example would help.

I agree we shouldn't worry too much about folks getting confused about when casts are necessary. Binaryen will optimize out obviously unnecessary casts and I expect folks will figure it out pretty fast when experimenting with code gen.

tlively commented 1 year ago

ping @rossberg. For now, I'll put this on the agenda for tomorrow's subgroup meeting.

rossberg commented 1 year ago

Imagine we have a union type A = C ∪ D and cast it to some B, so that we need to find A-B. With the subtype constraint, we know that B must be a subtype of either C or D (or both). To compute the difference, I expect we can just check subtyping for each individually and either keep that part of the union or remove it. Without the constraint, there may be cases where B only has a common subtype with e.g. C. We'd still need to remove it then. But checking whether there is a common subtype is no easier than computing that common subtype via the glb.

Another point that occurred to me and is more immediately relevant is that relaxing this constraint is making type information less precise, in ways that could be practically relevant. Types that are closer together generally require shorter code sequences than others (a consequence of our monolithic cast instruction). I assume that an engine wants to produce the shortest possible code. To do that, it will first need to find the smallest type from which to cast down -- in the absence of a subtype constraint, that is effectively the lub between source and target type. That is, the engine now has to compute the same lub that we just saved the consumer from computing. So in practice, this looks like it's shifting work from producer to consumer, which is the wrong direction. Plus, it introduces a form of lub computation through the backdoor.

conrad-watt commented 1 year ago

Just fyi I may not be able to make the meeting tomorrow, but I'd also be a little concerned about this change. The subtype constraint is the naively "correct" one, and relaxing it to something less regular as a snap judgement might come back to bite us later.

Could binaryen preserve the input type annotation in the IR? Presumably any refinements to the actual input type wouldn't affect the well-typedness of the instruction - the input annotation could be optionally refined, but if the aim is "minimum engineering time to emit valid code", emitting the original annotation would be sufficient.

EDIT: unless this would have implications for refinement of the type along the "cast fail" path...

tlively commented 1 year ago

Sorry @rossberg, I still don't understand your examples.

Without the constraint, there may be cases where B only has a common subtype with e.g. C. We'd still need to remove it then.

Here's the case analysis as far as I can figure it out for the unconstrained case in the presence of union heap types:

  1. B is a strict subtype of C or D and unrelated to the other. Without loss of generality, assume B <: C. In this case, failure to cast to B does not rule out either C or D, so the type of the failure arm is A (i.e. C ∪ D).

  2. B is a supertype of C or D or both. Without loss of generality, assume B :> C. If the value is a C, the cast definitely succeeds, so we can remove C from the type of the failure arm, leaving only D. Each type in the union can be independently considered for exclusion in the same way. In fact, if we relax the subtyping constraint in the MVP, we could already express this in the definition of type difference and have the failure arm return bottom when casting to a supertype.

  3. B is otherwise related to both C and D. If B is a subtype of one and a supertype of the other, then without loss of generality, assume C <: B <: D. Then C ∪ D = D, so I assume this union type would be either invalid or an alias for D, in which case this degenerates to the MVP case. Otherwise, B must be a subtype of both C and D, in which case this similarly degenerates to the MVP case, assuming we don't allow multiple supertype definitions because that would break principal typing.

  4. B is unrelated to both C and D. The cast cannot succeed, so the success arm can be typed with bottom and the failure arm can be typed with A (i.e. C ∪ D).

Taken altogether, a more powerful definition of type difference for the MVP would be:

(ref null ht1) \ (ref null ht2) = (ref ht1 \ ht2)
(ref null1? ht1) \ (ref ht2) = (ref null1? ht1 \ ht2)
ht1 \ ht2 = ⊥, if ht1 <: ht2
ht1 \ ht2 = ht1, otherwise

Note that this definition is more precise than the current definition for ht1 = ht2, so it is an improvement even if we continue requiring ht1 :> ht2. Also note that refining the success arm to ⊥ when ht1 and ht2 are unrelated requires a larger refactoring of the typing rule, because the success arm does not use type difference. Assuming the subtype check is constant time, this more powerful MVP type difference would be constant time.

Extended in the presence of union heap types, heap type difference would be:

(ht1_1 ∪ ... ∪ ht1_n) \ (ht2_1 ∪ ... ∪ ht2_m)
  = (ht1_1 \ ht2_1) ∪ ... ∪ (ht1_1 \ ht2_m)
  ∪ ...
  ∪ (ht1_n \ ht2_1) ∪ ... ∪ (ht1_n \ ht2_m)

This would take O(nm) time, which seems as good as you could hope for with union types.

But checking whether there is a common subtype is no easier than computing that common subtype via the glb

But glb(A, B) can be computed in constant time, since it is either A if A <: B, or B if B <: A, or ⊥ otherwise. Again, this is assuming we don't have multiple supertypes because that breaks principal typing.

Types that are closer together generally require shorter code sequences than others (a consequence of our monolithic cast instruction)

Under our common assumptions about constant time cast implementations so far, I don't see how this is true.

@conrad-watt,

Your edit is correct that maintaining the unrefined input type would make the output type less precise. The goal is not just to save engineering effort, but to save engineering effort while still using the most precise type information possible for optimizations.

rossberg commented 1 year ago

Yeah, sorry, I screwed up the example. However, note that under the current subtype constraint, the only possible version of your cases (2) and (3) is that B = C, which is a simpler condition. So right now, the difference function would only need to search for equal types in a union, not perform any subtyping check.

Note that this definition is more precise than the current definition for ht1 = ht2, so it is an improvement even if we continue requiring ht1 :> ht2.

Ah, interesting observation. I guess we could implement this refinement. Though it’s probably not worth it, because this case only arises from pointless casts.

But glb(A, B) can be computed in constant time, since it is either A if A <: B or B if B <: A or ⊥ otherwise. Again, this is assuming we don't have multiple supertypes because that breaks principal typing.

I think you are extrapolating too much from the current state of affairs. For example, generics change the picture, because subtype checks become linear in the number of their parameters. And we may have multiple supertypes in the future, if they can be restricted in some ways to avoid these problems.

I don’t want to make predictions, but it is dangerous to assume that everything remains as simple as it is now.

Under our common assumptions about constant time cast implementations so far, I don't see how this is true.

Even constant time does not imply equal cost. For example:

where “cheaper” means both less code and fewer instructions executed.

This is just for what we have now. With more features there will likely be more differences, and some cases may not even be constant time anymore, see above. I assume engines still want to find the shortest code sequence in each case.

tlively commented 1 year ago

With regards to the backdoor LUB calculation you were concerned about, calculating a precise LUB is unnecessary. It suffices to find the least abstract upper bound, which is much simpler. This is because increasing the precision of a concrete upper bound does not make the cast any simpler. (Again, under current assumptions about how casts are implemented)

In any future extension, the only possible relations between the input and output type are strict subtype, strict supertype, equal, or unrelated. It seems implausible that evaluating this relation will be prohibitively expensive, given how frequently validation needs to check subtyping already, especially when the evaluation boils down to checking just ht1 <: ht2 and ht2 <: ht1.

conrad-watt commented 1 year ago

Your edit is correct that maintaining the unrefined input type would make the output type less precise. The goal is not just to save engineering effort, but to save engineering effort while still using the most precise type information possible for optimizations.

Could Binaryen track the original input type annotation, but only emit it if the refined input type falls outside rt2 <: rt1?

i.e. use the refined input type wherever possible, otherwise default back to the original annotation?

EDIT: if the type refinement happens in stages, causing the "cast fail" path to be repeatedly refined, the "tracked" input type annotation could also be updated to record the last safe refinement.

rossberg commented 1 year ago

If an engine has to code-gen for (ref.cast t1 t2), then it knows that the input has type t1. So the actual code sequence it wants to generate a check for is whatever aspect of t2 is not subsumed by t1. In general, that is exactly what (ref.cast lub(t1,t2) t2) would check for, so the jit will implicitly need to compute that or some approximation. In the current system, a cheaper approximation that is restricted to an abstract type may suffice, but that is unlikely to hold up under interesting extensions like generics or unions, so it would seem unwise to build it in as a design assumption.

We have gone to lengths to avoid lubs/glbs in the past, because of the potential cost. This change does not seem important enough to turn around on that strategy. Especially since it's just work that the producer could already do offline. If that work is easy, then it should be easy for the producer as well. If not, even more reason the producer does it.

tlively commented 1 year ago

At least the LUB calculation would be an optional optimization. Baseline compilers certainly wouldn't be forced to implement it if the cost isn't justified, and optimizing tiers already do much more complex calculations.

To comply with the subtype constraint when emitting modules, binaryen would have to:

  1. Take the LUB of the refined input type and the output type and make that the new input type
  2. Either:
    • Insert another cast down to the original, refined input type on the failure arm
    • or drop the result on the failure arm and use a scratch local and possibly an additional non-null cast to propagate the original value instead

Without the second step, we might emit an invalid module because we might have optimized the failure arm to expect the more refined type.
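
As a hedged sketch of what that fixup looks like, reusing the hypothetical $A/$B/$C types from the earlier example (so lub($C, $B) = $A):

  ;; Step 1: widen the input annotation from the refined (ref $C) to the lub:
  ;;   br_on_cast $l (ref $A) (ref $B)
  ;;
  ;; Step 2: the failure arm now only sees a (ref $A), so code that was already
  ;; optimized to expect a (ref $C) needs an extra cast inserted:
  ;;   ref.cast (ref $C)   ;; added solely to satisfy validation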

Is maintaining the subtype constraint really worth forcing us to insert extra casts? I would hope not.

conrad-watt commented 1 year ago

@tlively does the above suggestion I sketched, where Binaryen associates an input annotation to each cast (possibly progressively updated), not work?

In your sketch above, Binaryen has already propagated the refined input type through the whole failure arm, requiring post-hoc insertion of downcasts, but surely there's some mechanism to prevent this propagation in the first place if it's the result of an ill-typed (in the current setting) input to the downcast operation - either by tracking the last "good" refinement as I suggested above, or through some other approach.

I don't think that our first solution for this kind of implementation speed bump in Binaryen should be a core spec change (that has to be propagated through other engines, tests etc), especially one that potentially has long-term ramifications for our object type system. We can always relax our type system later, but we can't un-relax it.

rossberg commented 1 year ago

To be honest, I am a bit puzzled that, assuming the input is well-typed, optimising it can either lead to ill-typed casts or require the introduction of additional casts. If that's the case, then there seems to be a deeper issue with some of the transformations Binaryen is applying. Wouldn't it imply that these optimisations aren't type-preserving, i.e., unsound in general? Can you give an example of how that could happen?

When targeting a typed language it is a common hazard to throw away too much type information (e.g., the OCaml Wasm backend is dealing with similar problems). It sounds like Binaryen is trying to avoid keeping around the second type annotation we now have, perhaps that's part of the issue?

conrad-watt commented 1 year ago

@rossberg I think @tlively's scenario is something like this:

A is a supertype of both B and C, but B and C have no subtyping relation to each other

The input code contains a downcast from A to B

After performing some optimisations, Binaryen refines some types and realises that the input to this downcast is really of type C

The trick is that Binaryen doesn't track the original input annotation to the downcast, and assumes in general that propagating refined types through all paths is "safe". So when it emits the actual downcast instruction, it reconstitutes the input type annotation according to what it "knows" is on the stack (i.e. C). In addition, the failure case of the cast may have also been refined to only type check if a C is kept on the stack - so by this point it's too late to just pick a less precise input type annotation.

As I sketched above, I think (in our current setting) propagating the refined type through to the failure arm of a cast is only actually safe if the refined type would not cause the cast instruction to be ill-typed.

rossberg commented 1 year ago

Right, but had it kept A around there wouldn't be a problem, right? It would always remain correct to cast from there. It could try to refine that type, too, but with A still around, it would have enough information to refine it in a valid way.

tlively commented 1 year ago

Yes, that would keep everything correct, but it would lose type information on the failure arm of the branch, which would be typed with A, while we statically know that it's actually a C. With the relaxed typing rule, we can annotate the input with C instead of A and therefore get the more precise output type on the failure arm as well.

We discussed this in the subgroup meeting this morning and realized that ref.cast already has to do the nontrivial codegen calculations that @rossberg was trying to avoid br_on_cast having to do. In particular, since ref.cast only has an output type annotation, producing an optimal cast implementation already requires taking (an over-approximation of) the LUB of the input and output types. For parity with ref.cast, we agreed that we could relax the subtyping constraint on br_on_cast.

We also agreed that when the type system becomes more complex and the computations required for optimal cast codegen become more complex, it will make sense at that point to add variants of the cast instructions with additional annotations (e.g. giving an upper bound of the input and output) to help the engine produce optimal code with less work.

We did not discuss whether we wanted to make any refinements to the computation of output types on either the failure or success arms of the branch as I described in https://github.com/WebAssembly/gc/issues/381#issuecomment-1588177954. I do think it would be a good idea to do so, since more precise type information always helps produce better code. Of course, in the future we would not want to make the computation so powerful that it becomes too expensive, but that's not a problem for the MVP.

bashor commented 1 year ago

Because of this restriction, we can't use types from different hierarchies (internal, external, functions), unlike ref.cast.

E.g., we can't use br_on_cast_fail to cast externref to some specific struct type, we have to internalize it before.
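
For reference, a minimal sketch of that internalization step, assuming a hypothetical struct type $my_struct and a local $ext (the instruction is spelled any.convert_extern in the final proposal; earlier drafts called it extern.internalize):

  local.get $ext              ;; externref
  any.convert_extern          ;; anyref: bring the value into the internal hierarchy
  ref.cast (ref $my_struct)   ;; now a cast within the internal hierarchy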

Liedtke commented 1 year ago

@bashor Casting between type hierarchies should not be supported by ref.cast either. The only difference should be that casting between unrelated types of the same type hierarchy is valid, e.g. ref.cast null array (ref.null struct).

rossberg commented 1 year ago

@bashor, casting between different hierarchies is not supported anywhere. Allowing it would kill every advantage of having separate hierarchies in the first place, in terms of enabling incompatible runtime representations for their references. Keep in mind that casts merely perform subtyping checks, not representation changes.

bashor commented 1 year ago

@Liedtke @rossberg Yeah, sounds reasonable, thanks!

rossberg commented 1 year ago

While starting to update the spec for this, I realise I still have questions about it.

With the relaxed typing, there will be three disjoint cases of what a br_on_cast could mean:

  1. rt1 > rt2. Then it actually is a downcast, the intended use case.
  2. rt1 ≤ rt2. Then it is a redundant upcast, and therefore just an obfuscated way of doing br, but with a lot of dead code added (extra blocks and such).
  3. Neither. Then the cast cannot possibly succeed, and it’s just an obfuscated way of doing nop, with a lot of dead code added (extra blocks and such).
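
For concreteness, a hedged sketch of the two degenerate shapes, again assuming hypothetical sibling struct types $B and $C under a common supertype $A:

  ;; case (2), rt1 ≤ rt2: the cast always succeeds, so this always branches
  ;;   br_on_cast $l (ref $B) (ref null $A)   ;; an obfuscated br $l
  ;; case (3), neither: a non-nullable target unrelated to the input never
  ;; matches, so this never branches and just passes the value through
  ;;   br_on_cast $l (ref $B) (ref $C)        ;; an obfuscated nop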

I believe it is fair to assume that no serious codegen or optimiser would ever want to generate cases (2) and (3). They are practically useless in real code. Even Binaryen only wants them internally IIUC.

My question then is: Couldn’t Binaryen just relax its internal typing rules as per this change? Why make it a feature of the external language that (a) no producer would want to use, but (b) every consumer would have to deal with?

Instructions should exist to perform specific operations, i.e., provide runtime behaviour. Casts are already the most complex operators in all of Wasm, even when they are limited to case (1) above. This change is sweepingly broadening their semantics even further, without adding any useful runtime behaviour at all. Ultimately, it is complicating the language to optimise for a producer shortcut that every consumer would then have to accommodate.

All this makes me rather uncomfortable. Am I missing some compelling reason that trumps all of the above?

(FWIW, ref.test/cast has been brought up, where these cases are already possible. But IMO there is an important difference: for those, the extra cases only arise as a curious side effect of subsumption, the semantics does not recognise them. In contrast, with the change to br_on_cast, we would be exposing the useless extra semantics explicitly, as an intentional feature, which has a different quality.)

Liedtke commented 1 year ago

@rossberg Just to clarify: Is a br_on_cast valid if source type immediate == target type immediate? From rt2 <: rt1 and

Type indices are subtypes if they either define equivalent types or a suitable (direct or indirect) subtype relation has been declared

I assumed that it is valid, since equivalent types are considered to be subtypes of each other, if I understand it correctly. So at least right now V8 explicitly allows br_on_cast for identical types, which means that depending on the fail check either the branch target or the fall-through is dead code.

I guess one could change the definition to rt2 <: rt1 and not rt1 <: rt2 (?), but I'm not convinced that this is a simplification. Also, the same is already true for ref.cast and ref.test, where our decoder specifically handles always-succeeding and always-failing checks to generate less / better code. Due to implicit upcasts on ref.cast and ref.test and explicit immediates on br_on_cast, this is already somewhat inconsistent, and by relaxing the constraints on the immediates these instructions become more similar to each other. (A ref.cast could be seen as a br_on_cast_fail with an engine-defined branch target, so these instructions are semantically close to each other.)
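
A hedged sketch of that correspondence, assuming a hypothetical source type $src, target type $t, and a local $x holding the operand:

  (block $ok (result (ref $t))
    (block $fail (result (ref $src))
      (br_on_cast_fail $fail (ref $src) (ref $t)
        (local.get $x))
      (br $ok))          ;; cast succeeded, value has type (ref $t)
    (unreachable))       ;; cast failed; ref.cast would trap here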

tlively commented 1 year ago

Couldn’t Binaryen just relax its internal typing rules as per this change?

Conrad gave a good explanation in https://github.com/WebAssembly/gc/issues/381#issuecomment-1589598628. In particular, this is the troublesome part:

In addition, the failure case of the cast may have also been refined to only type check if a C is kept on the stack - so by this point it's too late to just pick a less precise input type annotation.

So if we just make this change in Binaryen and not the spec, it would not suffice to just fix up the type annotations in the binary writer.

We also could not just change what branch instruction we emit to fix it up, either. In the case of br_on_cast (ref null $A) (ref null $B), where $A and $B are unrelated, the closest we could get would be to use a br_on_null and add a ref.null at the target, but that would involve changing the target block type, which is infeasible both because we already would have emitted the block header at that point and because there may be other branches to the same block that would be invalidated if we changed its type.

If we don't make this relaxation, I still don't see any way to handle that besides losing type precision and optimization power in Binaryen.


Instructions should exist to perform specific operations, i.e., provide runtime behaviour.

I agree, but in this case that is in tension with another goal: instruction types should avoid unnecessary loss of type precision. Yes, this only affects situations that reasonable producers would strive to avoid, but in general we try not to disallow or pessimize unreasonable situations, e.g. unreachable code or uninhabitable types.

Why make it a feature of the external language that (a) no producer would want to use, but (b) every consumer would have to deal with?

For the same reason we allow unreachable code: it's better in the end to treat useful and useless things as uniformly as possible.


In contrast, with the change to br_on_cast, we would be exposing the useless extra semantics explicitly, as an intentional feature, which has a different quality.

But the semantics for br_on_cast can ignore the input type annotation completely, just like the semantics for ref.cast can ignore the precise static type of its input completely. The input type annotation exists only to simplify the calculation of the output type and support our principal types design goal, not because it is semantically meaningful. If we didn't want to enforce principal typing, we wouldn't have that input type annotation, and we would not be having this conversation because br_on_cast would work exactly like ref.cast.

rossberg commented 1 year ago

@tlively, my understanding was that Binaryen will want to eliminate cases (2) and (3) anyway, i.e., never generate them in the actual output for any engine to see. If that is the case, then I'd think this change isn't needed in the language proper.

If, on the other hand, Binaryen does output such redundant casts, then we are basically shifting the burden of differentiating these cases and picking the right operational behaviour from producer to consumer. That seems to be in contrast to the principles we usually apply for Wasm. So that shouldn't be a good motivation either.

Your point regarding null is an interesting one, though, which I hadn't thought of – it implies even more complexity than my simple case analysis suggests. But I believe you should be able to fix that case up locally by replacing br_on_cast with (ref.null $B) (arg) (br_on_null $l) (drop), i.e., turn the null needed at the destination into a regular branch argument.

If we didn't want to enforce principal typing, we wouldn't have that input type annotation, and we would not be having this conversation because br_on_cast would work exactly like ref.cast.

Fair point, but it also wouldn't have come up if we hadn't taken a wrong turn with casts. ;) My concern is that by depending on this change we're digging ourselves an even deeper hole. I don't see how Binaryen's strategy can possibly work with additions like generics.

@Liedtke, I agree, though I don't think engines should spend time on optimising ref.test/cast in the way you describe. According to Wasm's design goals, producers can and should do such optimisations themselves. And similarly for br_on_cast, which is why I am concerned that this change is creating the wrong incentives. In retrospect, I would prefer all current casts to have two annotations.

tlively commented 1 year ago

@rossberg, nice, that sequence with br_on_null looks like it would work. Unfortunately, it will require more engineering effort in Binaryen than you would expect due to the br_on_null consuming multiple arguments, but that's probably not a huge problem.

The motivation for relaxing this constraint was that it was hard to work with in Binaryen. Since we've potentially found a solution for that, I'd be happy to un-relax the constraint once we can validate the solution in Binaryen as long as no one else objects.

Fair point, but it also wouldn't have come up if we hadn't taken a wrong turn with casts.

Restricting these cast instructions to take non-abstract input types as they did before the change you're referring to would not have affected this discussion at all.

rossberg commented 1 year ago

@tlively, cool, sounds good.

FWIW, I believe the above substitution works even with additional arguments; they all go before. The main complication I see is that the null has to go before the ref operand, which may be produced by an arbitrary sequence of instructions.

tlively commented 1 year ago

I realized that there is a completely trivial fix to the problem of cast inputs getting refined such that they are no longer supertypes of the cast target type. When the input is refined, we can fix up the cast target type to be the greatest lower bound of the original target type and the newly refined input type. This does not change the cast behavior and it maintains the invariant that the target type is a subtype of the input type.
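
As a hedged worked example, again with hypothetical sibling struct types $B and $C under a common supertype $A:

  ;; Original:            br_on_cast $l (ref null $A) (ref null $B)
  ;; Operand refined to:  (ref null $C)
  ;; Fixed-up target:     glb((ref null $B), (ref null $C)) = (ref null none)
  ;;   br_on_cast $l (ref null $C) (ref null none)
  ;;
  ;; (ref null none) <: (ref null $C), so the constraint holds, and the cast
  ;; still succeeds exactly when the operand is null, as before.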

Since there is such a trivial workaround after all, I will close this issue and we can keep the rt2 <: rt1 requirement in the spec.