Closed ngzhian closed 1 year ago
@yurydelendik @Maratyszcza
Nice. The 0x80000000
and 0xfffffffe
are now included.
Thanks.
either/or is a construct introduced by the threads proposal https://github.com/WebAssembly/threads/blob/main/interpreter/text/lexer.mll#L455 would you prefer to have a comment in the tests referencing the threads proposal?
Thanks yeah, please add it so its easier to find.
This PR was landed without landing #140, which updates the test suite. Moreover, this PR does not actually match the changes to the test suite in #140, see my comments there. The overview has never been updated either.
I'm a bit concerned about all this. The relaxation spec'ed here looks very ad-hoc and seems to cover some rather arbitrary implementation artefacts. It's not clear why these should be simultaneously okay and also exhaustive, or what the driving principle is. At the same time, the test suite seems to be used without the updated tests, so neither does this relaxation seem essential, nor has it been tested in the wild.
As a result, spec and tests are simply inconsistent at the moment, while it's unclear which version is more relevant. That suggests that more discussion is needed, and perhaps this spec change should be rolled back unless that is properly resolved?
This PR was landed without landing #140, which updates the test suite. Moreover, this PR does not actually match the changes to the test suite in #140, see my comments there. The overview has never been updated either.
I can take a look and update the overview.
I'm a bit concerned about all this. The relaxation spec'ed here looks very ad-hoc and seems to cover some rather arbitrary implementation artefacts. It's not clear why these should be simultaneously okay and also exhaustive, or what the driving principle is. At the same time, the test suite seems to be used without the updated tests, so neither does this relaxation seem essential, nor has it been tested in the wild.
Because neither the threads proposal (where the syntax was used) and nor this proposal were merged previously, so engines have to do something custom to support running these tests anyway. The thing that happened though is that the threads proposal no longer uses the 'EITHER' syntax, and this was definitely wonky to work around to begin with. The spec tests as they are likely need to be refactored to be more intuitive to work with.
As a result, spec and tests are simply inconsistent at the moment, while it's unclear which version is more relevant. That suggests that more discussion is needed, and perhaps this spec change should be rolled back unless that is properly resolved?
Perhaps we could resolve the open bits or pair down the tests in #144 to just cover what's represented in the spec change and then merge that one instead? cc@yurydelendik
@dtig, that sounds okay, but I still wonder whether there should be a more thorough discussion. These changes appear to have happened rather last-minute, without much feedback from even the subgroup. They almost look like just waving through whatever artefacts a couple of hand-selected implementations happened to produce first.
If we consider the additional changes from #140, then the set of behaviours already becomes so numerous and selectively ad-hoc that it would be cleaner and more future-proof to simply say that the result for the edge cases of the trunc instructions could be anything, i.e., is plain non-deterministic. Is there a practical reason why that would not be sufficient?
Re the either syntax: Can you elaborate what the problems were, so I can address them? On the wasm-3.0 branch I hope it works fine, I recently fixed a couple of bugs in its translation to JS, in case that was an issue. The state of the threads proposal is a whole different story. :)
@dtig, that sounds okay, but I still wonder whether there should be a more thorough discussion. These changes appear to have happened rather last-minute, without much feedback from even the subgroup. They almost look like just waving through whatever artefacts a couple of hand-selected implementations happened to produce first.
Thinking back, the reason I personally thought this was fine to merge was (a) proposal is niche enough that we don't expect several implementations to want to target it. (b) this proposal introduces an inherent tension between the specification and the implementations because the ideal thing for the specification is to underspecify in case of diverging behaviors, but implementations would want to be consistent with each other for portability of applications.
If we consider the additional changes from #140, then the set of behaviours already becomes so numerous and selectively ad-hoc that it would be cleaner and more future-proof to simply say that the result for the edge cases of the trunc instructions could be anything, i.e., is plain non-deterministic. Is there a practical reason why that would not be sufficient?
The practical reason is specific to browser VMs (and maybe can be extended to other implementations too?), and is that they need to be interoperable with each other, so even if the spec is underspecified, implementations should still have the same results on the same hardware. Now whether these tests should be in the spec repository is something we could discuss but we definitely need tests to hedge against generating the wrong (but still allowed by the spec) codegen on a particular hardware.
Re the either syntax: Can you elaborate what the problems were, so I can address them? On the wasm-3.0 branch I hope it works fine, I recently fixed a couple of bugs in its translation to JS, in case that was an issue. The state of the threads proposal is a whole different story. :)
Oh the wonky bit was that the tests imply that the set of allowable values is smaller than it actually is whicch probably is another reinforcing factor for having a broader discussion about what should live in the spec tests. I also remember this not working out of the box, but that might have been because there are slight differences between V8's infra for running spec tests - I can try again with the Wasm 3.0 branch and report back.
@dtig, hm, I'm afraid I'm even more confused now.
Thinking back, the reason I personally thought this was fine to merge was (a) proposal is niche enough that we don't expect several implementations to want to target it.
Once standardised, everybody has to implement a feature according to the specification. How does it matter that it's niche? Are you implying that relaxed SIMD is too niche to be standardised?
(b) this proposal introduces an inherent tension between the specification and the implementations because the ideal thing for the specification is to underspecify in case of diverging behaviors, but implementations would want to be consistent with each other for portability of applications.
What is the purpose of having a specification and standard, other than defining what can be relied upon across all implementations?
If there is a tension, then either (a) the requirement is not realistic, or (b) its specification is not adequate, or (c) implementations are lazy. Either way, something has to be done about it.
The practical reason is specific to browser VMs (and maybe can be extended to other implementations too?), and is that they need to be interoperable with each other
Why would you say that this is only relevant to web engines? Portability is a — if not the — central value proposition of Wasm, and should be considered equally relevant to other ecosystems.
so even if the spec is underspecified, implementations should still have the same results on the same hardware. Now whether these tests should be in the spec repository is something we could discuss but we definitely need tests to hedge against generating the wrong (but still allowed by the spec) codegen on a particular hardware.
It did not occur to me that cross-implementation consistency of relaxed behaviour was an intended requirement. The overview literally says it "depends on the implementation", which suggests the contrary. In fact, my recollection is that we concluded that you cannot even rely on the same implementation producing the same results between multiple runs. And indeed, it's unclear how "hardware-defined" behaviour could even be specified (what is "equal" hardware?), or enforced with tests, or make sense at all given practices like virtualisation.
Regardless, the current spec and the tests don't achieve that either. AFAICS, enumerating all known behaviours doesn't contribute to addressing this requirement. So in that regard, nothing changes, no matter whether spec and tests allowed 4, 8 or 2^128 edge case behaviours.
We're talking past each other, this is likely a much easier over a high bandwidth discussion. There's a SIMD sungroup meeting scheduled on Nov 22nd, would that work for you?
@dtig, hm, I'm afraid I'm even more confused now.
Thinking back, the reason I personally thought this was fine to merge was (a) proposal is niche enough that we don't expect several implementations to want to target it.
Once standardised, everybody has to implement a feature according to the specification. How does it matter that it's niche? Are you implying that relaxed SIMD is too niche to be standardised?
I'm not implying that relaxed SIMD is too niche to be standardized - I'm implying that it's possible that not all implementations will care about the relaxed semantics or the proposal and fall back to the strict semantics if needed.
(b) this proposal introduces an inherent tension between the specification and the implementations because the ideal thing for the specification is to underspecify in case of diverging behaviors, but implementations would want to be consistent with each other for portability of applications.
What is the purpose of having a specification and standard, other than defining what can be relied upon across all implementations?
If there is a tension, then either (a) the requirement is not realistic, or (b) its specification is not adequate, or (c) implementations are lazy. Either way, something has to be done about it.
The specification yes, but I thought this discussion was focused on the tests - or are you considering them both to be the same? My opinion was always that (b) the specification isn't adequate to explain the nuances of hardware support for this proposal. But after several CG discussions, the vote was in favor of under specifying relaxed operations and I'm unwilling to participate in relitigating that particular discussion.
The practical reason is specific to browser VMs (and maybe can be extended to other implementations too?), and is that they need to be interoperable with each other
Why would you say that this is only relevant to web engines? Portability is a — if not the — central value proposition of Wasm, and should be considered equally relevant to other ecosystems.
Applications will still be portable, because the hardware edge cases aren't something applications rely on (which is also what the goal of this proposal is to relax). But implementations do care about generating the same code on the same hardware.
so even if the spec is underspecified, implementations should still have the same results on the same hardware. Now whether these tests should be in the spec repository is something we could discuss but we definitely need tests to hedge against generating the wrong (but still allowed by the spec) codegen on a particular hardware.
It did not occur to me that cross-implementation consistency of relaxed behaviour was an intended requirement. The overview literally says it "depends on the implementation", which suggests the contrary. In fact, my recollection is that we concluded that you cannot even rely on the same implementation producing the same results between multiple runs. And indeed, it's unclear how "hardware-defined" behaviour could even be specified (what is "equal" hardware?), or enforced with tests, or make sense at all given practices like virtualisation.
Regardless, the current spec and the tests don't achieve that either. AFAICS, enumerating all known behaviours doesn't contribute to addressing this requirement. So in that regard, nothing changes, no matter whether spec and tests allowed 4, 8 or 2^128 edge case behaviours.
I am talking about both tests and specification, and their relation to each other and to implementations.
The problem isn't that the spec is underspecifying, but that it apparently is not underspecifying enough. It is a bug (in either spec or implementations) if the spec allows less than what implementations do in the wild. But judging from #140, such an inconsistency is exactly the situation we are in right now: implementations do something that is not covered by the spec, even after this PR. And that needs fixing.
Hence, either we are willing to force implementations to comply with what is actually specified right now, or we need to relax the spec further. I would expect you'd rather do the latter.
That's why I suggested to conservatively make it ultimately relaxed by just wholly under-specifying the result for the trunc edge cases in question. Because real-world behaviour appears to be diverse enough that any attempt to enumerate all possibilities will never be useful nor exhaustive wrt to possible future implementations (and further relaxing it in the future to bless new implementations would technically be a breaking change, so isn't an option).
One nuance here is that the JS/web specs can impose additional restrictions, and consequently, JS/web tests may reject more behaviours than the core tests. But they can never allow more, as that would obviously be a violation of the core standard. So the kind of solution you seem to be envisioning with moving the more relaxed tests elsewhere does not fix the situation.
There's a SIMD sungroup meeting scheduled on Nov 22nd, would that work for you?
Hm, that's Friday night over here, I'm afraid that doesn't work for me. But I could make any other day.
These values show up due to faster algorithms used for relaxed trunc unsigned when hardware does not support single-instruction lowering.
See https://github.com/WebAssembly/relaxed-simd/pull/140 for more discussions.