bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0
15.34k stars 1.3k forks source link

Support WebAssembly memory instructions on big-endian platforms #2124

Closed uweigand closed 3 years ago

uweigand commented 4 years ago

Feature

Current wasmtime maps the WebAssembly memory instructions (t.load, t.store etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).

This causes problems on big-endian platforms, because the Cranelift IR instruction are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to use little-endian byte order always.

Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.

Some examples of those include:

In addition, there are cases where -while not strictly necessary for correctness- it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, when implementing code such as inlined copies for small memcpy etc.

So, I believe we need some way of representing both always-little-endian memory operations (used to translate the WebAssembly instructions), and native memory operation (used for everything else).

Benefit

Enabling support for Wasmtime on big-endian platforms like IBM Z.

Implementation

My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is: load_le load_le_complex store_le store_le_complex uload16_le uload16_le_complex sload16_le sload16_le_complex istore16_le istore16_le_complex uload32_le uload32_le_complex sload32_le sload32_le_complex istore32_le istore32_le_complex

Advantages of this approach include:

But there are disadvantages:

Alternatives

There's various alternative ways this could be implemented:

A) Add an additional flag argument to load/store instructions that specifies the requested byte order. A detail question is whether the flag is little-endian vs. native little-endian vs. big-endian little-endian vs. big-endian vs. native

Advantages:

Disadvantages:

B) Add an additional flag bit to the existing MemFlags Advantages:

Disadvantages:

C) Open-code the conversion in the WebAssembly translator Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice-versa for stores. This would probably require addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).

Advantages:

Disadvantages:

uweigand commented 4 years ago

@cfallin @julian-seward1 - opened issue as discussed in our call today.

uweigand commented 3 years ago

See the PR I just posted. This is not necessarily ready to go in as-is, it is intended to showcase a possible implementation.

The PR implements a variant of method (C) above, where I'm creating just two new Cranelift IR opcodes, LoadRev and StoreRev - load reverse and store reverse. (These map directly to instructions IBM Z has, and a number of other platforms have as well.)

Note that there are no extending load / truncating store versions of those (I'm not aware of any ISA that directly has them, and in any case any new back-end could just match the combination if necessary). Likewise, there are no _complex versions, those are no longer really used with new back-end anyway.

fitzgen commented 3 years ago

Thanks for exploring the design space and laying out our choices @uweigand.

I would personally be in favor of alternative A. I think having clif's semantics defined irrespective of the current target is valuable and will make hacking on the compiler easier.

(I think B is also a strong contender, but A will be easier from an engineering perspective, since we can leverage compiler errors to automatically tell us exactly which code needs updates, rather than manually hunting down places where mem flags are silently dropped and/or created with defaults.)

Although I don't strongly hold this opinion, I'm leaning towards only discriminating between big and little endian, not including a native endian option.

If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

We can expose the target's endianness relatively easily, since our target-lexicon crate already has this information, and the cranelift-frontend builders can implement a load-with-native-endian builder method.

https://docs.rs/target-lexicon/0.11.1/target_lexicon/enum.Endianness.html

cfallin commented 3 years ago

+1 to the combination of two choices:

In other words we want existing users of the builder API to continue to see the same semantics, and new builder methods can take an Endianness. Removing the "native" option at the CLIF level fully closes the target-specific semantics hole, which is nice!

uweigand commented 3 years ago

I've made an initial attempt at implementing something along those lines here: https://github.com/uweigand/wasmtime/pull/1

Note that this is currently not ready to even attempt merging upstream, in particular because the addition of the extra endianness member to various clif instructions makes the InstructionData structure exceed its target size of 16 bytes.

So either we find some way to encode InstructionData more efficiently (but that would need advice from somebody much more familiar with Rust than I am ...), we accept the (significantly) memory consumption, or else we go back and try something else.

However, adding the explicit endianness member also makes the code more verbose, if you look at the patch above you'll see a large number of mostly "boilerplate" changes. As an aside, I have not been able to implement this suggestion:

Make the current load-instruction builders choose the target's native endian.

as is (or maybe I misunderstood it). The builders that are used by most of the code are the ones automatically generated by the "meta" machinery, and not only did I not find an easy way to provide a default value for any member, but the builders don't seem to even have the required information (they don't have access to the TargetIsa or triple or anything else to determine the default target endianness, unless I'm missing something). That said, changing the callers to provide an explicit endianness (usually retrieved from the TargetIsa) was mostly straightforward.

As a result of those issues with adding and explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness with the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptically about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus this cannot be silently dropped.

I'll see if I can come up with another implementation along those lines.

Finally, I'd appreciate any further comments on the patch above. In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

fitzgen commented 3 years ago

As a result of those issues with adding and explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness with the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptically about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus this cannot be silently dropped.

This sounds very reasonable to me.

In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

This sounds good to me.

uweigand commented 3 years ago

Here's an updated patch set along those lines: https://github.com/uweigand/wasmtime/pull/2

I'm still not completely happy, in particular because of the following two issues:

Comment on these questions as well as general comments on the approach are welcome - thanks!

cfallin commented 3 years ago

At this point, there's usually no TargetIsa available so the endianness needs to be hardcoded. (In any case it is later ignored anyway.) @cfallin you added this recently -- would it maybe be preferable to just attach a may_trap flag instead?

Hmm, yeah, I think we may want to reconsider this. I had refactored things away from attaching a SourceLoc to every may-trap memory instruction explicitly, as that was becoming untenable, and plumbing the flags through was simple enough; but considering the bits that actually exist in the MemFlags, the trap-ness is the only thing that wouldn't otherwise translate to a machine-specific thing (e.g. alignedness should manifest in how the lowering chooses instructions, if at all, not in flags on one machine instruction).

In particular, the endianness flag at the CLIF level should translate to a different instruction sequence at the VCode level on most machines (excepting only a hypothetical machine that has a "swap bytes" bit on every load instruction). So it would be suboptimal to be able to represent meaningless flags at the VCode level.

An alternative we had considered was to instead make the trap-ness part of a new "instruction metadata", along with the safepoint bit, that lives alongside the MachInsts in the body rather than inside of them. I'm starting to warm up to this more; trap-ness is analogous to safepoint-ness in that both result in metadata attached to the MachBuffer during emission, and both are properties of the instruction that are known when it is given to the LowerCtx.

So I might suggest the following: (i) create a MachInstMetadata that has, for now, two flags (safepoint, may_trap); (ii) unify LowerCtx::emit() and LowerCtx::emit_safepoint() to emit_with_opts() and emit(), the latter forwarding to former with MachInstMetadata::default(); (iii) store the metadata alongside instructions; (iv) feed the metadata to EmitState where we currently attach the stackmap on safepoints.

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

uweigand commented 3 years ago

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

Thanks @cfallin, I've now updated the patch to carry a MemFlags in GlobalValueData::Load and left the MemFlags on machine instructions for now. The updated patch is now posted as PR #2488 above.

cfallin commented 3 years ago

Fixed with #2492.