Decide on binary encoding

wingo commented 2 years ago

We have about 9 stringref instructions and 14 stringview instructions. They need opcodes. I propose to use the 0xfc prefix (spidermonkey calls it the "misc" prefix, v8 calls it "numeric" but really it's quite miscellaneous). There are currently 18 opcodes assigned in this prefix, from saturating float-to-int conversions, bulk memory operations, and some reftypes instructions (table.grow, table.size, table.fill). I'm combing the stage 1 to 3 proposals for other allocations, but if not I'd just take the opcodes that I need starting at 0x12.

Another option would be the GC prefix, but given that we don't depend on GC, and the future scope of GC is less limited, probably it's best to leave the GC prefix to the GC people.

Not sure which typecodes to take at the current time. This issue is for anyone who has opinions; I'll prepare a PR shortly.

jakobkummerow commented 2 years ago

I don't feel strongly about it either way, just some thoughts:

We could even introduce a new prefix, at least for now.
If reusing an existing prefix, I'd leave some gap to existing proposals, in case they add an instruction (e.g. start at 0xfc20 or 0xfc40), to avoid clashes between evolving in-flight proposals.
You could also leave gaps between blocks of related instructions (like we did for GC: 0xfb0 is struct-related, 0xfb1 is array-related, 0xfb3_ is i31 related, etc), but that's probably much less useful (it's mostly for aesthetics).
Empirically, many proposals go through some evolution of their instruction set, which often causes a desire to clean up the encoding space (to compact gaps and/or get nicer grouping) just before finalizing the proposal. Assuming this will happen for stringrefs as well, that's a reason why all decisions made now are temporary anyway, and we'll have a chance to reconsider them later.
I agree that starting at 0xfc12 would be a nice final state. That said, even the final color of this bike shed probably doesn't matter too much :-)

lars-t-hansen commented 2 years ago

As I've argued in connection with relaxed SIMD (https://github.com/WebAssembly/relaxed-simd/issues/51), prefixes are going to become dear if we keep introducing new ones even for small groups of instructions; it is better to reuse one if that is natural, and there are some suggestions in that thread for how to fit new instructions into the space of an existing prefix. IMO the misc prefix is a pretty good candidate, probably better than GC.

rossberg commented 2 years ago

Personally, I'd categorise this as a GC-related feature, so the GC prefix seems most adequate. It may not depend on any of the constructs in the GC proposal, but it clearly requires automatic memory management. (@titzer always argued we should say managed data, because GC is an implementation detail.)

wingo commented 2 years ago

Thanks for feedback. In parallel I drew up a draft based on the misc prefix + contiguous numbering. Happy to switch to GC if that's consensus. (I assume @rossberg that you are considering reference counting to be automatic memory management :) Happy also to use e.g. the end of the misc space, which would explicitly punt on final allocation until later.

wingo commented 2 years ago

I also have some draft allocations for additional reftype entries. It's a little gross but the space is filling up. Also there's a new section, 13; I didn't check beforehand and I now see that exception handling uses this section number and I will need to change.

titzer commented 2 years ago

I do like the idea of categorizing this under "managed data" and putting it under the 0xFB prefix.

wingo commented 2 years ago

On https://github.com/WebAssembly/stringref/pull/10 I have the opcodes starting at 0xfb80. However with other prefixes we are starting to use LEB encoding for the sub-instructions, which would push us into three-byte territory. I think there should be fewer than 256 GC instructions in the end? And if not we can do a nested prefix thing. Anyway current proposal is 0xfb 0x80 and so on, i.e. two bytes.

rossberg commented 2 years ago

I can't find the relevant notes quickly, but this was discussed at some meeting, and I believe we decided that all secondary instruction opcodes should be in LEB.

jakobkummerow commented 2 years ago

@rossberg Interesting; you're probably referring to this meeting and this issue.

I'm now finding this decision somewhat unfortunate; that said it should become a non-issue for the case at hand if we compact the encoding space before finalizing the GC and stringrefs proposals.

My reason to dislike it is what different encodings mean for the number of encodable instructions per wire bytes size, where we essentially have a spectrum of how many leading 1-bits are used in the 2nd byte (with the 1st byte being the prefix) to indicate "there's a 3rd byte coming"; LEB is (a variant of) the "1" case:

marker 1-bits	2nd byte mask	encodable 2-byte instructions	encodable 3-byte instructions (2nd byte range × 3rd byte range)
1 (LEB)	0x80	128 (0x00 - 0x7F)	16384 (0x80..0xFF × 0x00..0x7F)
1 (max 3 bytes)	0x80	128 (0x00 - 0x7F)	32768 (0x80..0xFF × 0x00..0xFF)
2	0xC0	192 (0x00 - 0xBF)	16384 (0xC0..0xFF × 0x00..0xFF)
3	0xE0	224 (0x00 - 0xDF)	8192 (0xE0..0xFF × 0x00..0xFF)
4	0xF0	240 (0x00 - 0xEF)	4096 (0xF0..0xFF × 0x00..0xFF)
...	...	...	...
7	0xFE	254 (0x00 - 0xFD)	512 (0xFE..0xFF × 0x00..0xFF)
8	0xFF	255 (0x00 - 0xFE)	256 (0xFF × 0x00..0xFF)

I don't know what overall Wasm instruction set size y'all are expecting, but it seems more likely to me that we'll want to encode 128 < N <= 240 opcodes with a given prefix in just 2 bytes, than that we'll want to support more than several thousand (>4K, or even >16K in a 4-byte LEB sequence) opcodes in a single prefix. For reference, V8 currently knows about 51 GC-prefixed opcodes (excluding stringref), 25 stringref-related opcodes, and 254 SIMD opcodes (including relaxed; the higest-used index is 273, but there are a few gaps). So even if a large proposal like GC will double in size over time, and even if a huge feature set like SIMD will double in the future, we'll still be far away from needing thousands of opcodes per prefix.

There is the obvious consistency argument, but in my experience it's trivially easy for a decoder to select prefix-specific behavior after seeing the prefix byte, so there's no practical issue with each opcode doing its own thing (e.g. 0xFD/Simd can do LEB while 0xFB/GC could pick line 4 in the table above and 0xFA/future-who-knows-what could do yet something else).

titzer commented 2 years ago

@jakobkummerow Are you proposing a different variable length encoding scheme? The above kind-of looks like UTF-8. (Personally I hate UTF-8--we have that too in Wasm, but only to check for valid strings for import/exports!).

Personally I don't think a.) all prefixes should use the same scheme for the following bytes, and b.) saving a byte for instructions 128-256 after a prefix isn't worth it.

I would prefer we stick with LEBs for all the variable-length things.

rossberg commented 2 years ago

Personally I don't think a.) all prefixes should use the same scheme for the following bytes, and b.) saving a byte for instructions 128-256 after a prefix isn't worth it.

Agreed with @titzer. And yes, we should compact the opcode space for the GC proposal at some point.

jakobkummerow commented 2 years ago

@titzer

Are you proposing a different variable length encoding scheme?

No, what I'm saying is:

I don't see harm in letting each prefix decide how to encode its prefixed space of opcodes. (SIMD is stuck with prefix+LEB; that ship has sailed. It's fine: implementations simply do if (*pc == 0xFD) { read_simd_opcode(pc+1); } or something similar.)
I don't see opcodes as very good candidates for variable-length encoding schemes, because the set of opcodes is relatively small and relatively fixed (compared to, say, i32.const constants, type indices, and other module-controlled things). The LEB format has an obvious strength (namely: great scalability) and a cost (relatively few distinct values that can be stored in a single byte); IMHO opcodes don't really benefit from that strength (contrary to i32.const, I'm quite sure we'll never have 4 billion opcodes), so it's unfortunate when they have to pay the price.
I do worry about module wire size, so if there's a way to save one byte per instruction for a lot of instructions without incurring any practical drawbacks, then I would like us to do that.

For many prefixes, it would probably be fine to stick with a fixed-size encoding: prefix plus one more byte. If any prefix wants to allow extra room for future growth, it can accommodate that by reserving a few bit patterns; that's what leads to the table in my previous post. In the extreme case, a hypothetical prefix that thought it could get by with the 256 possible opcodes encodable by prefix+onebyte could at the very last minute decide to double its encodable range by assigning 0xFF to mean "there's a third byte" -- that's row 8 in the table.

Another way of phrasing it: I'm suggesting to let prefixes do what the main opcode encoding/prefixing scheme does (so one might call it "nested prefixes"). Consider: 0x00..0xC4 are currently one-byte opcodes, and 0xFB..0xFE are in use as prefixes. AFAICT, the spec never explicitly states what the boundary is between one-byte opcodes and prefixes; if it wants to go for a regular scheme eventually, then from existing uses we can already narrow down that it'll be one of the rows 3, 4, or 5 in the table above. (Row 2 would mis-classify 0xC4, row 6 would mis-classify 0xFB. Of course it could also end up being entirely irregular: if 0xFB and 0xF9 are prefixes, then 0xFA could technically still be a one-byte opcode.)

Clearly, when designing the original one-byte opcodes, the option to encode all opcodes as LEB was on the table, but folks decided that it would be sad if any opcode >=0x80 needed two bytes, so they went with a smaller range of reserved prefixes and a larger range of one-byte-encodable values. That sounds reasonable to me, and I don't see why within prefixed spaces the same reasoning should not hold.

Personally I hate UTF-8

I'm not questioning that opinion; in fact I'm not talking about personal preferences at all. I'm talking about technical merits: how many opcodes can we encode, how much space does that need, how quickly can we decode them.

wingo commented 2 years ago

Merged some binary encodings in #10. We seem to have rough consensus on using the GC prefix; fine. The remaining to-do item here is "to LEB, or not to LEB".

tlively commented 2 years ago

I would strongly prefer introducing a new prefix over changing the prefix+LEB encoding convention if we had to choose between the two. Having a single system for how opcodes are encoded makes the binary format much easier to understand than if each proposal were left to its own devices.

That being said, I would also be fine with using the GC prefix + LEBs. I expect string instructions to be similar to SIMD instructions in that they will mostly be concentrated in a few functions rather than spread pervasively throughout the code, so I don't think their sizes will be super important.

wingo commented 2 years ago

Am happy to switch to GC+LEBs. Also we can take the opportunity to just make separate instructions for everything that takes a wtf8 policy (#35) and renumber opcodes (##41).

jakobkummerow commented 2 years ago

For the record, the stringref prototype in V8 has switched to GC + LEB, which is aligned with what Binaryen emits for the stringref instructions.

WebAssembly / stringref

Decide on binary encoding #9