formalize component value definitions

rvolosatovs commented 5 months ago

I used https://github.com/WebAssembly/component-model/blob/145752a9795f326875c347527040d0e2ebe87c35/design/mvp/Binary.md?plain=1#L363-L367, https://github.com/WebAssembly/component-model/blob/145752a9795f326875c347527040d0e2ebe87c35/design/mvp/Binary.md?plain=1#L265 as main points of reference for the initial approach.

Tried to stay as close as possible to core Wasm spec and the canonical ABI spec.

flags are encoded as variable-length uN bitmask - this does favor cases where flags with lowest indexes are set, however I feel like this is a reasonable approach to take at least for consistency with all other numerical values
enum and variant discriminants are encoded as variable-length u32-s, same argument applies as above. I do feel like for most use cases the discriminant should fit in a single byte
results and options match canonical ABI despecialization rules (https://github.com/WebAssembly/component-model/blob/main/design/mvp/CanonicalABI.md#despecialization)
- 0x00 is used to identify result::ok
- 0x01 for result:error
- option::none is 0x00
- option::some is 0x01
tuples and records are encoded as just the sequences of their field/element values, in order
- valtype is specified once in the top-level (value t:<valtype> v:<val(t)>), because, although type inference should be possible in most use cases by looking at the usages of the value index, values may potentially be unused by the component directly (for example, exported). I feel like it's a very little price to pay for simpler implementation and type checking.

val encoding reference implementation is available at https://github.com/rvolosatovs/wrpc/blob/4fd7488caa4cff19079f851cd7d0efda8f613a64/crates/transport/src/lib.rs#L2557-L3223

Refs #15

lukewagner commented 5 months ago

This looks good to me, so I'll wait another couple of days to let folks comment and if nothing new comes up, merge.

lukewagner commented 4 months ago

In case anyone isn't also following #352, the recent change above prefixes each value definition with a byte length (analogous to function bodies), so that values can be skipped-over during decoding or decoded in parallel, while still eagerly populating the value index space.

alexcrichton commented 4 months ago

One part of this that I would be concerned about this is that this non-trivially increases the complexity of the binary-to-text tooling. Today binary-to-text for core modules and components doesn't require any validation to happen. With this new section, however, it's not possible to translate binary-to-text without a validator because you'll need to keep track of the component type section to print the value section. A kernel of this concern can be seen in https://github.com/bytecodealliance/wasm-tools/pull/1541 where the decoding step has to take type information which isn't easily acquired right now.

Unfortunately I don't know of a great way to simplify this which doesn't involve lots of discriminant bytes everywhere which seems like this was trying to avoid, however.

lukewagner commented 4 months ago

@alexcrichton Yes, that's a real downside for sure. What's hard to weigh is whether avoiding that problem is worth adding all the discriminants to the encoding of values, which ultimately would mean that we'll need a separate binary encoding for situations where we don't want the size/decode overhead. It's hard to say for sure whether anyone will want to store very large values in a component, but, farther out in the future, maybe list<record { ... }>-typed value exports end up being nicer than raw text files for representing metadata/config, so... maybe? But yeah, I feel the tension.

alexcrichton commented 4 months ago

One possible alternative, albeit not a pretty one, would be for the text format to reflect the binary format where it'd be something like:

(value $foo (type $the_type_index) "\0\0\01\02")

where the text format just has the raw bytes and doesn't try to pretty-print the shape

lukewagner commented 4 months ago

@alexcrichton Oh right, I forgot that option! So then the byte-string (using the same string-literal format as data segments) could be considered the "base" AST whereas the (list 1 2 3) syntax is optional sugar.

alexcrichton commented 4 months ago

Ah yeah I like the idea of having sugar as well! I had forgotten that was an option, and that should make it much easier to hand-write as well.

Orthogonally, though, wouldn't (list 1 2 3) be invalid and you'd need something like (list (i32 1) (i32 2) (i32 3)) or similar? (basically won't the text format need to know how big the integer is supposed to be?)

lukewagner commented 4 months ago

Good point! Because yeah, whether we have discriminants in the text format is independent of whether we have discriminants in the binary format. I was thinking earlier only about the decoding direction, where you need type info to render values anyhow; but in the encoding direction, having the discriminants lets you parse without type info, addressing the original concern in #352. It also lets you write invalid test cases more-easily.

So at the moment we only require discriminants where required by the more-fundamental WAT convention that all S-Expression lists start with an atom, hence the discriminant in (list ...) and (record ...), but based on this reasoning, I think it makes sense to add them for all the scalars too, hence (list (i32 1) (i32 2) (i32 3)), like you're saying. Does that make sense to you too @rvolosatovs?

rvolosatovs commented 4 months ago

Good point! Because yeah, whether we have discriminants in the text format is independent of whether we have discriminants in the binary format. I was thinking earlier only about the decoding direction, where you need type info to render values anyhow; but in the encoding direction, having the discriminants lets you parse without type info, addressing the original concern in #352. It also lets you write invalid test cases more-easily.

So at the moment we only require discriminants where required by the more-fundamental WAT convention that all S-Expression lists start with an atom, hence the discriminant in (list ...) and (record ...), but based on this reasoning, I think it makes sense to add them for all the scalars too, hence (list (i32 1) (i32 2) (i32 3)), like you're saying. Does that make sense to you too @rvolosatovs?

Sounds good to me! Especially since this feature is considered unstable, perhaps we could revisit this later if we encounter any issues in the future

rvolosatovs commented 4 months ago

For reference, in its' current shape binary encoding proposal is implemented at https://github.com/rvolosatovs/wrpc/blob/84960bb739fd2a26c1d534c67954bbace0f2f005/crates/runtime-wasmtime/src/lib.rs#L482-L1090 in terms of Wasmtime types and values.

rvolosatovs commented 4 months ago

Updated the PR according to latest feedback gathered on and off-line. PTAL

rvolosatovs commented 4 months ago

I think this is ready for the next review round

WebAssembly / component-model

formalize component value definitions #336