String subsection - Githubissues

jgravelle-google commented 5 years ago

There was some discussion about using strings in the imports and exports, as well as immediates to call instructions in the (soon to be proposed) interface adapter function instructions. There are a few questions here:

1) Should we constrain the design of this proposal such that it must be polyfillable?
2) Are strings too inefficient size-wise?
3) Are they uncomfortable semantics-wise?

I think 1) is an interesting discussion and we should have it somewhere (though I'm inclined to just say "yes"), and I don't think I've heard anyone actually say 3), but we might want something related to 2) regardless.

We will probably want a subsection for strings that deduplicates them. For example, given a set of import bindings (syntax and semantics are made up, just note the structure):

(@interface func $foo (export "foo") (allocator "malloc") ...)
(@interface func $bar (export "bar") (allocator "malloc") ...)
(@interface func $baz (export "baz") (allocator "malloc") ...)

we can translate that to a form closer to the binary layout:

(@interface-section
  (string-subsection
    (0 "foo")
    (1 "bar")
    (2 "baz")
    (3 "malloc")
  )
  (func-subsection
    ($foo (export 0) (allocator 3) ...)
    ($bar (export 1) (allocator 3) ...)
    ($baz (export 2) (allocator 3) ...)
  )
)

This is similar to having a type subsection, which we already need.

This is separate from the names section, because 1) polyfill-friendliness means keeping all the data in the custom section itself, and 2) these are non-omittable string immediates. I imagine most modules will have enough repetition that this out-of-lining will be a good size savings on average.
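To make the out-of-lining concrete, here is a minimal Python sketch of the interning step, assuming each binding is just a pair of strings. The function name `build_string_table` and the dict keys are hypothetical and only mirror the made-up syntax above; they are not part of any proposal. Indices are assigned in first-seen order, so they come out differently from the hand-numbered sketch above.

```python
# Hypothetical sketch: names and data layout are assumptions for illustration.
def build_string_table(funcs):
    """Intern every string field; return (table, funcs rewritten to indices)."""
    table = []      # index -> string, in first-seen order
    index_of = {}   # string -> index

    def intern(s):
        # Reuse the existing index if the string was already seen.
        if s not in index_of:
            index_of[s] = len(table)
            table.append(s)
        return index_of[s]

    rewritten = [
        {"export": intern(f["export"]), "allocator": intern(f["allocator"])}
        for f in funcs
    ]
    return table, rewritten


funcs = [
    {"export": "foo", "allocator": "malloc"},
    {"export": "bar", "allocator": "malloc"},
    {"export": "baz", "allocator": "malloc"},
]
table, rewritten = build_string_table(funcs)
# "malloc" is stored once in `table`; every binding refers to it by index.
```

Each string appears once in the table however many bindings mention it, which is where the expected savings would come from.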

rossberg commented 5 years ago

I'm indifferent on having such a section, but I'd like to throw in a note on the design of the S-expr syntax. In Wasm, the head of every subexpression always is a keyword -- or an explicit constructor name, if you think of it as an AST. It intentionally avoids "headless" S-exprs like the entries in your string/func-subsection. That makes it somewhat more verbose, but also more uniform and easier to parse/print in a generic manner, i.e., without depending on knowing the grammar.

alexcrichton commented 5 years ago

For the question of polyfillability I've opened https://github.com/WebAssembly/webidl-bindings/issues/58 to have a dedicated issue for that, but if we go with string-based APIs I would at least naively agree that we should have a string-subsection sort of system where the string "malloc" doesn't have to show up all over the place in adapter expressions/instructions.

That being said, this seems like the sort of thing that gzip is really good at, so it's probably worthwhile to hold off on having a string subsection for a while (for simplicity), and then when we start seeing some larger modules we could experiment with different encodings. Basically, we'd compare a gzip'd module encoded without a string subsection against a gzip'd module with one.
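A rough Python sketch of that kind of experiment follows. It uses a toy byte encoding (one-byte length prefixes and one-byte indices), which is an assumption for illustration and not the wasm binary format; the function names and the synthetic input are hypothetical as well. Real measurements would need real modules.

```python
import gzip

# Toy encodings, assumed for illustration only: every string is under
# 256 bytes and there are fewer than 256 distinct strings.

def encode_inline(funcs):
    """Write each string in place, length-prefixed."""
    out = bytearray()
    for export, allocator in funcs:
        for s in (export, allocator):
            data = s.encode("utf-8")
            out.append(len(data))
            out += data
    return bytes(out)


def encode_with_table(funcs):
    """Write a deduplicated string table, then one-byte indices into it."""
    table, index_of = [], {}
    body = bytearray()
    for export, allocator in funcs:
        for s in (export, allocator):
            if s not in index_of:
                index_of[s] = len(table)
                table.append(s)
            body.append(index_of[s])
    out = bytearray([len(table)])
    for s in table:
        data = s.encode("utf-8")
        out.append(len(data))
        out += data
    return bytes(out + body)


def compare_sizes(funcs):
    """Return raw and gzip'd sizes in bytes for both encodings."""
    inline, tabled = encode_inline(funcs), encode_with_table(funcs)
    return {
        "inline_raw": len(inline),
        "inline_gzip": len(gzip.compress(inline)),
        "table_raw": len(tabled),
        "table_gzip": len(gzip.compress(tabled)),
    }


# Synthetic input: 200 exports that all share one allocator name.
funcs = [(f"func_{i}", "malloc") for i in range(200)]
sizes = compare_sizes(funcs)
for name, size in sizes.items():
    print(name, size)
```

The raw sizes show what the table saves in memory and on disk; the gzip'd sizes show how much of that saving survives compression for network transit.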

jgravelle-google commented 5 years ago

> That makes it somewhat more verbose, but also more uniform and easier to parse/print in a generic manner, i.e., without depending on knowing the grammar.

Yeah, that was more of a sketch of how the binary format could be laid out. I'd want to leave this string deduplication out of the text format entirely, e.g. avoiding (string $malloc "malloc") and (export $malloc), preferring just (export "malloc").

> That being said this seems like the sort of thing that gzip is really good at

There's also the concern of keeping the wasm binary in memory after decoding, and size on disk (probably cached). Gzip helps with network transit, but not peak memory usage. Though you're right that we do need to make sure we avoid making gzip's job harder, so it's probably wise to start simple and refactor once we get real data.

WebAssembly / interface-types

String subsection #57