Open wingo opened 2 years ago
No strong feelings either way. IIUC, for non-optional sections, their name doesn't really matter, so (3) and (4) seem like essentially the same thing for now: we'd introduce "section 14" before the globals section, and time will tell whether some future proposal will put additional (non-string-constant) things into that section.
There is a difference wrt. validation: if the section is only for string literals, then it can be validated as it is encountered. If we want to allow for the possibility of there being arbitrary data in there, validation could only happen once a string.const
instruction is encountered that references one of the byte vectors in section 14. But unless I'm missing something, this issue can be entirely postponed to a time where an actual future proposal comes along that wants to introduce such uses. (Example: if an engine now implements eager "blind" validation of section 14 requiring all-wtf8 content, and a new proposal comes along, then presumably modules using the new proposal's features won't run on this engine anyway. When that engine is updated to support the new features, this updating work will have to include changes to the validation of section 14. Once this work has been done, both old and new modules will validate just fine. So I don't see any particular backwards compatibility hazard.)
Andy, can you elaborate on how we could re-use the data section for string.const
?
Specifically, I'm wondering:
string.const
refer to just a data segment or a segment+offset+size?passive
or a new string
kind?
a. passive
would defer validation of WTF-8 until referenced in string.const
, which seems sub-optimal. It would also disallow eager transcoding of the string data into the engine's preferred encoding until string.const
.
b. string
would allow eager validation and transcoding, but adds a new case to data segments (which are already a bit complicated). Is it possible to data.drop a 'string' segment? Or 'memory.init' from a string segment? What format do we use for memory.init
if we have done an eager transcode?Currently, I would lean towards a stringliteral
section to avoid overloading the data section any more.
As for placing the stringliteral
section before the global
section, I would be concerned about the size of this section. If I understand correctly, the data
section was placed after the code
section so that we could begin JIT compilation as fast as possible when streaming. That logic would seem to apply here as well.
There's a similar ordering issue with array.init_from_data
that prevents it from being a constant expression. I would personally advocate for a split of the global section into 'declarations' and 'initializers' to solve this for both arrays and strings, while still keeping the large data sections at the end of a module.
Specifically:
globaldecl
section would define the index space and typesglobalinit
section has init expressions for each declarationglobaldecl
section in place of the current combined global
sectionglobalinit
section after the data
section when global
section is usedstringliteral
section before the data
sectionI believe this has been proposed before, but I can't find it written anywhere.
Humm! Very good thoughts.
Regarding the operand to string.const
-- the idea was that string.const
refers to the nth string in the string literals table, and that at run-time the instance would keep around an association between literal indices and string contents (probably via a reference to the vec(u8)
bytes and/or via a cached stringref value).
Regarding how to re-use the data section -- the idea was a new segment type. data.drop
wouldn't really make sense in this idea because there's really no way to access the data other than via string.const
. This might be an argument against using a data segment.
Regarding globaldecl
/globalinit
and stringrefs -- I think you would ideally want string.const
with an invalid index to be a validation-time error everywhere. It's a static property of the module, after all. Moving the global initializers later would allow this for globals, but doesn't solve the problem for element and code sections. Splitting the global section does sound like a nice idea, of course.
You are right that the string literals table could be large enough that you'd ideally want to punt it to after the code. In that case I would propose that we have two new sections, then: stringliteralcount
and stringliteral
. stringliteralcount
comes before globals and stringliteral
comes before data. WYDT @eqrion & @jakobkummerow ?
That all sounds reasonable to me. Given that the functions
/code
split is well-established and useful, I'm easily convinced that other kinds of module contents benefit from a similar split.
I have minor concerns about performance. So it might be very good to get some experimental data (on both expected section sizes, and decoding/validation/instantiation performance).
The background here is that in V8, for performance reasons we have already introduced split handling of global initializers: for cheap ones (i32.const
, ref.null
), we decode them eagerly during validation and store the result right away; more expensive ones store a reference to their wire bytes and get decoded again later (during instantiation). The same constraints that led to this implementation choice may well also influence module design: when an entire initializer can be stored in a few bytes, then doing that is both smaller (in wire bytes size) and cheaper to decode than enforcing a split for all cases.
Conceivably, we might find something similar for strings. If the stringliteralcount
is really just a count, there's no wiggle room; if we end up with a stringliteraldecl
section that has an entry for each literal, then it could make sense to allow both inline storage of short strings (up to some TBD threshold length), and a "see offset/entry X in stringliteralcontents
" encoding.
This leads to a related idea: for the globals section, we could try to find a way to be backwards-compatible, maybe roughly: instead of introducing a new section code for globaldecl
, evolve the existing globals
section, keep allowing already-shipped initializer instructions, and introduce a sentinel that means "see below, in the globalinit
section".
I'm not sure yet what I think the spec'ed restrictions on string contents validation should be. On the one hand, early errors are nice; on the other hand, if we truly feel that we should design for huge string sections, shortening the critical path to application startup by allowing deferred work is also nice. Checking UTF-8 validity is quite fast and can be done with just local knowledge, so it's probably not a huge deal in this case because engines can hide this work in a background task -- as long as the design doesn't somehow preclude them from doing that.
So, for early prototyping right now, I think we can go ahead with a single stringliteral section, and split it later. The worst that could happen is that in the meantime we get actual data from early adopters about typical sizes of their string literals sections ☺
Humm humm! Perhaps this could work, then:
stringliteralsec ::= sec_14 (deferred:0x00 immediate:vec(vec(u8)))
Where the total string literal count could be |immediate| + deferred
, with numbering starting with the immediate strings, and deferred
could be expanded in some future to be a u32 indicating further data in a stringliteralcontents
section to appear at some point after code
.
Solution proposed in https://github.com/WebAssembly/stringref/issues/11#issuecomment-1130132626 is what is in the binary spec currently. Retitling issue to reflect current question: whether to add a deferred string literals table.
The spec for
string.const
is that it loads an element by index from a vector of string literals. Each module has an associated vector of string literals (possibly empty). This vector is encoded as a vector of WTF-8 (or possibly UTF-8; #2) sequences (vec(vec(u8))
), and should be eagerly validated at validation time for valid WTF-8.I initially thought the literals vector would be a section on its own, but it was rightly pointed out that it would be cleaner if possible to make it a segment in the data section. However right now the ordering of sections goes type, import, func, table, mem, global, export, start, elem, datacount, code, data. This ordering prevents some useful properties:
string.const
immediate operand in init expressions for globals and elem segmentsstring.const
immediate operand is in range in code sectionI think probably everyone agrees that it would be good if a
string.const
instruction with an out-of-range operand were a compile-time error, and compatible with one-pass validation. I see a few ways to do this:string.const
operand seen when parsing module, and after processing data section, and error if any out of range access were present.stringliteralcount
, ordatacount
if it didn't come so late.I think (4) might be best? If people agree then please say so. But in the meantime I would plug ahead with (3).