Binary format: consider deferring some string literals until after code section

The spec for string.const is that it loads an element by index from a vector of string literals. Each module has an associated vector of string literals (possibly empty). This vector is encoded as a vector of WTF-8 (or possibly UTF-8; #2) sequences (vec(vec(u8))), and should be eagerly validated at validation time for valid WTF-8.

I initially thought the literals vector would be a section on its own, but it was rightly pointed out that it would be cleaner if possible to make it a segment in the data section. However right now the ordering of sections goes type, import, func, table, mem, global, export, start, elem, datacount, code, data. This ordering prevents some useful properties:

validation of string.const immediate operand in init expressions for globals and elem segments
validation that string.const immediate operand is in range in code section

I think probably everyone agrees that it would be good if a string.const instruction with an out-of-range operand were a compile-time error, and compatible with one-pass validation. I see a few ways to do this:

Use segment in data section. Record highest string.const operand seen when parsing module, and after processing data section, and error if any out of range access were present.
- Advantage: No new sections.
- Disadvantage: Late error detection is not very webassemble-y?
Use segment in data section, but add a new section to indicate salient early information like number of string literals. Think stringliteralcount, or datacount if it didn't come so late.
- Advantage: Data still in data section.
- Disadvantage: You add another junk section.
Add new stringref literal section, and put it just before its first potential use, so before the globals section.
- Advantage: Would work fine.
- Disadvantage: It seems silly to make a section for stringref literals. Also there is a code smell in that there are already a number of other string tables, so the obvious section name "strtab" isn't really usable.
Same as (3), but generalize it to an "early data section". The string literals would be in a stringref segment in that section.
- Advantage: Same as (3), but with some future extensibility?
- Disadvantage: Is there actually a need for such extensibility? If not let's not bother.

I think (4) might be best? If people agree then please say so. But in the meantime I would plug ahead with (3).

No strong feelings either way. IIUC, for non-optional sections, their name doesn't really matter, so (3) and (4) seem like essentially the same thing for now: we'd introduce "section 14" before the globals section, and time will tell whether some future proposal will put additional (non-string-constant) things into that section.

There is a difference wrt. validation: if the section is only for string literals, then it can be validated as it is encountered. If we want to allow for the possibility of there being arbitrary data in there, validation could only happen once a string.const instruction is encountered that references one of the byte vectors in section 14. But unless I'm missing something, this issue can be entirely postponed to a time where an actual future proposal comes along that wants to introduce such uses. (Example: if an engine now implements eager "blind" validation of section 14 requiring all-wtf8 content, and a new proposal comes along, then presumably modules using the new proposal's features won't run on this engine anyway. When that engine is updated to support the new features, this updating work will have to include changes to the validation of section 14. Once this work has been done, both old and new modules will validate just fine. So I don't see any particular backwards compatibility hazard.)

Andy, can you elaborate on how we could re-use the data section for string.const?

Specifically, I'm wondering:

Would string.const refer to just a data segment or a segment+offset+size?
Would the data segment kind be passive or a new string kind? a. passive would defer validation of WTF-8 until referenced in string.const, which seems sub-optimal. It would also disallow eager transcoding of the string data into the engine's preferred encoding until string.const. b. string would allow eager validation and transcoding, but adds a new case to data segments (which are already a bit complicated). Is it possible to data.drop a 'string' segment? Or 'memory.init' from a string segment? What format do we use for memory.init if we have done an eager transcode?

Currently, I would lean towards a stringliteral section to avoid overloading the data section any more.

As for placing the stringliteral section before the global section, I would be concerned about the size of this section. If I understand correctly, the data section was placed after the code section so that we could begin JIT compilation as fast as possible when streaming. That logic would seem to apply here as well.

There's a similar ordering issue with array.init_from_data that prevents it from being a constant expression. I would personally advocate for a split of the global section into 'declarations' and 'initializers' to solve this for both arrays and strings, while still keeping the large data sections at the end of a module.

Specifically:

Split the global section to 'global declaration' and 'global initialization' sections
- The globaldecl section would define the index space and types
- The globalinit section has init expressions for each declaration
- Unrelated, but this might help with WebAssembly/extended-const#9 too
Allow using the globaldecl section in place of the current combined global section
Require a globalinit section after the data section when global section is used
Place the stringliteral section before the data section
(optional/unrelated): Allow global init expressions to refer to previous globals, not just imported globals

I believe this has been proposed before, but I can't find it written anywhere.

Humm! Very good thoughts.

Regarding the operand to string.const -- the idea was that string.const refers to the nth string in the string literals table, and that at run-time the instance would keep around an association between literal indices and string contents (probably via a reference to the vec(u8) bytes and/or via a cached stringref value).

Regarding how to re-use the data section -- the idea was a new segment type. data.drop wouldn't really make sense in this idea because there's really no way to access the data other than via string.const. This might be an argument against using a data segment.

Regarding globaldecl/globalinit and stringrefs -- I think you would ideally want string.const with an invalid index to be a validation-time error everywhere. It's a static property of the module, after all. Moving the global initializers later would allow this for globals, but doesn't solve the problem for element and code sections. Splitting the global section does sound like a nice idea, of course.

You are right that the string literals table could be large enough that you'd ideally want to punt it to after the code. In that case I would propose that we have two new sections, then: stringliteralcount and stringliteral. stringliteralcount comes before globals and stringliteral comes before data. WYDT @eqrion & @jakobkummerow ?

That all sounds reasonable to me. Given that the functions/code split is well-established and useful, I'm easily convinced that other kinds of module contents benefit from a similar split.

I have minor concerns about performance. So it might be very good to get some experimental data (on both expected section sizes, and decoding/validation/instantiation performance). The background here is that in V8, for performance reasons we have already introduced split handling of global initializers: for cheap ones (i32.const, ref.null), we decode them eagerly during validation and store the result right away; more expensive ones store a reference to their wire bytes and get decoded again later (during instantiation). The same constraints that led to this implementation choice may well also influence module design: when an entire initializer can be stored in a few bytes, then doing that is both smaller (in wire bytes size) and cheaper to decode than enforcing a split for all cases. Conceivably, we might find something similar for strings. If the stringliteralcount is really just a count, there's no wiggle room; if we end up with a stringliteraldecl section that has an entry for each literal, then it could make sense to allow both inline storage of short strings (up to some TBD threshold length), and a "see offset/entry X in stringliteralcontents" encoding.

This leads to a related idea: for the globals section, we could try to find a way to be backwards-compatible, maybe roughly: instead of introducing a new section code for globaldecl, evolve the existing globals section, keep allowing already-shipped initializer instructions, and introduce a sentinel that means "see below, in the globalinit section".

I'm not sure yet what I think the spec'ed restrictions on string contents validation should be. On the one hand, early errors are nice; on the other hand, if we truly feel that we should design for huge string sections, shortening the critical path to application startup by allowing deferred work is also nice. Checking UTF-8 validity is quite fast and can be done with just local knowledge, so it's probably not a huge deal in this case because engines can hide this work in a background task -- as long as the design doesn't somehow preclude them from doing that.

So, for early prototyping right now, I think we can go ahead with a single stringliteral section, and split it later. The worst that could happen is that in the meantime we get actual data from early adopters about typical sizes of their string literals sections ☺

Humm humm! Perhaps this could work, then:

stringliteralsec ::= sec_14 (deferred:0x00 immediate:vec(vec(u8)))

Where the total string literal count could be |immediate| + deferred, with numbering starting with the immediate strings, and deferred could be expanded in some future to be a u32 indicating further data in a stringliteralcontents section to appear at some point after code.

Solution proposed in https://github.com/WebAssembly/stringref/issues/11#issuecomment-1130132626 is what is in the binary spec currently. Retitling issue to reflect current question: whether to add a deferred string literals table.

WebAssembly / stringref

Binary format: consider deferring some string literals until after code section #11