WebAssembly / shared-everything-threads

A draft proposal for spawning threads in WebAssembly

Data count section is insufficient for shared data segments #83

Open tlively opened 2 weeks ago

tlively commented 2 weeks ago

The point of the data count section is to pre-declare the number of data segments so that the code section can be parsed to refer to pre-allocated segments. The code should be able to be generated and then the segments should be filled in once the data section is parsed and everything should just work. But there's not currently any way to know in advance if a data segment will be shared, so if the code needs to be generated differently for shared segments, we're plumb out of luck.

It seems we need to be able to emit the data section before the code section or introduce a beefier version of the data count section that can declare data segment types.
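To make the constraint concrete, here is a minimal sketch (hypothetical names and structure, not any real engine's API) of how a streaming decoder pre-allocates segment slots from the data count, and why sharedness is still unknown at the point where code is compiled:

```python
class SegmentSlot:
    """A pre-allocated placeholder for one data segment."""
    def __init__(self):
        self.payload = None   # filled in once the data section arrives
        self.shared = None    # unknown until then -- the problem at hand

class StreamingModule:
    def __init__(self):
        self.segments = []

    def on_data_count_section(self, count):
        # Pre-allocate one slot per declared segment so the code
        # section can refer to them before the data section is parsed.
        self.segments = [SegmentSlot() for _ in range(count)]

    def on_code_section_instr(self, instr, seg_index):
        # E.g. compiling `memory.init seg_index`: we can resolve the
        # pre-allocated slot, but slot.shared is still None here, so we
        # can neither specialize codegen for shared segments nor check
        # shared-function restrictions.
        return self.segments[seg_index]

    def on_data_section(self, payloads):
        # payloads: list of (shared, bytes) pairs, in segment order.
        for slot, (shared, data) in zip(self.segments, payloads):
            slot.shared, slot.payload = shared, data
```

The sketch is only meant to show the ordering problem: `on_code_section_instr` runs before `on_data_section`, so any per-segment type information it needs must come from a pre-declaration.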

conrad-watt commented 2 weeks ago

> if the code needs to be generated differently for shared segments, we're plumb out of luck.

Is there also a more immediate problem with streaming validation? If we say that memory.init can't reference a non-shared segment in a shared function, the type of the segment isn't available in time for the code section to be validated.

I vaguely remember we discussed whether data segments would actually need shared and nonshared variants. Collapsing the distinction might theoretically leave some performance on the table, but might also be the least disruptive solution. If shared data segments did need different compilation, there could at least be an engine optimisation where modules with no shared functions could compile their data segments the "old" way. IIRC data segments can't be imported or exported so "thread escape analysis" is easier?

rossberg commented 2 weeks ago

Good points. As far as I'm concerned, the data count section was a wild hack and is an evolutionary dead end. Piling new hacks on top will just dig a deeper hole. The clean solution again is repeated and relaxed sections.

tlively commented 2 weeks ago

I would be happy if we required the data section to be ordered before the code section to support using shared data segments in the code. I don’t think we should jump all the way to fully relaxed sections without more motivation, though.

I agree that the data count section in particular is not something we want to grow organically to solve isolated problems, but I think the idea of systematically being able to decouple type declarations from their corresponding contents is fine. It’s what we do with the function section, for instance.

Do engines currently parallelize code section validation/compilation with loading of the data section?

jakobkummerow commented 1 week ago

> Is there also a more immediate problem with streaming validation?

Yes, good point! Our prototype implementation isn't quite far enough to expose this yet, but I'm pretty sure we will run into this in due course.

> repeated and relaxed sections

Unless I'm missing something, that solution would regress on the current design's motivation: to allow engines to start validating and/or compiling before the (potentially very large) data section has been downloaded.

> Do engines currently parallelize code section validation/compilation with loading of the data section?

Yes: V8 validates functions concurrently while the rest of the module is still downloading. V8 in its current default configuration doesn't compile functions eagerly, but AFAIK other engines do that, and the Compilation Hints proposal will likely provide a way for modules to explicitly request that behavior, so I think we do need a way to support that.
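As a rough illustration of this pipelining (not V8's or any other engine's actual architecture), function-body validation can be farmed out to worker threads as each body arrives, without waiting for later sections such as the data section; `validate_function` here is a trivial stand-in for real bytecode validation:

```python
from concurrent.futures import ThreadPoolExecutor

def validate_function(body):
    # Stand-in for real bytecode validation of one function body.
    return len(body) > 0

def stream_and_validate(chunks):
    # `chunks` yields function bodies in code-section order, as if
    # arriving from the network. Each validation job is kicked off as
    # soon as its body is available, overlapping with the download of
    # the remaining bytes (e.g. the data section).
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(validate_function, body) for body in chunks]
        return all(f.result() for f in futures)
```

The point of the sketch is that validation of early functions completes before the tail of the module exists locally, which is exactly why any information validation needs about data segments must precede the code section.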


Without reordering sections (and paying the resulting initialization latency price), other options would be:

(1) Introduce a "Shared Data Section", and a corresponding "Shared Data Count" section to precede the Code Section.

(2) Require shared data segments in the data section to be grouped together at the beginning of the section, and have a separate count section for them, so we could have e.g. `(module ... (data-count: 3) (shared-data-count: 2) (code: ...) (data: [shared], [shared], [non-shared]) ...)`. The superficial details could be decided differently without changing the fundamental idea: make shared segments go last instead of first, or make "data-count" and "shared-data-count" additive instead of the former being the total number of segments.

(3) Allow arbitrary interleaving of shared and non-shared segments in the data section, and replace the "data count" section with (conceptually) a bit vector indicating the sharedness of each segment. Or perhaps we'd call that a "data type" section then, and it'd have a byte for each segment, of which seven bits are reserved for future kinds of segment type annotations. Obviously this would come at a module size cost.
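For concreteness, option (3)'s byte-per-segment encoding could look something like the following sketch (the flag layout and helper names are invented for illustration; none of this is from the actual proposal):

```python
def encode_uleb128(n):
    # Standard unsigned LEB128, as used for counts throughout the
    # WebAssembly binary format.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Hypothetical flag byte: bit 0 = sharedness, bits 1-7 reserved for
# future segment type annotations.
SHARED_FLAG = 0x01

def encode_data_type_section(shared_flags):
    # Segment count, then one flag byte per segment, replacing the
    # bare count of the current data count section.
    body = encode_uleb128(len(shared_flags))
    body += bytes(SHARED_FLAG if s else 0 for s in shared_flags)
    return body
```

The size cost is one byte per segment on top of the existing count, which is the trade-off noted above.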


Side note: I'd like to point out that terms like "wild hack" and "evolutionary dead end", while they might be appropriately dramatic for expressing a strong personal preference, are unlikely to help us find consensus on the technical merits of various ideas.

rossberg commented 1 week ago

@tlively, for a while I also thought that separating declarations from definitions might be a sufficient solution. However, it has at least two shortcomings: (1) it isn't general enough to solve all streaming issues (e.g., in some cases, like globals, you may actually need to know the definition in time, not just its type), and inversely, (2) it is also too general, as it would naively allow arbitrary recursion between all definitions unless additional restrictions akin to ordered sections are imposed (globals are again an example, where we must prevent cyclic definitions, including with element segments).

So some form of sequential syntax & semantics for definitions appears to be needed either way (and is a natural fit for streaming). But that generally requires being able to split certain sections into multiple parts. And once we introduce that, it's not clear that we still gain much by also splitting declarations from definitions, other than more complexity.

@jakobkummerow, it helps to identify our mistakes instead of fooling ourselves and pretending it was a well-designed and scalable solution. This isn't about blame but about acknowledging the situation; I'm blaming myself for it as well. FWIW, we already have a similar issue with globals vs. element sections, which prevented us from allowing array.new_elem as a constant expression. And pre-declaring alone won't solve that instance, for the reasons mentioned above. So yes, our current approach is a dead end.

eqrion commented 1 week ago

@tlively

> Do engines currently parallelize code section validation/compilation with loading of the data section?

Yes, SpiderMonkey can start compiling functions in parallel and on background threads as soon as we download them off the network. Having to download the data section first before beginning compilation would regress the total time to compile a module.

@conrad-watt

> I vaguely remember we discussed whether data segments would actually need shared and nonshared variants. Collapsing the distinction might theoretically leave some performance on the table, but might also be the least disruptive solution. If shared data segments did need different compilation, there could at least be an engine optimisation where modules with no shared functions could compile their data segments the "old" way. IIRC data segments can't be imported or exported so "thread escape analysis" is easier?

I wasn't able to find any discussion of this, but I personally don't know of a reason to have the shared/nonshared distinction for data segments. SpiderMonkey already represents every data segment as an atomic refptr to an immutable vector of bytes. We do this because we hold on to the data segments from the module object, and modules are shared on the web (and probably everywhere?). So I don't think we'd ever really want to allocate these in some sort of local heap.

I also don't know of any code generation differences we'd do. Data segment payloads are immutable, so it's not like we'd need to emit any different kinds of loads/stores when reading from them from multiple threads.
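As a loose analogy for why immutability makes this easy (illustrative only, not SpiderMonkey code): if a segment payload is represented as an immutable `bytes` object, any number of threads can read it concurrently by reference, with no locks, copies, or special load instructions:

```python
import threading

# A data segment payload modeled as an immutable bytes object. Because
# it can never be mutated after creation, concurrent readers need no
# synchronization.
SEGMENT = bytes(range(16))

def reader(results, i):
    # Lock-free concurrent read of the shared, immutable payload.
    results[i] = SEGMENT[3]

def demo():
    results = [None] * 4
    threads = [threading.Thread(target=reader, args=(results, i))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The same reasoning is what makes a single refcounted, immutable representation sufficient for both "shared" and "non-shared" segments.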