WebAssembly / threads

Threads and Atomics in WebAssembly
https://webassembly.github.io/threads/
Other
703 stars 50 forks source link

Can we avoid data segment fill at every instantiation? #62

Open juj opened 7 years ago

juj commented 7 years ago

Reading https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md#initializing-memory-only-once, with

The data segments are always copied into linear memory, even if the same module 
is instantiated again in another agent. One way to ensure that linear memory is only 
initialized once is to place all data segments in a separate module that is only 
instantiated once, then share the linear memory with other modules.

if I understand this correctly, this will prevent Emscripten and other compilers from being able to place data and code sections in the same .wasm file, but must always use two separate files for them, similar to how .mem files are being used in asm.js.

It sounds like it would be easy to add a Module.instantiate(..., noDataSegment: true); type of machinery to avoid this? Given that there's an explicit paragraph about this in the document, I presume this approach has been considered already before, and was turned down? I'm curious, what was the conversation about it back then?

lars-t-hansen commented 7 years ago

It's good of you to bring that up again. The previous discussion started here: https://github.com/WebAssembly/threads/issues/16#issuecomment-304600033. There was IMO no consensus around the matter, but the matter was closed (by @binji) for reasons that are not clear to me.

juj commented 7 years ago

The drive to have one .wasm file that contains both the code and the data sections is commonly driven by requests from the community, i.e. some Emscripten users prefer to have as few files in the output as possible, to the point that even embedding the .wasm into a .html file or a .js file has been asked.

Technically we can definitely make do with two files, one .wasm with a .mem file for data section, which I think we do already support for wasm as well, though I'm sure people will ask if they can optimize away the .mem file when it suddenly becomes two files again. It can be a bit easier to manage caching if there is only one file instead of two, and one has less risk of loading incompatible versions of the two.

Another technical note is that since we'll need to keep the Module in memory throughout the lifetime of the wasm application in order to be able to accommodate pthread_create of new threads, it will be more efficient to separate the two files, so the memory initializer/data segment can be dropped at startup after it has been applied, and only the wasm module with the code still kept around. (I'm sure that will not stop people from asking to have only one file instead of two)

Even with the this optimization aspect noted, if it's all the same, I think it would be useful to have a flag for this, especially considering if it is simple to implement in practice. CC @kripken

jfbastien commented 7 years ago

I suggest designing the details of this feature and presenting this design to the CG.

binji commented 7 years ago

Sorry about that, I thought there was discussion in the CG meeting about this, but it appears there wasn't. I'm happy to write up a design for this unless you want to @juj.

juj commented 7 years ago

Thanks Ben, that would be much appreciated!

lukewagner commented 7 years ago

Agreed it would be nice to solve this problem directly. (I like allowing single-.wasm outputs too.)

One question is whether it's sufficient to add a dynamic option to disable all segment initialization or whether it'll ever be the case that we'll want to run some, but not all. @juj ?

Separately, it seems like we have the options of: flags passed to instantiate et al or adding new init_expr fields to the module (in a backwards-compatible manner) which, if evaluated to zero, disable initialization. I think I prefer the latter given that we already use init_exprs to parameterize segment initialization.

Another technical note is that since we'll need to keep the Module in memory throughout the lifetime of the wasm application in order to be able to accommodate pthread_create of new threads, it will be more efficient to separate the two files, so the memory initializer/data segment can be dropped at startup after it has been applied, and only the wasm module with the code still kept around.

This varies by browser, I believe, but at least in FF, the bytecode is only kept alive by the Module object, not the Instance object, so assuming references are dropped to the Module after instantiation, the bytecode will be released at the next GC. (Machine code and runtime metadata are stored in a hidden third object, shared by a Module and all its Instances.) So this needn't be a technical reason to use two files.

jfbastien commented 7 years ago

It's also worth considering per-thread init versus global init.

binji commented 7 years ago

Separately, it seems like we have the options of: flags passed to instantiate et al or adding new init_expr fields to the module (in a backwards-compatible manner) which, if evaluated to zero, disable initialization. I think I prefer the latter given that we already use init_exprs to parameterize segment initialization.

Interesting. It looks like we can add this by hijacking what we currently call the memory index, which fortunately must always be 0.

I like this idea but how would this work exactly? We want to make sure that it initializes memory the first time, but not subsequent times -- doesn't this fact need to be passed into the instantiate function somehow? Or were you thinking that it could query a global or something...?

lukewagner commented 7 years ago

It's also worth considering per-thread init versus global init.

I think that's what this would allow ("global init" would mean you disable the segment after the first instantiation, "per-thread init" means you wouldn't disable, just pass in different offsets); is there something else you had in mind?

Interesting. It looks like we can add this by hijacking what we currently call the memory index, which fortunately must always be 0.

Haha, I had exactly the same devious thought :) We could rename it to "flags" and then "has non-default table/memory index" would just be another flag at some point in the future.

I like this idea but how would this work exactly? We want to make sure that it initializes memory the first time, but not subsequent times -- doesn't this fact need to be passed into the instantiate function somehow? Or were you thinking that it could query a global or something...?

The init_expr would be a get_global and so ultimately this would be controlled at each thread's instantiation site via the value of the imported global value. So the first thread would pass a 1 (to initialize) and subsequent threads would pass 0 (to not initialize).

binji commented 7 years ago

The init_expr would be a get_global and so ultimately this would be controlled at each thread's instantiation site via the value of the imported global value. So the first thread would pass a 1 (to initialize) and subsequent threads would pass 0 (to not initialize).

I see -- you're suggesting that we provide a second init_expr here. This is actually kinda neat, because then we can more precisely control data segment initialization. (Probably saying what everyone else already figured out, feeling a bit slow today ... :slightly_smiling_face:)

jfbastien commented 7 years ago

I think that's what this would allow ("global init" would mean you disable the segment after the first instantiation, "per-thread init" means you wouldn't disable, just pass in different offsets); is there something else you had in mind?

Yeah, all I'm saying is I'd like to see per-thread init as a possible expansion in the proposed design. Doesn't need to be spec'd yet, but I want to know it's doable and what it looks like. One takeaway is that it's a property of a segment, not of the module or instance, whether initialization is global or per-thread.

lars-t-hansen commented 6 years ago

@jfbastien, maybe you can write more about what specifically you expect out of "per-thread init" and what the use cases look like, so that we can move this issue along.

The init-guard design ("Solution 2" in the doc) allows differing init behavior for different threads by manipulating globals referenced from the init-guard and the init-address expressions. In a situation (as in a browser) where we use non-shared-memory workers for threads it also allows multiple threads with different init behaviors to be initialized concurrently because the globals are private to the workers. Host code must set up the globals but this does not seem problematic.

The current design has per-segment init-guards and certainly allows host code to choose different segments, or place them at different addresses, on a per-instance basis, by manipulating the globals.

(It's hard for me to guess what future non-worker-based threads will look like; it seems likely that we'll not instantiate a module multiple times in that scenario, and that we need a very different type of initialization logic. The flags field of the init-guard design allows for the addition of more init-expressions, or indeed allows the init-guard to be interpreted differently.)

lars-t-hansen commented 6 years ago

The Nov meeting notes record this exchange:

BS: Additional proposal that came up last time about initialization - turning that off and on JF: Dynamic linking? BS: Being able to say turn off initialization on this thread - not really about dynamic linking BN: Is this something we want to poll now? Or wait? ??: Let’s put it in now and see if anyone objects

That is not hyper-informative, but separately Luke informs me that "binji agreed to add conditional init_exprs to data/elem segments". That sounds like Solution 2 to me. @binji, can you confirm that?

binji commented 6 years ago

Sorry, I should have gone through these issues and updated them with info from the CG.

Yes, the consensus as I understood it was to move forward with solution 2.

rossberg commented 6 years ago

Still not a big fan of this feature, since conditional initialisation seems too ad-hoc and doesn't give you anything you cannot already express with suitable modularization.

In my mind the underlying problem is that our design of data and element sections as "active" initialisers was wrong. What we should have done is make them "passive" data sources, and add explicit instructions that "apply" a segment, e.g. inside the start function -- or somewhere else. For example,

init_memory $data_seg_index
init_table $elem_seg_index

This would be much more flexible than what we have now. For example, besides conditional initialisation, it would allow initialising some time else than upon instantiation, it would allow reinitialising/resetting again later, it would allow applying a single segment many times -- especially if we also moved memory/table index and offset to those instructions.

Maybe it's not too late to extend the segment model in that direction?

lars-t-hansen commented 6 years ago

@rossberg, generalizing it further, it is basically a memcpy from a static source outside the heap, is it not?

If we're going down that road, a flag in the segment could mean "no address field present, never copy automatically".

rossberg commented 6 years ago

@lars-t-hansen, yes, at least in the case of memories (not sure if that accurately describes table segments, though). Repurposing the current zero byte as a flag distinguishing passive from active is what I was thinking.

jfbastien commented 6 years ago

That (flag + apply operation) seems feasible.

binji commented 6 years ago

OK, I'll spend some time writing up this as an alternative soon.

lukewagner commented 6 years ago

One thing to consider when designing an alternative: the useful thing about "active" initializers is that, at the end of module instantiation, the bytecode can be released; the instance doesn't have any residual dependencies. If initialization is a dynamic operation that can happen at any time, at least the data and elem sections would have to be kept around forever ultimately leading to two copies of the data.

To avoid this problem we could impose a dynamic requirement that init_memory/init_table can only execute in an instance while the instance is executing its start function (trapping otherwise); this would allow bytecode to be released when it is today.

rossberg commented 6 years ago

@lukewagner, by bytecode, do you mean the binary?

Nothing prevents you from defining a module that explicitly calls the start function again later, so I'm not sure such a restriction would help. And it would actually be a feature if you could reuse segments from other functions.

Would this necessarily mean that data sections have to be kept alive indefinitely? In the engine I would assume that you'd keep segments alive only by having references to them from functions that use respective instructions. Once they become garbage (like the start function could), so do the segments.

lukewagner commented 6 years ago

@rossberg Yes, the binary. Also, my suggestion was to only allow init_memory/init_table during the instantiation phase that executed the start function, not literally any time the start function was active on the stack.

What you're suggesting about having functions keeping references to sections doesn't really work when the entire module is a single GC'able unit. At least in SM (I think others), we don't GC individual functions.

rossberg commented 6 years ago

@lukewagner, I see. Are you talking about a runtime error, then? For example. the init instructions would trap if executed later?

How difficult would it be for an implementation to make a special case for the start function? That is, find out whether the start function is referenced anywhere else (via export or call or elem), and if not, keep its code separate from the rest of the module so that it can be thrown away immediately?

lukewagner commented 6 years ago

Yes, a runtime trap.

I think the problem is that we'd have a very implicit/silent cliff: probably we don't want to be using callgraph analysis (we don't build that graph for any other purpose) so this would be an optimization only if the init ops were literally inside the start function (and it wasn't exported, elem'd, etc). And even if we did build a callgraph, a simple indirect function call (maybe introduced by virtual function call) would throw it all off.

binji commented 6 years ago

We discussed this at the Nov 28 CG meeting. Some notes:

RyanLamansky commented 6 years ago

Why not leave the current Data section and behaviors as they are, and make a new section for a runtime accessible source?

lukewagner commented 6 years ago

@RyanLamansky That's an interesting point. In particular, when one has the ability to freely copy any dynamic [begin, end) subrange range into linear memory, there seems to be little need for segments.

rossberg commented 6 years ago

@lukewagner, multiple segments would still be needed to support merging of modules.

lukewagner commented 6 years ago

@rossberg Ah, great point.

binji commented 6 years ago

@RyanLamansky I like this idea, it makes things much simpler.

Oh, I hadn't really thought about it, but do we want to add the init_table instruction before we have set_table (i.e. set element in the table)? Also, I assume we'll want a way to put an empty element in the segment. Or should that only be possible with a clear_table instruction?

rossberg commented 6 years ago

@binji, right, this might make most sense as a package of bulk instructions. I'd think:

(Here I wish we could rename grow_memory and current_memory to mem.grow and mem.size. :) )

Something like set_elem (or table.set) is in a different category, unfortunately, because it requires non-trivial extensions to the type system and further instructions to produce respective operands. But I think the above actually covers all current use cases already.

FWIW, I'd prefer not to split out sections by flags. That wouldn't be very scalable considering that we might add more future flags and attributes that can apply in different combinations.

RyanLamansky commented 6 years ago

(Here I wish we could rename grow_memory and current_memory to mem.grow and mem.size. :) )

Since instruction names aren't included in the binary format, this wouldn't be a breaking change...

lars-t-hansen commented 6 years ago

Would it make sense that segments that are copied programmatically into the memory can also be discarded programmatically? That way we avoid the issue around when a segment is discarded, and whether it can be used after the start function, and we open up for segments that have different uses: some for initializing data, some as data sources for other uses (Jukka mentioned lazy initialization of "filesystem" data).

That is, mem.drop might drop a numbered data segment. If the segment has been dropped when it is referenced the operation should trap, as already discussed

To be clear, I propose to have only two variants: what's in the MVP, and a new programmatic interface where all initialization and discarding is performed by the program. The option with the 'discard bit' does not seem worthwhile in light of the programmatic interface.

rossberg commented 6 years ago

@lars-t-hansen, interesting idea. Makes sense, simpler (in a way) and more flexible.

binji commented 6 years ago

OK, I'll write this up in the proposal.

EDIT: Ah, I just realized @rossberg's suggestion for mem.copy above is essentially the same as move_memory from the bulk-memory-operations proposal, and mem.clear is a more limited form of set_memory. I'll omit them from this proposal for now, but I'm curious if others think we should just pull them in.

lars-t-hansen commented 6 years ago

Just to reflect a conversation going on in #85: In my opinion, mem.init and mem.drop should only be usable with what's called "inactive" segments in that PR, ie, segments that are not copied into memory at instantiation time. If not, then mem.init will be sensitive to GC timing (an "active" segment referenced by mem.init might or might not have been removed by GC already because the module object was reaped) and instantiation will be sensitive to the timing of program execution, possibly in another thread (a concurrent mem.drop might have removed the segment). Either way we would need to spec failure semantics that we don't need if we simply say that "inactive" segments are for run-time use and "active" segments are for instantiation-time use. It should therefore be a verification failure for mem.init and mem.drop to reference segments that are not marked as "inactive".

rossberg commented 6 years ago

Agreed with @lars-t-hansen. (FWIW, I also think that "inactive" has the wrong connotation of being deactivated; "passive" would be more fitting.)

lars-t-hansen commented 6 years ago

I could absolutely live with "passive".

min4builder commented 4 years ago

So, as far as I understand, there have been 3 main proposals:

  1. To specify that (shared) memories are only initialized once; other threads skip that.
  2. To extend segments to allow them to be applied conditionally (with explicit condition expressions).
  3. (the current one) To have segments initialize by default, and be able to turn initialization off altogether, but then be able to do initialization programmatically.

(1) is the obvious one. I see no issues with it, it should be relatively easy to implement, and solves the problem neatly.

(2) also solves the problem. It is perhaps more flexible, though the extra flexibility is pretty much to be able to overwrite certain parts of memory depending on which thread it is, which may or may not be very useful.

(3) includes a more useful version of (2), since initializing segments could be useful at runtime. However, with any current threading API, the default is to share memory without reinitialization (since nobody does that in the first place), and as it stands, to implement this, a module would have to go through each segment individually and manually initialize every single one of them, though only at the first thread's initialization. This seems like going through a lot of effort to implement something really common.

My suggestion would be to keep the memory.init et al. operators from (3) (as part of the bulk memory operations proposal, since this is completely orthogonal to threads), but make (1) the default behaviour. This should be simple to implement (in theory only a few conditionals) and (hopefully) to specify. (And as far as I understand, it was only not chosen because of the extra flexibility from (2), which (3) renders moot).

(also sorry for waking this issue after so long)

rossberg commented 4 years ago

@min4builder, there are rather fundamental issues with (1). For example, a single memory can be imported into multiple modules, each of which may initialise different (or overlapping) parts of it. This may happen before the memory is read the first time, but new modules importing the memory may also be created later and apply new segments. They could be initialising other parts, maybe ones added by a memory grow. There (intentionally) are no particular restrictions. With all that, there is no meaningful definition of "initialised once", and a programmable solution like (3) is the only sufficiently general one.

min4builder commented 4 years ago

@rossberg I'm not sure I understand these. By "initialized once" I mean creating new instances of an already instanced module do not apply segments again. If a new module is imported, that would involve creating a new instance, which would apply the segments. Though now that I think about it, dynamically importing a module from one thread would not necessarily import it from other threads as well... But if it were imported, the memory would still not be reinitialized because there would already exist an instance. I still think it could work.

rossberg commented 4 years ago

Modules (as opposed to instances) are stateless objects. And they are copied between threads, not shared. Recognising what's a "reinstantiation" of the "same" module would require a fundamental change to this model, since it implies some form of global per-module state flag.

min4builder commented 4 years ago

Ah, I see now what you mean. The implementation is sharing memories only, and the "already initialised" bit would have to be per-module, not per-memory.

Though perhaps it is possible to adapt whatever mechanism there is to find shared memories to find existing instances? I'm not sure how the implementations work in this regard, and couldn't find anything about it in the spec, so I don't know.

I'm looking at this from a non-web embedder point of view, so implementing my suggestion from scratch might be a lot easier than adding it to existing implementations.