Thoughts on Improving Compilation Model

WebAssembly / design

WebAssembly Design Documents

http://webassembly.org

Apache License 2.0

Thoughts on Improving Compilation Model #1375

Open RossTate opened 4 years ago

RossTate commented 4 years ago

I foresee some upcoming needs to improve WebAssembly's compilation model (or at least its model within the web embedding). I have some rough ideas, but I thought it more important to get the conversation started than to pin down details, so here are some thoughts to discuss.

Compile-Time Imports

Currently all imports are provided at instantiation time. As such, they cannot affect how the module is compiled. This has the benefit of enabling multiple instantiations of the same module to share the same compiled code, but it has the cost of requiring all instance-specific data to involve a memory load and of preventing any instance-specific specialization of the compilation. My concern is that this benefit applies to only a few WebAssembly programs, and even then only to portions of those programs (i.e. only some imports will vary across instances), whereas the costs apply to many WebAssembly programs, and will grow as more low-level coordination is done across (share-everything) modules through imports.
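To make the current model concrete, here is a minimal sketch using the existing JS API. The byte array is a hand-encoded module written for this illustration (it imports an immutable i32 global `env.offset` and exports a function `get` that returns it); the names `offset` and `get` are assumptions, not part of any proposal. The module is compiled once and instantiated twice with different imports, so the compiled code cannot have the import folded in.

```javascript
// Hand-encoded module, roughly:
//   (module (import "env" "offset" (global i32))
//           (func (export "get") (result i32) global.get 0))
const moduleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f,                   // type: () -> i32
  0x02, 0x0f, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import "env"
  0x06, 0x6f, 0x66, 0x66, 0x73, 0x65, 0x74, 0x03, 0x7f, 0x00, // "offset": const i32 global
  0x03, 0x02, 0x01, 0x00,                                     // one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x67, 0x65, 0x74, 0x00, 0x00,       // export "get" = func 0
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x23, 0x00, 0x0b,             // body: global.get 0; end
]);

// Compilation happens here, before any import value is known.
const compiledOnce = new WebAssembly.Module(moduleBytes);

// The same compiled module is instantiated with two different import values.
const instanceA = new WebAssembly.Instance(compiledOnce, { env: { offset: 8 } });
const instanceB = new WebAssembly.Instance(compiledOnce, { env: { offset: 16 } });

console.log(instanceA.exports.get(), instanceB.exports.get());
```

Because both instances come from one `WebAssembly.Module`, the value of `offset` has to be looked up per instance at run time rather than baked into the generated code.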

Some of the examples in #1354 can be modelled through (standardized) inlining of compile-time imported functions. But another upcoming example that's more specific to linking is field accesses. If we want to support better separate compilation (to WebAssembly), then many (object-oriented) languages will need some way to access public fields and methods of "imported" classes without needing to depend on the full contents of those classes. That's likely best achieved through some mechanism for importing the class's wasm type abstractly and importing the appropriate field references, i.e. the offsets within the abstract type at which the public fields can be found. One would expect those imported field offsets to be incorporated into immediates of memory-load/store assembly instructions, but in the current compilation model they would have to be first fetched from the module-instance data and then dynamically added to the address. I believe the common case here is much more likely to be that all module instances within the same program will use the same abstract type for the imported class and the same field references for its public fields, and so compile-time imports have a much better cost/benefit tradeoff and provide the sort of implementation behavior people would expect here.
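The field-offset point can be illustrated with a toy model. This is only a sketch: the heap, `compileShared`, and `compileSpecialized` are hypothetical stand-ins for what an engine would do in generated machine code, and the offsets are made up.

```javascript
// Toy heap holding one object whose public 32-bit field lives at base + offset.
const heap = new DataView(new ArrayBuffer(64));
const objectBase = 16;  // address of the object (made up)
const fieldOffset = 8;  // offset chosen by the exporting module (made up)
heap.setInt32(objectBase + fieldOffset, 1234, true);

// Instantiation-time import: the code is shared across instances, so each
// access first loads the offset from instance data, then adds it to the address.
function compileShared() {
  return (instanceData, base) =>
    heap.getInt32(base + instanceData.fieldOffset, true);
}

// Compile-time import: the offset is known while generating code, so it can
// be folded into the access, much like an immediate on a load instruction.
function compileSpecialized(offset) {
  return (base) => heap.getInt32(base + offset, true);
}

const sharedAccessor = compileShared();
const specializedAccessor = compileSpecialized(fieldOffset);

const viaInstanceData = sharedAccessor({ fieldOffset }, objectBase);
const viaImmediate = specializedAccessor(objectBase);
console.log(viaInstanceData, viaImmediate);
```

Both accessors read the same field; the difference is only where the offset comes from at the time of the access.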

Staged Compilation

The above requires at least some separation between compile-time and instantiation-time imports. Similarly I expect other aspects of many modules to naturally separate into distinct stages of compilation. Each stage could be validated, compiled, and instantiated without even parsing the subsequent stages. The first stage would import and export compile-time information to coordinate low-level concerns like representations of data as well as host-generated magic-number identifiers like RTTs (and exceptions/events?). After that coordination is complete, a second stage could be used to import the final results of that coordination and generate cross-instance (immutable?) globals and code. Maybe we could have some way to flag these stages as immutable so that they could be shared across threads without causing concerns about race conditions. A third stage could then have instantiation-time imports, specify globals that are intended to vary by module instance, and provide code that depends on these instance-specific components. (Note that only one stage can be instantiated multiple times without recompilation.)

Partial/Incremental Modules

Because staged compilation would let parts of modules be compiled in sequence, we can take things a step further and let each of these stages be shipped over the network incrementally. We could then consider modules sharing certain stages. For example, since many modules of an application will need to link to the same module instance providing the runtime and (minimized?) standard library of whatever the source language of the application was, they could all share the same import stage. That way the runtime/stdlib instance can be linked to that import once, and that effort shared across subsequent modules, rather than having to repeatedly link the runtime/stdlib for each module within the program. We would also be able to support JITing by representing JITed code as a (dynamically generated) stage for a module, say one that just specifies a function and exports that function as a reference that could then be stored in the relevant funcref variable/field. That is, a module can be "partial" in the sense that it (explicitly?) leaves the door open for more stages to be added later.

Wrapping Up

Hopefully that makes some sense. I'm happy to give more examples of the use cases I'm worried about and of the rough design strategy I have in mind. But I didn't give them at the moment in order to keep this shorter. Apologies if I abridged my thoughts too much.

tlively commented 4 years ago

The benefits of compile time imports are clear to me, but I'm having trouble wrapping my mind around how the different stages could be represented, both in the binary format and in embedder APIs. I can imagine providing compile-time imports via an import argument added to the compile function, but just that change seems much more limited than this partial/incremental module idea.

I guess one way to extend the current API to incremental instantiation would be to allow the instantiate call to supply fewer imports than the module requires, in which case it would return a PartialInstance that would need to be instantiated again. Only once all the imports have been satisfied would you actually get a normal Instance that exposes its exports. But this would only be useful if each intermediate stage performed additional codegen, taking the new imports into account, so it seems like these intermediate stages and their associated imports would have to be standardized. What would they be named? The JIT aspect seems different because it requires running user code before doing more compilation.
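A rough sketch of the PartialInstance idea, as plain JavaScript rather than a real engine API. `instantiatePartially` and the shape of the returned objects are hypothetical; nothing like this exists in the JS API today, and the sketch does no codegen between steps.

```javascript
// Hypothetical: bind imports incrementally until every required one is supplied.
function instantiatePartially(requiredImports, provided, alreadyBound = {}) {
  const bound = { ...alreadyBound, ...provided };
  const missing = requiredImports.filter((name) => !(name in bound));

  if (missing.length > 0) {
    // Not all imports are satisfied yet: no exports are exposed.
    return {
      kind: "PartialInstance",
      missing,
      instantiate: (more) => instantiatePartially(requiredImports, more, bound),
    };
  }

  // Everything is bound: expose exports like a normal instance would.
  return {
    kind: "Instance",
    exports: {
      sum: () => requiredImports.reduce((acc, name) => acc + bound[name], 0),
    },
  };
}

const partialStep = instantiatePartially(["a", "b"], { a: 1 });
const fullInstance = partialStep.instantiate({ b: 2 });
console.log(partialStep.kind, partialStep.missing, fullInstance.exports.sum());
```

As the comment notes, this shape is only worthwhile if each intermediate step lets the engine specialize code on the imports bound so far.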

Also, what could different modules sharing an import stage look like? How could that be coordinated? Clearly this would be very different from how imports and exports currently work.

kripken commented 4 years ago

Talking to @fgmccabe about a related idea to Compile-Time Imports, an "inline" property on a function could achieve similar things I think:

An imported function with "inline" on it will be inlined at link time. That is, when providing imports at link time, we would not just call out to an inlined import normally, but the VM would inline the import and optimize around it, as much as it can. This is not observable except for speed, so it is just a compilation hint, telling the VM where it makes sense to do some compilation work at link time.

IIUC the idea of Compile-Time Imports, this is sort of the reverse: in that idea some imports can be provided at compile time, while with "inline" some import calls are compiled at link time. But the result is the same, that we compile code for those imports.

Note that "inline" can be generalized to all functions:

A non-imported function with "inline" on it would be a function that the VM is encouraged to inline at compile time. Normally wasm VMs assume the toolchain has done this already. Telling the VM what to inline would increase compile time but potentially decrease binary size enough to justify that.

RossTate commented 4 years ago

The benefits of compile time imports are clear to me, but I'm having trouble wrapping my mind around how the different stages could be represented, both in the binary format and in embedder APIs.

Regarding binary format, something that could work would be for a multi-stage module to have definitive stage begin and end markers. The start of each stage would specify what kind of stage it is, e.g. multi-instantiable or singly-instantiable (per compilation). Each stage would have sections that are collectively self-contained, i.e. the code section could only refer to imports/globals/functions that are defined in the same stage (or earlier stages). We might have certain constructs be usable only within certain kinds of stages. For example, if a module has an abstract value type import, then the relevant code needs to be compiled differently each time that value type is instantiated differently (because, say, i32 is 4 bytes whereas f64 is 8 bytes). As such, it makes sense for value-type imports (if they were to be possible at all) to be permissible only in a singly-instantiable stage.

Regarding embedder APIs, something like compileStage could take the bytes of the stage's code, arguments for imports, and an instance of the previous stage (if any) and return either a Module or an Instance depending on what kind of stage it is.

Also, what could different modules sharing an import stage look like? How could that be coordinated?

We can very roughly view a program as a collection of modules that get compiled and linked together via instantiation. With staged compilation, we can view modules as a list of stages. Combining these two views, we could view a program as a collection of lists of stages, or we could view a program as a DAG of stages that get compiled and linked together via instantiation. Because compileStage takes a reference to an instance of its "parent" stage, multiple distinct stages could be compiled with the same instance of their parent stage.

(You might be wondering what the difference between stages and modules is at this point. Modules are linked by matching imports and exports. A stage builds upon its parent by extension, and in particular has access to all the components of its parent and uses (and extends) the same indexing schemes as its parent.)
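A sketch of how a `compileStage`-style API and shared parent stages might fit together. Everything here is hypothetical: stage "bytes" are replaced by plain objects, and functions are JavaScript closures standing in for compiled code. The point is only that a child stage extends its parent's index space instead of importing from it.

```javascript
// Hypothetical compileStage: a stage builds on its parent by extension.
function compileStage(stage, imports, parent = null) {
  // Start from the parent's function index space (empty for a root stage).
  const functions = parent ? parent.functions.slice() : [];
  for (const define of stage.functions) {
    // New functions get the next indices and may refer to earlier ones.
    functions.push(define({ imports, functions }));
  }
  return { kind: stage.kind, functions };
}

// Parent stage: a tiny "runtime" whose function 0 doubles its argument.
const runtimeStage = compileStage(
  { kind: "singly-instantiable", functions: [() => (x) => x * 2] },
  {}
);

// Two distinct child stages compiled against the same parent instance.
const clientA = compileStage(
  { kind: "multi-instantiable",
    functions: [({ functions }) => (x) => functions[0](x) + 1] },
  {},
  runtimeStage
);
const clientB = compileStage(
  { kind: "multi-instantiable",
    functions: [({ functions }) => (x) => functions[0](x) - 1] },
  {},
  runtimeStage
);

console.log(clientA.functions[1](5), clientB.functions[1](5));
```

Both children reach the runtime's function at index 0 directly, with no import/export matching between them and the parent.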

an "inline" property on a function could achieve similar things

Yes, and your making it explicit in the module (rather than guessed by the host) seems like a useful idea. Combining with an idea above, we could say that the inline property is only permitted on imported functions in singly-instantiable stages.

lukewagner commented 4 years ago

I also think staged compilation is a good idea. For a while I had a partial sketch worked out with @rossberg in the context of the Module Linking proposal, but then I removed it to scope things down and b/c I didn't have an immediate use case for it (I thought I did for a bit).

A rough sketch would be to add a new kind of module definition called a preimport whose arguments were supplied at a new semantic stage, let's call it module_preinstantiate(premodule, preimportArgs) : module, that was after module_decode (which would now produce a premodule instead of a module) and module_validate (which would now validate a premodule) and before module_instantiate (which, by taking a module, requires pre-instantiation to have already been performed).

From a JS API perspective, I was thinking perhaps:

WebAssembly.Module would still contain a module and thus continue to represent "code that has already been compiled (mostly)"
WebAssembly.compile(Streaming) would be given an additional, optional, preimportArgs argument and perform module_preinstantiate (after the existing module_decode and module_validate steps)
the one-shot WebAssembly.instantiate(Streaming) would perform decoding, validation, preinstantiation and instantiation, with both preinstantiation and instantiation pulling from the single importArgs
if we thought it would be beneficial, we could add a WebAssembly.Premodule and WebAssembly.precompile(), but I'm not sure it would be because, while it would allow factoring out the module_decode and module_validate steps, these are super-cheap compared to the work that now happens in module_preinstantiate and of course bytecode itself is already cacheable.
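The phase ordering described above can be sketched with mock functions. The phase names follow the comment; the bodies, the premodule shape, and `instantiateOneShot` are placeholders invented for illustration and do not correspond to any real engine or spec text.

```javascript
// Mock phases; a "premodule" here is just a record of declared (pre)imports.
function module_decode(bytes) {
  return { preimports: ["T"], imports: ["log"], bytes };
}

function module_validate(premodule) {
  return Array.isArray(premodule.preimports) && Array.isArray(premodule.imports);
}

function module_preinstantiate(premodule, preimportArgs) {
  for (const name of premodule.preimports) {
    if (!(name in preimportArgs)) throw new Error(`missing preimport ${name}`);
  }
  // This is where specialized code generation would happen.
  return { imports: premodule.imports, specializedFor: { T: preimportArgs.T } };
}

function module_instantiate(module, importArgs) {
  for (const name of module.imports) {
    if (!(name in importArgs)) throw new Error(`missing import ${name}`);
  }
  return { module, bound: { ...importArgs } };
}

// One-shot path: both preinstantiation and instantiation pull from one object.
function instantiateOneShot(bytes, importArgs) {
  const premodule = module_decode(bytes);
  if (!module_validate(premodule)) throw new Error("invalid premodule");
  const module = module_preinstantiate(premodule, importArgs);
  return module_instantiate(module, importArgs);
}

const oneShot = instantiateOneShot(new Uint8Array(0), { T: "i32", log: () => {} });
console.log(oneShot.module.specializedFor);
```

Splitting the two-argument path (`compile` with preimportArgs, then `instantiate`) would simply call the same four phases with separate argument objects.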

Considering caching, I think an engine could actually cache the compiled machine code of a preinstantiated module if the cache entry included the preimport arguments as keys in the cache (perhaps as a small LRU dictionary per source-bytecode-path to handle explosive cases). Having the preimports be semantically distinct from imports is useful for making the keys to this cache explicit.
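A minimal sketch of such a cache, assuming preimport arguments are stateless and JSON-serializable (for example type names or primitive values). `PreinstantiationCache` and its method are hypothetical names, and the `compile` callback stands in for the engine's code generation.

```javascript
// Small LRU dictionary per source path, keyed by the preimport arguments.
class PreinstantiationCache {
  constructor(capacityPerSource = 4) {
    this.capacity = capacityPerSource;
    this.bySource = new Map(); // sourcePath -> Map(key -> compiled code)
  }

  get(sourcePath, preimportArgs, compile) {
    let lru = this.bySource.get(sourcePath);
    if (!lru) {
      lru = new Map();
      this.bySource.set(sourcePath, lru);
    }
    // Sort entries so the key does not depend on property order.
    const entries = Object.entries(preimportArgs)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    const key = JSON.stringify(entries);

    if (lru.has(key)) {
      const hit = lru.get(key);
      lru.delete(key); // re-insert to mark as most recently used
      lru.set(key, hit);
      return hit;
    }

    const compiled = compile(preimportArgs);
    lru.set(key, compiled);
    if (lru.size > this.capacity) {
      // Map iterates in insertion order, so the first key is the oldest.
      lru.delete(lru.keys().next().value);
    }
    return compiled;
  }
}

let compileCount = 0;
const compileStub = (args) => ({ id: ++compileCount, args });
const codeCache = new PreinstantiationCache(2);
const firstLookup = codeCache.get("hashtable.wasm", { K: "i32", V: "f64" }, compileStub);
const secondLookup = codeCache.get("hashtable.wasm", { V: "f64", K: "i32" }, compileStub);
console.log(compileCount, firstLookup === secondLookup);
```

The serialization step is exactly where stateful preimports (functions, memories, tables) become awkward: there is no obvious stable key for them.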

Another reason for having preimports be distinct from imports is that preimports would be allowed in fundamentally more places than imports; in places that would normally be incompatible with the current compilation model. For example, a type import could be used as a direct field of a struct or array (as $T, not just a (ref $T)), which could address the GC performance concerns. In theory, a value import could even be allowed as the offset of a load or store.

One question is whether the set of preimport-able values is the same as importable values (i.e., externval). If "yes":

caching becomes less predictable b/c if the preimported value is a stateful entity like a function (closure), memory, table, global, etc, it's ambiguous what exactly the cache key is
module is today considered (cross-thread) shareable, allowing, e.g., postMessage() between web workers, but if preinstantiation can bind an un-shared object to a module, then module now needs a shared/un-shared distinction

Both are manageable and the upside is many more specialization opportunities, but it's at least worth considering the downsides and the alternative of limiting preimportable values to only those that are stateless (types, modules, primitive values).

kmiller68 commented 4 years ago

On the JS side how would this interact with ES module integration? Would users just not be able to use any compile time imports? That seems at least somewhat problematic, if we actually want WASM modules to integrate nicely with JS. Perhaps there's a simple answer though?

guybedford commented 4 years ago

On a somewhat (or maybe not) related note discussing the most flexible JS API for Wasm imports, I jestingly suggested to @kripken:

import module from './file.wasm';
const mod = await module.instantiate(importObj);
export function niceJSAPI () {}

as the most flexible approach to dealing with all the wiring concerns while we are still waiting for the interface types and module linking proposals and strangely he didn't consider it unreasonable in practical integration scenarios.

Just throwing out that I wonder if separating imports into two categories - dependency imports and runtime parameter imports, might then allow something like the above API in the JS integration as well for some sort of two-phase compilation integration into ESM, where the dependency imports would be handled by the esm integration, and the parameter imports would be specified by the user, their distinction marked in the binary.

In my mind a parameter import would be anything that could be considered to be stateful - memory imports or memory-bound imports would typically be parameters.

In the JS ecosystem we always treat true dependencies as executionally stateless as far as possible which is one of the difficult differences between the ESM and Wasm import models we have at the moment.

conrad-watt commented 4 years ago

@lukewagner, would this scheme also involve the definition of "module preexports"?

Maybe preimports could always be stateless things, but preinstantiation could additionally (optionally) be provided objects that fulfil some (instantiation-time) imports. The engine could either just treat this like a closure (i.e. compile with just the preimports and hold the other imported objects until full instantiation occurs) or choose to make compile-time optimisations based on some of the provided imports. Then the engine could control what its compiled (cached) code depends on and make informed trade-offs.

RossTate commented 4 years ago

These are all great thoughts! I'm worried, though, that I failed to focus the discussion by making it unclear what sort of problem I want to solve. So in an attempt to prevent the discussion going in too many directions at once, I'm going to talk a bit about what I'm not trying to solve here (even though it's closely related and completely made sense to bring up): I'm not trying to solve modules, e.g. how you declare imports and exports and how you match them up and so on. There are existing frameworks for that, such as existing wasm modules, ES modules, and the in-progress module-linking proposal, and my impression is that what I'm suggesting here is largely orthogonal to those specifics (though certainly not entirely).

My concern is more specifically about compilation. That is, modules themselves seem to be made up of components (which I've been calling stages) that can be compiled in pieces, only one of which (if any) needs to be able to have multiple instantiations. Furthermore, these stages are not just importing but also exporting and also generating, and I want to bring more attention to those aspects as well in hopes that they'll give a more well-rounded sense of the concerns and possible strategies I have in mind.

As an example of exporting, let's consider something like a module providing hash tables, and let's suppose we want this module to be parameterized by value types for keys and values. This module will have imports for its key type, value type, equality function, and hash function. It will have an export for its hashtable type, iterator type, and various hashtable and iteration functions. For the sake of this discussion, let's suppose its exported types are value types rather than reference types. With a compilation model that only lets you compile an entire module at a time, there is no way to get the exported types without first compiling all of the hashtable functions. But in a staged model, you can have the sections for importing, defining, and exporting the types in a first stage, and the sections for importing, defining, and exporting the functions in a second stage. If a program were to do this type/functionality-factoring across its modules, then it could quickly compile and link the first stages of each of its modules, and then compile the second stages of all of its modules in parallel.

(Note that this example hashtable module has no need for multiple instantiation. All the state is contained in the hashtable. I suspect this pattern will become more common, especially with the GC proposal where modules will no longer need linear memory to maintain state.)

As an example of generating, consider "magic numbers". For composability purposes, various proposals have features that enable modules to demand a magic number of sorts that is unique to them, e.g. events, call tags, private types, and generative RTTs. While there is definitely utility in having these numbers be unique to the module, it is not clear that it is so useful for them to be unique to each instance (in the sense of the current compilation model). In fact, it seems quite problematic for some of the applications of multi-instantiation. For example, if each thread is a different instance, and each instance uses a different private type, then the application cannot share private information across threads (although of course there are other reasons we can't do this right now on the web). So here we really want a singly-instantiable stage that first generates the magic numbers followed by a multi-instantiable stage that has thread-local state but uses the same magic numbers across threads. (The hashtable example above might also use a private type, whose magic number would be generated in its first stage so that it can be incorporated in the exported type for hashtables.)
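The magic-number concern can be sketched by letting a JavaScript `Symbol` stand in for a host-generated identifier (a tag, generative RTT, or private type). The two `compile…` functions are hypothetical models of the two compilation shapes, not real APIs.

```javascript
// Current model: the identifier is generated each time the module is instantiated.
function compilePerInstanceModel() {
  return {
    instantiate: () => ({ privateTag: Symbol("private"), threadLocal: { count: 0 } }),
  };
}

// Staged model: a singly-instantiable stage generates the identifier once,
// and a multi-instantiable stage reuses it while keeping per-instance state.
function compileStagedModel() {
  const privateTag = Symbol("private"); // generated in the first stage
  return {
    instantiate: () => ({ privateTag, threadLocal: { count: 0 } }),
  };
}

const perInstance = compilePerInstanceModel();
const threadOne = perInstance.instantiate();
const threadTwo = perInstance.instantiate();

const staged = compileStagedModel();
const stagedOne = staged.instantiate();
const stagedTwo = staged.instantiate();

console.log(threadOne.privateTag === threadTwo.privateTag); // identifiers differ
console.log(stagedOne.privateTag === stagedTwo.privateTag); // identifier shared
```

In the staged shape the instances still have distinct `threadLocal` state, but data tagged by one instance is recognizable by the other.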

Hopefully these examples illustrate how I think the concerns I'm raising here are largely orthogonal to the concerns that, say, the module-linking proposal is addressing. Apologies for not making that clearer in the original post.

RossTate commented 4 years ago

Hmm, maybe a concise summary of the above is that I see modules as units of composition and stages as units of compilation, and although the two concepts are connected I see utility for WebAssembly in distinguishing them.

conrad-watt commented 4 years ago

The examples you give above could be addressed in @lukewagner's setup if preinstantiate were an API that gives you a preinstance/premodule with preexports (which could either be just all the exports of stateless things, or declared in a distinguished way). Is your point that you want to support an arbitrary number of preinstantiation stages?

The conversation has gone in this direction because changing Wasm's compilation model is a big deal, and we need to work out at a high level that there are no show-stoppers (considering existing uses, webcompat etc.). I think everyone is on a similar page about the benefits of (at least) a two-stage compilation model.

RossTate commented 4 years ago

Is your point that you want to support an arbitrary number of preinstantiation stages?

My primary point is that modules are not compilation units (they are a collection of compilation units). My secondary point is that multi-instantiable-compilation-everywhere is not well aligned with a more fine-grained compilation model.

As an example, consider dynamically generated code. Conceptually, this seems like it would be straightforward: you generate an exported func and give it to the embedder to compile and get back a reference to the compiled exported function. But in WebAssembly, as #1369 points out, this is not so straightforward, and my sense is that the issue is fundamental to any system that treats modules as the smallest compilation unit.

If modules are compilation units, then to dynamically generate and compile a new function, you'll have to generate a module. The first thing that module will have to do is import everything it needs from the generating module. When you compile this module, you'll have to match all those imports with exports from the generating module. Each of these import/export matches will require a type-compatibility check and a copy, and will increase the size of the instance-specific data for the new function, so in dynamically generating this mini-module you might want to do some import-minimization work. Regardless of the specifics, at a high level the generation-and-compilation time might be dominated by optimizing/matching imports and exports. And after that the function has its own instance data (which is odd because we will never instantiate this dynamically generated module again), and every time it calls into the generating module (or vice versa) it will have to context-switch instance data.

If we have compilation units that are smaller than modules, like stages, then we can dynamically generate and compile the new function as an additional singly-instantiable stage to the generating module. It will use the same indexing scheme as the generating module and not need to do any importing or require the generating module to do any exporting. Because it's singly-instantiable, it does not need its own run-time instance data and can use the same instance data as the generating module, avoiding the need to context switch when calling into the generating module (or vice versa).
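A toy model of adding dynamically generated code as an extra stage. `instantiateGeneratingModule` and `addJitStage` are hypothetical names; closures stand in for compiled functions, and the shared `data` object stands in for instance data.

```javascript
// The generating module: one instance with its own data and function index space.
function instantiateGeneratingModule() {
  const instance = { data: { counter: 0 }, functions: [] };
  // Function 0 increments a counter held in the instance data.
  instance.functions.push(() => {
    instance.data.counter += 1;
    return instance.data.counter;
  });
  return instance;
}

// Hypothetical: append a singly-instantiable stage to an existing instance.
// The new code sees the same functions and data, with no imports or exports.
function addJitStage(instance, define) {
  const index = instance.functions.length;
  instance.functions.push(define(instance));
  return index;
}

const generator = instantiateGeneratingModule();

// "JITed" function: calls function 0 twice and adds the results.
const jitIndex = addJitStage(generator, (inst) => () => {
  const first = inst.functions[0]();
  const second = inst.functions[0]();
  return first + second;
});

const jitResult = generator.functions[jitIndex]();
console.log(jitIndex, jitResult, generator.data.counter);
```

The new function has no instance data of its own, so calls between it and the generating module never switch instance context.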

conrad-watt commented 4 years ago

If we have compilation units that are smaller than modules, like stages, then we can dynamically generate and compile the new function as an additional singly-instantiable stage to the generating module. It will use the same indexing scheme as the generating module and not need to do any importing or require the generating module to do any exporting. Because it's singly-instantiable, it does not need its own run-time instance data and can use the same instance data as the generating module, avoiding the need to context switch when calling into the generating module (or vice versa).

Oh, this is a much more invasive change than the examples you described above. I guess abstractly it's like, in addition to the premodule phase changes above, allowing additional (code) sections to be validated+compiled in the context of an existing module/instance? This feature could be built on top of @lukewagner's suggestion (which I think looks like a reasonable MVP sketch).

RossTate commented 4 years ago

Abstractly, I want us to explore how to give applications more control over compilation. With that control they could implement @lukewagner's suggestion themselves, or they could implement the various examples that I provided that are not served by that suggestion. (Also, I'm not sure @lukewagner meant his comment to be interpreted as a complete solution to the various issues I raised.)

I also want us to revisit the idea that compilation units always need to be instantiable multiple times. It has substantial costs, does not line up with the compilation models of other systems, and I expect will not match forthcoming applications (especially when GC comes along and removes the need for implicit mutable state in the heap).

conrad-watt commented 4 years ago

Could you walk through an example of a cost incurred by allowing multiple instantiation which isn't solved by compilation-time imports?

The examples you've brought up throughout this issue are excellent, but any solutions must be broken down into incremental changes to Wasm's existing model. A productive conversation was starting in this direction, so I was surprised that you expressed worry that the issue was going off-topic.

RossTate commented 4 years ago

@conrad-watt I did not mean to give the impression that I thought we had gone off topic. I foresaw a fork in the conversation coming up due to my failure to be clear on what problem I was trying to solve, and so I clarified that problem. I also gave examples to better illustrate my concerns, and am hoping @lukewagner will continue to share his thoughts, because he offered good ideas that I want to iterate on, but he has not yet had the opportunity to do so. In the meanwhile, I have been trying to illustrate to you why I do not believe the correct approach is to build layers on top of the current model, but rather to break the current model down into smaller parts. That is, I agree we need to consider how to develop the ideas as incremental changes, but there are multiple directions we can go in, and we should understand the problem better before we commit to a direction.

Could you walk through an example of a cost incurred by allowing multiple instantiation which isn't solved by compilation-time imports?

Rather than gets/sets of globals being to some fixed point in memory, they instead are a load/store from an offset from a dynamically determined address (the instance pointer).
Rather than a direct call to an imported function being simply a call to a dynamically determined address, you first have to fetch the instance pointer expected by that function from an offset of a dynamically determined address and then call the function. (This means decomposing a module into smaller modules that are separately instantiated incurs overhead in calls across the smaller units.)
Rather than funcref simply being a code pointer, it has to be a closure of an instance pointer and a code pointer (even though the vast majority of indirect calls are within the same module).
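These costs can be sketched in a toy model where shared compiled code takes the instance pointer as a hidden argument. All names here (`sharedCompiledCode`, `makeFuncref`, `callIndirect`) are hypothetical; real engines do this in machine code, not JavaScript.

```javascript
// Code compiled once and shared by every instance: it cannot know where its
// globals live, so it receives the instance pointer and indexes off it.
const sharedCompiledCode = {
  bump: (instance) => {
    instance.globals[0] += 1; // load/store at an offset from the instance pointer
    return instance.globals[0];
  },
};

function instantiateShared() {
  return { globals: [0] };
}

// A funcref must carry both the code pointer and the instance it belongs to.
function makeFuncref(instance, code) {
  return { instance, code };
}

// Every indirect call first fetches the instance pointer, then calls the code.
function callIndirect(funcref) {
  return funcref.code(funcref.instance);
}

const firstInstance = instantiateShared();
const secondInstance = instantiateShared();
const refToFirst = makeFuncref(firstInstance, sharedCompiledCode.bump);
const refToSecond = makeFuncref(secondInstance, sharedCompiledCode.bump);

callIndirect(refToFirst);
callIndirect(refToFirst);
callIndirect(refToSecond);
console.log(firstInstance.globals[0], secondInstance.globals[0]);
```

With a singly-instantiable compilation unit, `bump` could instead address its global directly and a funcref could be just the code pointer.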

It's also unclear how the current compilation model plans to support magic numbers. If a module is allowed to demand a magic number, and that number needs to be different for each instance, various features will have to represent identifiers as pairs of magic numbers and instance pointers in order to be able to distinguish between instances.

lukewagner commented 4 years ago

@conrad-watt Yes, I forgot to mention them, but preexports were also intended to complete the feature set.

@kmiller68 I think ESM-integration could be made to work by introducing a new intermediate phase in which wasm preimports were resolved bottom-up by wasm preexports in the ESM module graph (the limitation being that a JS module couldn't supply a preexport to a wasm module because it's before any JS top-level scripts have run). Maybe there's a different way, though.

@guybedford Hah, yeah, I've imagined that being useful too to handle more complicated scenarios without falling back to instantiateStreaming(fetch()). Maybe this could be achieved via a separate import attribute for importing an instance (the default) vs. a WebAssembly.Module?

@RossTate I'll start with the fine-grained compilation issue in #1369. It has definitely been my (and I think others') assumption from the beginning that modules are far too coarse-grained for JIT-style compilation and, to serve those use cases well (which I agree we eventually must!), we'll need a new, smaller unit of compilation. But I think this is going to be a quite separate solution from the other problems you've outlined since, to do proper inline caching (IC) techniques, you need to be able to generate tiny bits of code and dynamically wire them up (with either machine-code-patched-direct- or data-patched-indirect-jumps, not calls) to execute in the context of a function activation. I've brainstormed a design for how this might work in a portable/safe manner, but it is out of scope of this issue; the tl;dr is that I think it requires a very different mechanism entirely oriented toward JIT compilation.

For the hash table use case, that seems like a thing that is squarely in the intended use cases of what I described: the hash table module would preimport the key/value types and preexport the table type and thus compilation of the hash table would happen separately for each distinct client key/value pair and clients would be compiled with full static knowledge of the hash table type. I also left this out in my original comment (it was long), but, when combining staged compilation with module linking, a module could also preimport a premodule, preinstantiate it and then alias the resulting preexports; this is how the client module would use the hash table module.

For the magic values, this could, I think, be achieved by allowing {tag, rtt, event} definitions to be declared to be created at preinstantiation-time, rather than at instantiation-time, so that they could be available for preexport or passing to preinstantiate.

Lastly, I appreciate the performance advantages you listed about single-instance. One realization I had is that the property that you want isn't that a module isn't instantiated multiple times but, rather, one of the two following properties:

1. a given machine code PC is statically associated with a single instance (hence instance state can be accessed as static data)
2. the set of instances in a single store is statically known at compile time (hence all instance state can be merged into one array pointed to by one register (or gs.base...) that is invariant across cross-instance calls and set when entering the store from the host)

Property 1 can be ensured in wasm today either by code duplication (which, incidentally, is what my initial asm.js impl did) or via host-/implementation-specific circumstances that ensure a given wasm module is only loaded into a process once. But I don't see how this situation would be helped at the wasm spec level given that, in general, it's a host-specific choice for how wasm modules are loaded into processes and hosts are going to vary significantly in this regard.

For Property 2, Module Linking gets us most of the way there with its declarative instantiation DAG, with the hole being that, in a JS embedding today, or with dynamic instantiation APIs in the future, function references to instances outside the static DAG can become dynamically reachable. In some situations, such possibilities could be statically ruled out based on the signatures of imports/exports, but to handle the general case, I believe thunks can be used so that the instance-state-array is only changed when crossing the boundary.

RossTate commented 4 years ago

Thanks @lukewagner for that careful formulation of your thoughts! It is very useful.

But I don't see how this situation would be helped at the wasm spec level given that, in general, it's a host-specific choice for how wasm modules are loaded into processes and hosts are going to vary significantly in this regard.

I see the overarching problem as being at the intersection of wasm spec and embedding spec. The key thing is that we have an embedding (the primary web embedding) with applications and functionality designed around the expectation that instantiation is cheap. For example, my understanding is that the plan for threads is to use instance state as thread-local state, and so creating new threads corresponds to creating new instances, making it unreasonable to expect each instance to be compiled separately. This expectation in turn imposes constraints on the wasm spec, say by restricting what can be imported and how imports can be treated in the JS API for instantiate.

Yes, I forget to mention them, but preexports were also intended to complete the feature set. ... For the hash table use case, that seems like a thing that is squarely in the intended use cases of what I described

Sweet. For this you'll want some way to clearly separate the "pre" stage from the rest of the module. You'll want this both so that you can do the "pre" stage quickly (i.e. not have to parse a bunch of the module just to find out there are no more preexports) and because the "pre" stage can have different kinds of content (like value-type imports) because it's not expected to have cheap reinstantiation with different imports.

For the magic values, this could, I think, be achieved by allowing {tag, rtt, event} definitions to be declared to be created at preinstantiation-time, rather than at instantiation-time, so that they could be available for preexport or passing to preinstantiate.

Agreed. They should only be in sections that are expected to be recompiled per instantiation.

But I think this is going to be a quite separate solution from the other problems you've outlined since, to do proper inline caching (IC) techniques, you need to be able to generate tiny bits of code and dynamically wire them up (with either machine-code-patched-direct- or data-patched-indirect-jumps, not calls) to execute in the context of a function activation.

So I agree that inline caching likely requires even more mechanisms, but dynamic compilation does not mean inline caching. #1369 doesn't mention inline caching. A number of systems dynamically generate simple functions for a variety of reasons. Julia, for example, dynamically generates code as multimethods are dynamically updated in order to optimize code for the current overloadings of various multimethods. That functionality is critical to Julia's performance and makes no use of inline caching. Other languages dynamically generate code due to macros, eval, and so on. A number of these use cases also need to be able to dynamically generate other things like magic numbers (especially RTTs), not just functions.

the set of instances in a single store is statically known at compile time (hence all instance state can be merged into one array pointed to by one register (or gs.base...) that is invariant across cross-instance calls and set when entering the store from the host)

I find this unrealistic to expect to continue to hold, both because a known way for a large website to get faster load times is to fetch code on demand (or in the background), and because systems dynamically generate code (including RTTs and globals). If module (de)composition is a guiding design principle, then we should have a similarly (de)composable compilation+instantiation model. The problem is that you cannot compose instance states together after they have been created, so the current model is not composable, and we make up for that lack of composition through indirections incurring overhead.

My sense is that we have many similar thoughts on the topic. We both believe there's a need for compilation before and after what the current model provides, and we both believe that some constructs are only appropriate at certain stages. Where our beliefs seem to currently differ is in the role of the reinstantiate-without-recompile stage, and possibly in how much instance information can be expected to be known statically. My thought was that many modules do not need such a stage. That then led me to thinking that a module is in general a sequence of stages, one of which might support reinstantiate-without-recompile, and then that made me wonder what we could support if we treated stages as units that can be separated into distinct "files" and linked and composed directly.

RossTate commented 4 years ago

Random side thought: if we have a notion of different kinds of stages (e.g. whether or not it can be reinstantiated without recompilation) with some differences in their constructs, then maybe we could have kinds of stages that are embedder-specific. For example, maybe an Interface-Types adapter could be a special compilation stage? Or maybe we could have a JS-embedder-only compilation stage that decorates RTTs with JS fields/methods? Like dynamically-generated code, these use cases benefit from (and make sense to have) access to the internals of the module (or the "main" stages of the module) while at the same time should be separate from the "main" stages of the module (i.e. separation of platform-independent code and platform-specific/adaptive code).

lukewagner commented 4 years ago

@RossTate Thanks, lots of good thoughts.

For example, my understanding is that the plan for threads is to use instance state as thread-local state, and so creating new threads corresponds to creating new instances

That's the (somewhat embarrassing) current hack we use for web workers, but it isn't the plan for first-class (pure wasm) threads. The only place I recall more-recent thinking being (partially) written up is threads/#138, but the basic idea is that a fork executes a function within an existing instance (using validation to ensure that only shared memories, tables, etc, can be accessed from multiple threads). With this direction, once instances are shared, globals will no longer be thread-local and thus my expectation has been that we'll need to add a new kind of definition for thread-locals; but that's a whole separate discussion :)

So I agree that inline caching likely requires even more mechanisms, but dynamic compilation does not mean inline caching.

Agreed; I was only mentioning inline caching as an extremal (and also, very important) use case that requires a wholly new mechanism focused on JITing. However, thunks/stubs and inline patching wouldn't be the only part of such a feature; the ability to introduce new functions and other definitions would be necessary too, and in this context, I agree that caching and repeated instantiation aren't necessary and thus we'd leverage that fact to optimize for pure JITing performance.

I find this unrealistic to expect to continue to hold, both because a known way for a large website to get faster load times is to fetch code on demand (or in the background)

Yes, I've spent some time talking to web folks about these kinds of use cases. What's important is that this lazy fetching of modules shouldn't attempt to be done transparently (like on-first-call); rather, it needs to be done very explicitly so that the eagerly-loaded code can asynchronously wait (i.e., not block, remain responsive, handle failure) for the fetch+compile of the lazily-loaded code to be complete. What this all means is that the lazily-loaded code must be connected using first-class function references (as with dlopen), not module linking. Thus, each fetchable unit (sub-DAG of instances) could use module linking and thus be fused together and the thunking technique I mentioned (for Property 2 in my last comment) could handle the transitions between them.
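The explicit, non-transparent loading described here can be sketched with Python's `asyncio` as a stand-in for the host's asynchronous machinery. This is only an analogy under stated assumptions: `fetch_and_compile`, `App` and the `render` export are hypothetical names, and the "lazy module" is just a dict of functions. The eagerly-loaded code holds a first-class function reference that starts out empty, stays responsive while the load is pending, and handles failure.

```python
import asyncio


async def fetch_and_compile(name, fail=False):
    """Hypothetical stand-in for fetching and compiling a lazily-loaded unit."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"could not load {name}")
    return {"render": lambda x: f"{name}:{x}"}


class App:
    """The eagerly-loaded code."""

    def __init__(self):
        self.render = None  # first-class function reference, filled in later
        self.status = "loading"

    async def load_feature(self, name, fail=False):
        try:
            exports = await fetch_and_compile(name, fail)
        except RuntimeError:
            self.status = "failed"  # failure is handled, not hidden
            return
        self.render = exports["render"]
        self.status = "ready"

    def handle(self, x):
        # Never blocks on the lazy code: falls back until the reference is set.
        if self.render is None:
            return f"placeholder:{x}"
        return self.render(x)


async def main():
    app = App()
    task = asyncio.create_task(app.load_feature("editor"))
    early = app.handle(1)  # runs before the load has completed
    await task
    late = app.handle(2)
    return early, late, app.status


result = asyncio.run(main())
```

Nothing here resembles on-first-call loading: the caller decides when to start the load and what to do until it finishes.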

My thought was that many modules do not need such a stage.

In a number of the emerging wasm-outside-the-browser host environments, I think multiple instances of the same code is actually the norm (and a major perf benefit over pure JITing environments which would duplicate machine code on each task). Even in browsers, with code caching (both in-memory and persistent), I think you end up wanting cheap repeated instantiation. And, more generally, while what's popular on the web today is wasm modules being these big monolithic applications, I think a more futuristic and attractive ecosystem is where an application is composed of many tiny wasm modules in the Unix-small-process style (but this time with types and much faster "IPC"), and in this style, you definitely have repeated instantiation (both at load time and run time). So "cheap instantiation" is a very important property to preserve in our design thinking, I believe.

maybe we could have kinds of stages that are embedder-specific

In some sense, we already do, if you consider ESM-integration and the JS API :) My hope is that core wasm could avoid host-dependent stages but rather just expose a minimal set of embedding APIs that each host's embedding spec can then plug into as appropriate.

guybedford commented 4 years ago

@guybedford Hah, yeah, I've imagined that being useful too to handle more complicated scenarios without falling back to instantiateStreaming(fetch()). Maybe this could be achieved via a separate import attribute for importing an instance (the default) vs. a WebAssembly.Module?

My suggestion was explicitly such a layered construct for treating the ESM imports as the first phase compilation only, and thus to always have the ES module representation of a Wasm module be that higher-order instantiation interface that is either an instance factory or a higher level instance interface of some kind. And for the default ESM-linked imports to always be the preimports / first phase only.

I'm bringing all this up because I believe the ESM integration should be treated as a first class embedding from a feature-completeness perspective, and as such directly relates to a lot of the points being discussed here.

It does seem almost silly yes, but I'm tending to think this type of layering could be more useful, since it also naturally supports passing to workers (despite your referring to this approach as a hack) etc. Other instance-level functions for changing imports might even be useful. It might even give more flexibility with proper dynamic linking scenarios when used with dynamic import too.

guybedford commented 4 years ago

Thinking about the above ESM integration idea further, of course if any imports were attached to such a partial Wasm import it's no longer transferrable to workers so it isn't an uninstantiated Module import you are getting, but rather either an instance factory, or a partially bound single instance. Either of those could fit the bill for what I'm proposing in the ESM integration embedding but I do think these types of embedder ecosystem concerns affect these boundary discussions, even if it may seem somewhat tangential to the main thread here.

lukewagner commented 4 years ago

@guybedford That's an interesting idea regarding switching of the default meaning of ESM-integration; although probably better to discuss separately in ESM-integration. Regarding the transferability of bound modules, that's what I was asking about in my final question in my first comment. I think it could be resolved by having, similar to the shared vs non-shared memories where only shared memories are transferrable, shared vs non-shared modules (with shared being the backwards-compatibility-required interpretation of what we have today) and then preinstantiating a premodule with a JS value would necessarily produce a non-shared module.

rossberg commented 4 years ago

In the most general version of what @lukewagner describes, there would be a pre-version of every form of module-level declaration: imports, exports, definitions. Inside a module, the only observable difference between pre and proper declarations is that the former can be referenced in more places:

Pre declarations can only refer to other pre declarations, not to proper ones.
OTOH, pre declarations (specifically, types) may be referred to in some additional contexts inside proper ones where proper ones are not allowed.
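The first rule above can be read as a simple validation check. The sketch below is a toy Python model with hypothetical names (`Decl`, `validate`), not part of any proposal text: each declaration is tagged as pre or proper, and a pre declaration that refers to a proper one is reported as an error.

```python
from dataclasses import dataclass, field


@dataclass
class Decl:
    """Hypothetical module-level declaration (import, export, or definition)."""
    name: str
    pre: bool
    refs: list = field(default_factory=list)


def validate(decls):
    """Return one error per pre declaration that refers to a proper declaration."""
    by_name = {d.name: d for d in decls}
    errors = []
    for d in decls:
        if not d.pre:
            continue  # proper declarations may refer to either kind
        for ref in d.refs:
            if not by_name[ref].pre:
                errors.append(f"pre declaration {d.name} refers to proper declaration {ref}")
    return errors


ok = [
    Decl("KeyType", pre=True),
    Decl("TableType", pre=True, refs=["KeyType"]),
    Decl("lookup", pre=False, refs=["TableType", "KeyType"]),
]
bad = ok + [Decl("Leaky", pre=True, refs=["lookup"])]

ok_errors = validate(ok)
bad_errors = validate(bad)
```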

Outside a module, pre imports need to be provided at compilation time, and pre exports can be accessed after that; proper declarations need to be provided at instantiation time, and proper exports are accessible after that. This implies the new notion of a pre module, which is validated but not yet compiled.

Because compilation and instantiation are the only meaningful stages in the Wasm pipeline, there isn't much use in adding more stages to Wasm's core model. Any more fine-grained notion of import staging can be modelled externally by bundling a (pre) module with a set of resolved imports. We could even abstract that in the embedding API and add a notion of a module_bind meta function that implements a form of partial application of imports, similar to func.bind.
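One way to picture the bundling idea is as partial application over a set of named imports. The Python sketch below is an assumption-laden model, not a proposed API: `BoundModule`, `bind`, `missing` and `instantiate` are hypothetical names, and "instantiation" just returns the fully resolved import map.

```python
class BoundModule:
    """A (pre) module bundled with the imports resolved so far (hypothetical)."""

    def __init__(self, required, resolved=None):
        self.required = frozenset(required)
        self.resolved = dict(resolved or {})

    def bind(self, **imports):
        """Partially apply imports, returning a new bundle (like func.bind)."""
        unknown = set(imports) - self.required
        if unknown:
            raise KeyError(f"not an import of this module: {sorted(unknown)}")
        rebound = set(imports) & set(self.resolved)
        if rebound:
            raise ValueError(f"import already bound: {sorted(rebound)}")
        return BoundModule(self.required, {**self.resolved, **imports})

    def missing(self):
        return self.required - set(self.resolved)

    def instantiate(self, **imports):
        final = self.bind(**imports)
        if final.missing():
            raise TypeError(f"unresolved imports: {sorted(final.missing())}")
        return final.resolved


module = BoundModule({"key_type", "log", "memory"})
stage1 = module.bind(key_type="i32")        # e.g. resolved at compile time
stage2 = stage1.bind(log="console.log")     # resolved at some later stage
instance = stage2.instantiate(memory="mem0")
```

Any number of external "stages" fall out of repeated `bind` calls, without the core model knowing about more than compilation and instantiation.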

RossTate commented 4 years ago

@lukewagner and I had an offline discussion to clarify a number of the ideas raised. I am going to attempt to summarize that discussion for the group, but it was two days ago, it was a long discussion, and it meandered a lot, so apologies to @lukewagner for any inaccuracies in the summary!

One complicating factor is multiple tabs of the same page. At first this seemed to us to necessitate heap-based multiple instantiation so that the applications cannot interfere. But then we realized that they cannot share data or stacks since tabs cannot share, so they can share the same "magic numbers" without interfering there. If tabs are implemented through separate processes then they can use the same memory addresses for globals and the like without interfering as well. But we do not want to require that approach either. So caching compiled binaries for reuse across multiple tabs is a complicating factor, one that can be addressed through heap-based multiple instantiation, but one that does not necessarily require that solution and has options that neither of us had considered.

My understanding had been that the JS API was meant to prescribe a very specific compilation model, one in which compilation was supposed to be expensive and instantiation was supposed to be cheap. But @lukewagner (I believe) was of the mindset that browsers were expected to adapt their compilation model as they adapt to workloads, so that if multiple instantiation (within a tab) were to become infrequent then it could be made more expensive, e.g. duplicating and patching compiled code. This would address my concern of having the cost of instance-pointer indirection everywhere with no one actually using the benefit of the indirection.

We agreed that dynamic extension, whether due to fetching code on demand or due to generating code, could be expressed in the current model (with a "pre" stage). I think we both agreed that there was reason to believe that the current import/export model could be too inefficient for various patterns. @lukewagner indicated that his preference would be to wait to see data on that (with good reasons), and we felt that it wasn't the case that not having the feature would set the ecosystem on the wrong path, so I agreed with that planning strategy. So this is community feedback we should keep our ears open for. Depending on what we hear, we might want to add a way to "share" an import section across multiple (fetched-on-demand) modules, and/or add a "post" stage for easily adding new (dynamically-generated) code to a module after instantiation, or do nothing at all if everything turns out to work well enough.

We had some discussion of magic numbers and whether they must be shared across multiple instances or whether each instance should be able to have its own magic number. My sense is that there's value in having a discussion on this more broadly if only to bring awareness of the subtle issues to the CG more broadly. It's also detailed and hard to summarize 😅

Lastly, we agreed that there's a need for a "pre" stage, and that a "pre" stage is sufficient for the most pressing of those needs (with the understanding that the "pre" stage needs to be compilable without parsing the rest of the module for quick coordination of pre-stages). So that is likely the key action-item takeaway from our discussion, and possibly this whole issue. (I have since thought of a good use for a pre-pre stage, but that use is for much more fine-grained modules/compilation-units than what I think we currently need.)

I'm probably forgetting important topics we covered, as well as incorrectly portraying some of @lukewagner's sentiments despite my efforts. @lukewagner, please let me know of any corrections I should make!

titzer commented 4 years ago

For more context here:

V8's compilation model completely changed in the 3+ years from MVP to now. Originally it did full specialization of code to even the memory start/size of memory, with code patching of those immediates at instantiation time and memory.grow time. We decided to move fully away from code patching for all of V8 for security reasons, and because code patching is a complex and expensive operation. Not completely orthogonal, but tiering strategies completely changed over that time in almost all engines. Caching is obviously more effective if the code doesn't need to be patched.

I was very very surprised that going from having the memory base and length as a constant burned into all code to that being an indirection away as a field in a pervasive instance reference was less than 1% performance loss. (Of course, that was even before V8 was able to ship memory protection with trap handlers due to browser integration issues).
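The two strategies contrasted here can be caricatured in a few lines of Python. This is only a conceptual sketch with hypothetical names (`compile_specialized`, `compile_shared`, `grow`), and it says nothing about V8's actual implementation: in one case the memory is burned into the generated code, so growing the memory requires repatching; in the other the code reaches memory through the instance, so one compiled function serves every instance.

```python
class Instance:
    def __init__(self, memory):
        self.memory = memory


def grow(instance, extra):
    # Models a memory.grow that relocates the backing store.
    instance.memory = instance.memory + [0] * extra


def compile_specialized(memory):
    """Memory is baked into the code: per-instance, and stale after a grow."""
    def load(addr):
        return memory[addr]
    return load


def compile_shared():
    """Memory is one indirection away, through the instance reference."""
    def load(instance, addr):
        return instance.memory[addr]
    return load


inst = Instance([7, 8])
other = Instance([1, 2])

specialized = compile_specialized(inst.memory)
shared = compile_shared()  # compiled once, used by both instances

grow(inst, 2)
inst.memory[0] = 99

stale = specialized(0)                          # still reads the old backing store
fresh = shared(inst, 0)                         # follows the instance reference
repatched = compile_specialized(inst.memory)(0) # "code patching" after the grow
from_other = shared(other, 1)
```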

Globals have been, and probably always will be, an indirection or two away from the instance reference, at least in V8. I would actually be very surprised if you could ever measure a performance difference by turning them into statically-known addresses. In fact, in one possible concurrency future, globals become thread-local, and thus cannot have statically-known addresses.

I would expect that the sharing advantages of having on-disk-cacheable map-readonly compiled code will outweigh all other concerns for the foreseeable future, even though tiering is more complicated than it was.

I generally like the idea of getting reliable inlining of imports bound at some stage before instantiation, and before code caching to disk happens, as this is fairly critical for what I am doing right now. However, inlining as an engine-guaranteed optimization is a complex issue because it's not just the overhead of calls that makes this a powerful optimization, but the entire set of optimizations that compilers do after inlining, like CSE, check elimination, constant folding, etc. Results here are clearly going to vary depending on the design of the engine's top tier compiler.

However I mostly perceive this as an embedding API issue. Improving the power of the import mechanism is related but not the same issue as staging, IMO.

RossTate commented 4 years ago

Thanks for the useful history and thoughts, @titzer!