Open keithw opened 2 years ago
Regarding making the names prettier I think that can be considered separately. The problem with naive approach of using underscore as a separator is that underscore is valid as part of both module and field names and we need some way to distinguish these "mod_foo" "bar" from "mod" "foo_bar". But lets discuss that on a separate issue maybe?
I would like to suggest an alternative linking method that doesn't rely on the C static linker to resolve names at all. How about this:
e.g. Each module would require a struct to be passed to its instantaite
function:
struct Z_mymod_imports_t {
memory_t ...
table_t ..
func1_t ..
etc
}
struct Z_mymod_t {
Z_mymod_imports_t imports;
}
Z_mymod_instantiate(Z_mymod_imports_t* imports);
Then to call an import the codegen would do instance->imports.myfunc(..)
.
Ha, I think @yhdengh was here at one point in working on #1814. Here's why I think we hesitated:
If a module imports multiple things from the same modname, we wanted to enforce that the imports come from the same instance of that modname. E.g. if the importing module looks like this:
(module
(import "box" "set_width" (func (param i32)))
(import "box" "the_width" (global i32))
...)
... we don't think the embedder should have the freedom to give it a "set_width" function from one instance of the "box" module, and then a "the_width" value from a different instance of the same module.
If a module imports something with a given modname and entity name, we wanted to enforce that the entity name really corresponds to the export entity name from the exporting module. E.g. if the importing module looks like this:
(module
(import "exporter" "credit_account" (func))
(import "exporter" "debit_account" (func))
(import "exporter" "blow_up_the_world" (func))
...)
... we think the embedder shouldn't have the freedom to rearrange these same-typed functions -- they should have to have been exported under the corresponding names ("credit_account", etc.) by the exporting module.
This led us to the #1814 approach, where the "instantiate" function doesn't take every import individually -- instead, it takes a list of module instance pointers, one for each unique modname that is imported-from, and then the instantiate function goes and gets the correctly named export from each imported-from module instance.
The question then is how module A's "instantiate" function actually gets the export from a module instance of type B. Currently (and also post-#1814), module A makes an extern declaration for the import, and the linker is matching that up with the export defined by module B. My straw-man proposal was basically, stop having the importing module make its own declaration; have it actually #include
a declaration generated by the exporting module. Which seems safer, especially if backed up by a bunch of static asserts for things like memory limits. But I can't think of a way to enforce the above properties (efficiently) without involving the static linker somehow...
Ha, I think @yhdengh was here at one point in working on #1814. Here's why I think we hesitated:
- If a module imports multiple things from the same modname, we wanted to enforce that the imports come from the same instance of that modname. E.g. if the importing module looks like this:
(module (import "box" "set_width" (func (param i32))) (import "box" "the_width" (global i32)) ...)
... we don't think the embedder should have the freedom to give it a "set_width" function from one instance of the "box" module, and then a "the_width" value from a different instance of the same module.
I think embedder absolutely has this power. Import names are just strings and imported functions are just functions, there is nothing the forces the embedder to provide imported functions that are linked to any wasm instance at all. For example in emscripten we import everything from the env
module. Each imported function is a standalone JS functions and not associated with a module at all.
When you import foo
from bar
there is no guarantee that a module called bar
exists or that it is instantiated. it could just mean that the embedder needs to provide a function. As the embedder I could use a function called baz
to satisfy that dependency, its really up to me.
- If a module imports something with a given modname and entity name, we wanted to enforce that the entity name really corresponds to the export entity name from the exporting module. E.g. if the importing module looks like this:
(module (import "exporter" "credit_account" (func)) (import "exporter" "debit_account" (func)) (import "exporter" "blow_up_the_world" (func)) ...)
... we think the embedder shouldn't have the freedom to rearrange these same-typed functions -- they should have to have been exported under the corresponding names ("credit_account", etc.) by the exporting module.
This led us to the #1814 approach, where the "instantiate" function doesn't take every import individually -- instead, it takes a list of module instance pointers, one for each unique modname that is imported-from, and then the instantiate function goes and gets the correctly named export from each imported-from module instance.
Take a look at the wasm C API: https://github.com/WebAssembly/wasm-c-api
My understanding is that embedder has the freedom to provide arbitrary functions when instantiating a module.
The question then is how module A's "instantiate" function actually gets the export from a module instance of type B. Currently (and also post-#1814), module A makes an extern declaration for the import, and the linker is matching that up with the export defined by module B. My straw-man proposal was basically, stop having the importing module make its own declaration; have it actually
#include
a declaration generated by the exporting module. Which seems safer, especially if backed up by a bunch of static asserts for things like memory limits. But I can't think of a way to enforce the above properties (efficiently) without involving the static linker somehow...
In my proposal, when the imports are a struct of typed function pointers, wouldn't the embedder get a compile error if that try to assign a function with the wrong type to given import field. That is assuming that the C type system is has a unique type of each wasm type, which I think is the case, at l least today.
I think embedder absolutely has this power. Import names are just strings and imported functions are just functions, there is nothing the forces the embedder to provide imported functions that are linked to any wasm instance at all. For example in emscripten we import everything from the
env
module. Each imported function is a standalone JS functions and not associated with a module at all.
Fair enough -- maybe I should have written that I don't think we have to make it easy for the embedder to link modules in this fashion, where a module that imports an API from a particular "module name" ends up with some functions calling into one instance of that module and the rest of the functions (or globals/memories/tables) accessing another instance, leading to unexpected/inconsistent behavior.
But if the embedder really wants to claim this power, fine with us. I do think this is possible in the #1814 approach (please see below).
Of course we agree that the embedder should be able to expose arbitrary host functions to wasm2c-generated modules -- in this case the "module instance" parameter is basically an opaque pointer that can mean whatever the host wants it to mean. It could just be an empty type that grants access to the host functions, but if the host puts the per-instance context inside the pointed-to structure, then it can run multiple instances of the host API concurrently. (We give an example of this in the #1814 README: https://github.com/fixpointOS/wabt/blob/module_instancing/wasm2c/README.md#instantiate-multiple-instances-of-a-module)
Take a look at the wasm C API: https://github.com/WebAssembly/wasm-c-api
My understanding is that embedder has the freedom to provide arbitrary functions when instantiating a module.
Sure, I probably shouldn't have been so definitive in my language, but if module A exports functions named "credit" and "debit", and module B imports functions named "credit" and "debit" from modname "A", I don't think we need to make it as easy for the embedder to swap these when linking the modules together as it would be to link them under matching names.
I don't object to the embedder having this freedom if it really wants to, and I think there is a way to do this even in the #1814 model which I outline below. I think it would probably be possible to implement (some of) the wasm C API on top of that.
In my proposal, when the imports are a struct of typed function pointers, wouldn't the embedder get a compile error if that try to assign a function with the wrong type to given import field. That is assuming that the C type system is has a unique type of each wasm type, which I think is the case, at l least today.
Yes, it would get a compiler error (plus probably some additional checks for the parts of the Import Subtyping rules related to memory and table min/max sizes).
Let me try to tease apart what I see as a few separate issues in the next comment.
I think there are a few semi-related questions here that maybe we're circling around. Let me try to tease some of them apart... 1 and 2 below are basically about "what gets stored in the imports structure" (and in the instance). 3 and 4 are about "what are the parameters to the initialize
function."
"Should memory, global, and table imports be stored in a modname_imports_t
structure?"
No objection here. After instancing, that structure would become a member of the modname_module_instance_t
structure.
"What about function imports? Should they be stored as function pointers in the [module_prefix]_imports_t
structure?"
With the current code, sure, no objection -- it's basically the same as today but not in the global namespace. But keeping this post-instancing would have some consequences in terms of performance and aesthetics/simplicity/space:
In terms of performance, there seem to be some pretty severe penalties from forcing every call to an imported function to go through a function pointer, once that function pointer is itself stored in a stack- or heap-allocated structure (like a module instance).
Here is a microbenchmark with a tight inner loop around two imported functions (https://github.com/keithw/wasm2c-ptr-perftest/blob/master/main_source.c#L14-L22). I ran this on a Google Cloud c2-standard-60 (Intel Cascade Lake, "Intel(R) Xeon(R) CPU @ 3.10GHz") with Ubuntu 22.04, with clang 14 (supplied by Ubuntu) as the x86-64 compiler, and clang tip-of-tree (as of today) as the Wasm compiler. All results below were with an LTO build.
Reproducing scripts are here: https://github.com/keithw/wasm2c-ptr-perftest
Results:
#1814
: 4.458 seconds#1814
, output modified so imported functions are function pointers stored in instance: 9.730 secondsmitigations=off
to turn off Linux's Meltdown/Spectre mitigations, and it didn't make a discernible difference. I saw a similar ~2x slowdown on a physical Intel server we have at Stanford, but the slowdown was only around 35-40% on two AMD machines I tried.)To summarize:
#1814
also has no performance penalty on this microbenchmark (indeed, it's a 5% speedup for reasons I haven't dived into).In terms of aesthetics/simplicity/space usage: in an instanced world, a function import is an import of a function instance (a "closure of the original function over the runtime module instance of its originating module," per https://webassembly.github.io/spec/core/exec/runtime.html#function-instances). So if we imagine a bunch of programs that each import the WASI API of 45 functions:
That feels like a fair bit of unnecessary and duplicated state to load down every instance structure with.
"In #1814
, the initialize
function takes (a) the instance being initialized, and then (b) a list of module instance pointers (one per unique imported modname). The initialize
function goes and finds the corresponding named exports by calling the export accessor functions on the given instances (which can be wasm2c-generated module instances, or arbitrary host context structures). What if an embedder wants to exert its freedom to select each import individually at runtime (per above conversation)?"
I believe this is possible even in the #1814
approach, for those embedders that wish to. The embedder would store the function pointers in its own context object (which the Wasm module supplies as the first parameter to any imported function), and then dispatch on that when the function call comes in at runtime. The same is true for memory/table/global imports -- the embedder can implement the corresponding export accessor functions however it wants. If the embedder wants to import things named "credit" and "debit" from one module and export them under the opposite names to another (or if it just wants to do that for some instances based on the passed-in context pointer), it's free to do that.
With this strategy, those applications that want individually selected imports can have it, and only those applications would pay the performance penalty of dynamic dispatch (and the space/simplicity penalty of holding on to all those function and instance pointers). This seems like a way to make everybody happy in a way that distributes the burdens of complexity and possible slowdown on those who want/need it. We could provide more scaffolding in the wasm2c-generated output to make this easier if it would help, but I think it's already possible today in #1814
if the embedder plays along.
"Why not just have the modname_module_instance_t
structure become the argument to the initialize
function (after the instance being initialized), and get rid of those module instance pointers?"
modname_module_instance_t
holds the actual function pointers, then our concerns would be in (2) above.modname_module_instance_t
only holds global, memory, and table pointers, we'd get rid of the performance penalty, but the structure would still have to contain a module instance (or an opaque host type) pointer for every imported function. In the typical case (e.g. importing 45 functions from the WASI API), that means the modname_module_instance_t
structure now holds 45 identical/redundant pointers to a single WASI context structure. Which isn't the end of the world, but it just feels ugly for the majority of applications, and makes the instance structure a lot bigger.Been playing recently with wasm2c and I'm missing easy way to glue C code with webassembly. And there is couple of aspects of it:
Mentioned above generated in C external function pointer names
(import "env" "abort" (func $~lib/builtins/abort (param i32 i32 i32 i32)))
This currently produces a not easily readable names that have to be hand wired to C. If I was able to provide as wasm2c compilation argument information that every import from (in this case) "env" should use name as given - in this case "abort" - then the C code that will be produced will use this exact name "abort"
Pointers to functions vs. function declarations Above point will provide only naming simplification for selected namespaces, but these still need to be handwired/binded to proper pointers. So another option I would like to have is to produce code with function declarations instead pointers to function for selected "namespaces" ("env" in example above). That won't cover all the cases of wiring C with wasm C, but in many cases will reduce work significantly and improve performance of code.
So to summarize compilation flags and what would they do:
# generates code with readable names that are function declarations that have to be declared somewhere else in C without need to bind anything to pointer or so
wasm2c -generate-declarations-for env
# generates code with readable names that are function pointers that have to be assigned somewhere else in C with address of function
wasm2c -generate-delegates-for env
I think its highly unlikely that binding directly to normal system symbols such as "abort" would be desirable or functional.
Firstly, function argument are very unlikely to work as expected. For example, any function that takes a string or pointer as an argument will not work since that memory spaces don't align.
Secondly, wasm2c is a form of sandbox, and one doesn't normally want that sandboxed code to make direct calls to system functions. Creating wrappers/helpers/stubs and wiring them up explicitly is going to provide a lot more safety.
one doesn't normally want that sandboxed code to make direct calls to system
Ok, was not aware that sandboxing is one of the goals of this tool. Then it is not what I'm looking for. Thanks!
(Spinning this off from https://github.com/WebAssembly/wabt/pull/1814#discussion_r850917266)
Here is a proposal for how to improve safety when linking multiple modules together with wasm2c.
As background:
At present, for every import or export, wasm2c writes an extern declaration in the generated header file. The declaration is in the form of a global pointer, named with the module name and import/export name. The exporting module's
.c
file defines the actual pointer, as well as the actual (static) element itself. At runtime, the generatedinit_exports()
function initializes each pointers to point to the exported element.If #1814 is merged, the behavior remains essentially the same. (The difference is that exports are now represented by functions that take a module instance pointer, instead of with global pointers, but the naming and linking is basically the same.)
The concern in both cases is that, because both imports and exports produce a declaration in the generated
.h
file, and each module's.c
file only includes its own.h
file, the compiler won't see an incompatibility between the types of an export vs. import. (It became easier for this to happen after removing signature mangling from function and global imports/exports, but I think this was and still is an issue for memories and tables too.) The C linker will happily link the.o
files together as long as the symbol can be resolved.So it's pretty easy to get in a situation where compiling two wasm2c-generated modules each succeeds independently, and then linking the resulting
.o
files succeeds, but the user gets invalid behavior at runtime. This seems to violate the Wasm soundness guarantees.Here's a straw-man proposal that would try to catch these at compile-time instead:
.h
file. Declarations would be made by the exporting module only. For an import, the module's generated.h
file would#include
the.h
file of every module it imports from. Thewasm2c
command-line interface would require the user to specify the.h
filename for each imported-from module. (I'm a little scared of having a default here because it seems dangerous to write a#include
based on a modname that comes from the import component of a possibly adversarial wasm input.)wasm2c
generates a bunch of C11_Static_assert
statements to enforce the Import Subtyping validation rules (https://webassembly.github.io/spec/core/valid/types.html#import-subtyping). Usually these are just about type equality, but for memories and tables it has to compare the min and max sizes. This would mean that compiling a wasm2c-generated module would require a C11 compiler (or there could be a command-line flag to disable them...).If there is consensus to go in this direction, it would also be nice to prettify the naming of imports and exports. Maybe instead of
Z_modnameZ_name
, it could be something likew2c_modname_name
orwasm_modname_name
. (A distinguished prefix seems necessary to avoid letting modules import a system symbol automatically.)