wasm2c linking safety (enforcing import subtyping rules at compile time)

(Spinning this off from https://github.com/WebAssembly/wabt/pull/1814#discussion_r850917266)

Here is a proposal for how to improve safety when linking multiple modules together with wasm2c.

As background:

At present, for every import or export, wasm2c writes an extern declaration in the generated header file. The declaration is in the form of a global pointer, named with the module name and import/export name. The exporting module's .c file defines the actual pointer, as well as the actual (static) element itself. At runtime, the generated init_exports() function initializes each pointers to point to the exported element.
If #1814 is merged, the behavior remains essentially the same. (The difference is that exports are now represented by functions that take a module instance pointer, instead of with global pointers, but the naming and linking is basically the same.)

The concern in both cases is that, because both imports and exports produce a declaration in the generated .h file, and each module's .c file only includes its own .h file, the compiler won't see an incompatibility between the types of an export vs. import. (It became easier for this to happen after removing signature mangling from function and global imports/exports, but I think this was and still is an issue for memories and tables too.) The C linker will happily link the .o files together as long as the symbol can be resolved.

So it's pretty easy to get in a situation where compiling two wasm2c-generated modules each succeeds independently, and then linking the resulting .o files succeeds, but the user gets invalid behavior at runtime. This seems to violate the Wasm soundness guarantees.

Here's a straw-man proposal that would try to catch these at compile-time instead:

wasm2c stops making declarations for imports in the generated .h file. Declarations would be made by the exporting module only. For an import, the module's generated .h file would #include the .h file of every module it imports from. The wasm2c command-line interface would require the user to specify the .h filename for each imported-from module. (I'm a little scared of having a default here because it seems dangerous to write a #include based on a modname that comes from the import component of a possibly adversarial wasm input.)
For the importing module, wasm2c generates a bunch of C11 _Static_assert statements to enforce the Import Subtyping validation rules (https://webassembly.github.io/spec/core/valid/types.html#import-subtyping). Usually these are just about type equality, but for memories and tables it has to compare the min and max sizes. This would mean that compiling a wasm2c-generated module would require a C11 compiler (or there could be a command-line flag to disable them...).

If there is consensus to go in this direction, it would also be nice to prettify the naming of imports and exports. Maybe instead of Z_modnameZ_name, it could be something like w2c_modname_name or wasm_modname_name. (A distinguished prefix seems necessary to avoid letting modules import a system symbol automatically.)

Regarding making the names prettier I think that can be considered separately. The problem with naive approach of using underscore as a separator is that underscore is valid as part of both module and field names and we need some way to distinguish these "mod_foo" "bar" from "mod" "foo_bar". But lets discuss that on a separate issue maybe?

I would like to suggest an alternative linking method that doesn't rely on the C static linker to resolve names at all. How about this:

Exports as left as is, with whatever nice encoding scheme we choose (Z_modnameZ_name / w2c_modname_name / whatever)
Imports are make explicit at passing into the instantiation function rather than resolved by the linker. This means its up to the embedder to construct (possibly dynamiclly) the list of imports it wishes to passs to a give module. This gives the embedder a lot more power. For example I can now construct two instances of the same module, but with different imports. For memories and tables this is obviously very important but can be useful for globals and functions too.

e.g. Each module would require a struct to be passed to its instantaite function:

struct Z_mymod_imports_t {
   memory_t ...
   table_t ..
   func1_t ..
   etc
}

struct Z_mymod_t {
   Z_mymod_imports_t imports;
}

Z_mymod_instantiate(Z_mymod_imports_t* imports);

Then to call an import the codegen would do instance->imports.myfunc(..).

Ha, I think @yhdengh was here at one point in working on #1814. Here's why I think we hesitated:

If a module imports multiple things from the same modname, we wanted to enforce that the imports come from the same instance of that modname. E.g. if the importing module looks like this:
```
(module
(import "box" "set_width" (func (param i32)))
(import "box" "the_width" (global i32))
...)
```
... we don't think the embedder should have the freedom to give it a "set_width" function from one instance of the "box" module, and then a "the_width" value from a different instance of the same module.
If a module imports something with a given modname and entity name, we wanted to enforce that the entity name really corresponds to the export entity name from the exporting module. E.g. if the importing module looks like this:
```
(module
(import "exporter" "credit_account" (func))
(import "exporter" "debit_account" (func))
(import "exporter" "blow_up_the_world" (func))
...)
```
... we think the embedder shouldn't have the freedom to rearrange these same-typed functions -- they should have to have been exported under the corresponding names ("credit_account", etc.) by the exporting module.

This led us to the #1814 approach, where the "instantiate" function doesn't take every import individually -- instead, it takes a list of module instance pointers, one for each unique modname that is imported-from, and then the instantiate function goes and gets the correctly named export from each imported-from module instance.

The question then is how module A's "instantiate" function actually gets the export from a module instance of type B. Currently (and also post-#1814), module A makes an extern declaration for the import, and the linker is matching that up with the export defined by module B. My straw-man proposal was basically, stop having the importing module make its own declaration; have it actually #include a declaration generated by the exporting module. Which seems safer, especially if backed up by a bunch of static asserts for things like memory limits. But I can't think of a way to enforce the above properties (efficiently) without involving the static linker somehow...

Ha, I think @yhdengh was here at one point in working on #1814. Here's why I think we hesitated:

If a module imports multiple things from the same modname, we wanted to enforce that the imports come from the same instance of that modname. E.g. if the importing module looks like this:
(module
  (import "box" "set_width" (func (param i32)))
  (import "box" "the_width" (global i32))
...)
... we don't think the embedder should have the freedom to give it a "set_width" function from one instance of the "box" module, and then a "the_width" value from a different instance of the same module.

I think embedder absolutely has this power. Import names are just strings and imported functions are just functions, there is nothing the forces the embedder to provide imported functions that are linked to any wasm instance at all. For example in emscripten we import everything from the env module. Each imported function is a standalone JS functions and not associated with a module at all.

When you import foo from bar there is no guarantee that a module called bar exists or that it is instantiated. it could just mean that the embedder needs to provide a function. As the embedder I could use a function called baz to satisfy that dependency, its really up to me.

If a module imports something with a given modname and entity name, we wanted to enforce that the entity name really corresponds to the export entity name from the exporting module. E.g. if the importing module looks like this:
(module
  (import "exporter" "credit_account" (func))
  (import "exporter" "debit_account" (func))
  (import "exporter" "blow_up_the_world" (func))
...)
... we think the embedder shouldn't have the freedom to rearrange these same-typed functions -- they should have to have been exported under the corresponding names ("credit_account", etc.) by the exporting module.

This led us to the #1814 approach, where the "instantiate" function doesn't take every import individually -- instead, it takes a list of module instance pointers, one for each unique modname that is imported-from, and then the instantiate function goes and gets the correctly named export from each imported-from module instance.

Take a look at the wasm C API: https://github.com/WebAssembly/wasm-c-api

My understanding is that embedder has the freedom to provide arbitrary functions when instantiating a module.

The question then is how module A's "instantiate" function actually gets the export from a module instance of type B. Currently (and also post-#1814), module A makes an extern declaration for the import, and the linker is matching that up with the export defined by module B. My straw-man proposal was basically, stop having the importing module make its own declaration; have it actually #include a declaration generated by the exporting module. Which seems safer, especially if backed up by a bunch of static asserts for things like memory limits. But I can't think of a way to enforce the above properties (efficiently) without involving the static linker somehow...

In my proposal, when the imports are a struct of typed function pointers, wouldn't the embedder get a compile error if that try to assign a function with the wrong type to given import field. That is assuming that the C type system is has a unique type of each wasm type, which I think is the case, at l least today.

I think embedder absolutely has this power. Import names are just strings and imported functions are just functions, there is nothing the forces the embedder to provide imported functions that are linked to any wasm instance at all. For example in emscripten we import everything from the env module. Each imported function is a standalone JS functions and not associated with a module at all.

Fair enough -- maybe I should have written that I don't think we have to make it easy for the embedder to link modules in this fashion, where a module that imports an API from a particular "module name" ends up with some functions calling into one instance of that module and the rest of the functions (or globals/memories/tables) accessing another instance, leading to unexpected/inconsistent behavior.

But if the embedder really wants to claim this power, fine with us. I do think this is possible in the #1814 approach (please see below).

Of course we agree that the embedder should be able to expose arbitrary host functions to wasm2c-generated modules -- in this case the "module instance" parameter is basically an opaque pointer that can mean whatever the host wants it to mean. It could just be an empty type that grants access to the host functions, but if the host puts the per-instance context inside the pointed-to structure, then it can run multiple instances of the host API concurrently. (We give an example of this in the #1814 README: https://github.com/fixpointOS/wabt/blob/module_instancing/wasm2c/README.md#instantiate-multiple-instances-of-a-module)

Take a look at the wasm C API: https://github.com/WebAssembly/wasm-c-api

My understanding is that embedder has the freedom to provide arbitrary functions when instantiating a module.

Sure, I probably shouldn't have been so definitive in my language, but if module A exports functions named "credit" and "debit", and module B imports functions named "credit" and "debit" from modname "A", I don't think we need to make it as easy for the embedder to swap these when linking the modules together as it would be to link them under matching names.

I don't object to the embedder having this freedom if it really wants to, and I think there is a way to do this even in the #1814 model which I outline below. I think it would probably be possible to implement (some of) the wasm C API on top of that.

In my proposal, when the imports are a struct of typed function pointers, wouldn't the embedder get a compile error if that try to assign a function with the wrong type to given import field. That is assuming that the C type system is has a unique type of each wasm type, which I think is the case, at l least today.

Yes, it would get a compiler error (plus probably some additional checks for the parts of the Import Subtyping rules related to memory and table min/max sizes).

Let me try to tease apart what I see as a few separate issues in the next comment.

I think there are a few semi-related questions here that maybe we're circling around. Let me try to tease some of them apart... 1 and 2 below are basically about "what gets stored in the imports structure" (and in the instance). 3 and 4 are about "what are the parameters to the initialize function."

"Should memory, global, and table imports be stored in a modname_imports_t structure?"

No objection here. After instancing, that structure would become a member of the modname_module_instance_t structure.
"What about function imports? Should they be stored as function pointers in the [module_prefix]_imports_t structure?"

With the current code, sure, no objection -- it's basically the same as today but not in the global namespace. But keeping this post-instancing would have some consequences in terms of performance and aesthetics/simplicity/space:

In terms of performance, there seem to be some pretty severe penalties from forcing every call to an imported function to go through a function pointer, once that function pointer is itself stored in a stack- or heap-allocated structure (like a module instance).

Here is a microbenchmark with a tight inner loop around two imported functions (https://github.com/keithw/wasm2c-ptr-perftest/blob/master/main_source.c#L14-L22). I ran this on a Google Cloud c2-standard-60 (Intel Cascade Lake, "Intel(R) Xeon(R) CPU @ 3.10GHz") with Ubuntu 22.04, with clang 14 (supplied by Ubuntu) as the x86-64 compiler, and clang tip-of-tree (as of today) as the Wasm compiler. All results below were with an LTO build.

Reproducing scripts are here: https://github.com/keithw/wasm2c-ptr-perftest

Results:
- Current wasm2c: 4.682 seconds
- Current wasm2c, output modified so imported functions are actual functions: 4.683 seconds
- wasm2c in #1814: 4.458 seconds
- wasm2c in #1814, output modified so imported functions are function pointers stored in instance: 9.730 seconds
- (FWIW, I tried booting the machine with mitigations=off to turn off Linux's Meltdown/Spectre mitigations, and it didn't make a discernible difference. I saw a similar ~2x slowdown on a physical Intel server we have at Stanford, but the slowdown was only around 35-40% on two AMD machines I tried.)
To summarize:
- for now, the question of direct function calls vs. indirect calls doesn't seem to matter, perhaps because the compiler or processor is smart enough to recognize these write-once global function pointers for what they are.
- Applying #1814 also has no performance penalty on this microbenchmark (indeed, it's a 5% speedup for reasons I haven't dived into).
- But with module instancing, forcing all calls to imported functions to go through a function pointer is a 2.2x slowdown. This seems like a pretty heavy price to us -- at least for what we've been working on, we'd like imported functions to be quite low-friction. Admittedly, most programs are not going to do a tight inner loop around imported functions, but still.
In terms of aesthetics/simplicity/space usage: in an instanced world, a function import is an import of a function instance (a "closure of the original function over the runtime module instance of its originating module," per https://webassembly.github.io/spec/core/exec/runtime.html#function-instances). So if we imagine a bunch of programs that each import the WASI API of 45 functions:
- each instance is going to need to store all 45 function pointers, even though (in most cases) every instance of every program probably stores the same collection of 45 function pointers, and
- each instance is also going to need to store 45 pointers to a "wasi_snapshot_preview1_module_instance_t" type, even though (within probably every instance) every one of those 45 pointers will point to the exact same place: the host's context structure for representing the state associated with that instance's WASI interface.
That feels like a fair bit of unnecessary and duplicated state to load down every instance structure with.

"In #1814, the initialize function takes (a) the instance being initialized, and then (b) a list of module instance pointers (one per unique imported modname). The initialize function goes and finds the corresponding named exports by calling the export accessor functions on the given instances (which can be wasm2c-generated module instances, or arbitrary host context structures). What if an embedder wants to exert its freedom to select each import individually at runtime (per above conversation)?"

I believe this is possible even in the #1814 approach, for those embedders that wish to. The embedder would store the function pointers in its own context object (which the Wasm module supplies as the first parameter to any imported function), and then dispatch on that when the function call comes in at runtime. The same is true for memory/table/global imports -- the embedder can implement the corresponding export accessor functions however it wants. If the embedder wants to import things named "credit" and "debit" from one module and export them under the opposite names to another (or if it just wants to do that for some instances based on the passed-in context pointer), it's free to do that.

With this strategy, those applications that want individually selected imports can have it, and only those applications would pay the performance penalty of dynamic dispatch (and the space/simplicity penalty of holding on to all those function and instance pointers). This seems like a way to make everybody happy in a way that distributes the burdens of complexity and possible slowdown on those who want/need it. We could provide more scaffolding in the wasm2c-generated output to make this easier if it would help, but I think it's already possible today in #1814 if the embedder plays along.
"Why not just have the modname_module_instance_t structure become the argument to the initialize function (after the instance being initialized), and get rid of those module instance pointers?"

If the modname_module_instance_t holds the actual function pointers, then our concerns would be in (2) above.
If the modname_module_instance_t only holds global, memory, and table pointers, we'd get rid of the performance penalty, but the structure would still have to contain a module instance (or an opaque host type) pointer for every imported function. In the typical case (e.g. importing 45 functions from the WASI API), that means the modname_module_instance_t structure now holds 45 identical/redundant pointers to a single WASI context structure. Which isn't the end of the world, but it just feels ugly for the majority of applications, and makes the instance structure a lot bigger.

Been playing recently with wasm2c and I'm missing easy way to glue C code with webassembly. And there is couple of aspects of it:

Mentioned above generated in C external function pointer names
```
(import "env" "abort" (func $~lib/builtins/abort (param i32 i32 i32 i32)))
```
This currently produces a not easily readable names that have to be hand wired to C. If I was able to provide as wasm2c compilation argument information that every import from (in this case) "env" should use name as given - in this case "abort" - then the C code that will be produced will use this exact name "abort"
Pointers to functions vs. function declarations Above point will provide only naming simplification for selected namespaces, but these still need to be handwired/binded to proper pointers. So another option I would like to have is to produce code with function declarations instead pointers to function for selected "namespaces" ("env" in example above). That won't cover all the cases of wiring C with wasm C, but in many cases will reduce work significantly and improve performance of code.

So to summarize compilation flags and what would they do:

# generates code with readable names that are function declarations that have to be declared somewhere else in C without need to bind anything to pointer or so
wasm2c -generate-declarations-for env

# generates code with readable names that are function pointers that have to be assigned somewhere else in C with address of function
wasm2c -generate-delegates-for env

I think its highly unlikely that binding directly to normal system symbols such as "abort" would be desirable or functional.

Firstly, function argument are very unlikely to work as expected. For example, any function that takes a string or pointer as an argument will not work since that memory spaces don't align.

Secondly, wasm2c is a form of sandbox, and one doesn't normally want that sandboxed code to make direct calls to system functions. Creating wrappers/helpers/stubs and wiring them up explicitly is going to provide a lot more safety.

one doesn't normally want that sandboxed code to make direct calls to system

Ok, was not aware that sandboxing is one of the goals of this tool. Then it is not what I'm looking for. Thanks!

WebAssembly / wabt

wasm2c linking safety (enforcing import subtyping rules at compile time) #1908

Here's a straw-man proposal that would try to catch these at compile-time instead: