Closed alexcrichton closed 2 years ago
Is there any concern about DoS? I'm in particular imagining something like a component which parses user-controlled JSON, expecting it to correlate to one of a set of wit types. And in the case of a mismatch, logs the generated type name. Or some other pathway to user-controlled generation of these, if they can be exponentially large.
I'd also like to echo Alex's concerns with this mangling scheme. Per my limited understanding, these strings exist only to facilitate the "componentizer" (today wit-component
, tomorrow lld
?) by conveying the interface definitions and have no need to be in the output component. Additionally, with all the repeated identical types throughout the signatures, there's going to be a lot of wasted effort to dedupe them in the output component.
While certainly we can rename the imports and exports in the inner module to reduce the size of these strings in the output component (since they are really an implementation detail), I'd really prefer not having to touch the original module that's being wrapped as a component, other than perhaps stripping a custom section or two.
Unless there's another use case for the mangling other than conveying information to the "next step" in producing a component, I'd also favor sticking type definitions (or just the original wit
to be parsed again?) in a custom section that can be stripped by wit-component
in the output component. The import/export names could then simply be a function name paired with a function type index (e.g. foo: 0
), perhaps.
I think whether or not DoS is a concern depends on what's being implemented and where. For example I suspect a lot of the interface-to-mangled-name translation and back is probably developer-tooling-related where DoS is less of a security issue and more of a "well my builds are just slow" issue - which while still a problem is a bit less so. That being said as you mention it's tough I think to predict when/where this will be used and it's definitely a fact that any tool trying to performing the mapping will need to have limits baked into it which limits the size of the input that it can process or output it can produce (and from fuzzing experience it's very easy to forget these limits)
One of the major contraints for the overall operation here is that the output of LLVM/LLD as-is today needs to be able to produce a core module that can be "componentized". For example one idea I had was to possibly just have an actual comopnent type section as a custom section along with a custom import/export section mapping the core wasm's imports/exports into the component types, but I don't think that this format would be easy to produce with LLVM/LLD as-is. Even including the *.wit
might be a bit tricky depending how we're going to map the metadata of "the import module X is mapping to this wit file" or similar.
Agreed that, since the canonical ABI is a mostly toolchain-internal detail (that can be rewritten to short names in the final .wasm
output), we don't need to worry too much about pathological scenarios. But the errno
point is a really good one: that definitely feels "wrong" and it seems like we should be able to say errno
instead.
Independently, we had talked about finding a way to faithfully represent Wit type aliases in component binaries so that they can be reproduced from just the binary; if we had that ability, then it would be fairly natural to reference type aliases by name, reusing types the same way as Wit. But the problem is that there would be no direct way to map from a use of a type index (in a function signature) to its associated import/export name (that wouldn't be heuristic). This might be related to an alternative solution to #102 (although #102 is more concerned with resource types); I'll look into whether there might be a joint fix.
Yeah, I feel some tension between
If we do get the direct mapping between export names and type indices (which, based after some initial discussions today, does seem like a tentatively coherent idea), then there should be a unique representation, such that you wouldn't even call this a "compression scheme" but, rather, just a "faithful representation of the Wit". 3 is an important concern, and I think it would be addressed by subtyping, which would say that it's always compatible to (1) export a new type alias and (2) go back and forth between uses of type aliases and equivalent inline structural types. It would be an incompatible change to remove an exported type alias, but that sounds right, since it could break code using the type alias.
This is additionally compounded by the fact that mangled name strings are repeated (potentially many times) internally in components, both in instantiation argument names and in export aliasing.
This mangling scheme really doesn't sit well with me and is unnecessarily complicating the tools all to avoid having to feed wit files into a tool like wit-component
at build time (which wrapping tools like cargo component
have no problems doing so today).
I personally think name mangling should not be part of the component model proposal.
Personally I think that having a translation from a core wasm to component, in isolation, can be useful for a few reasons:
*.wit
files is not always trivial, in particular with WASI where standard libraries built with WASI won't want to ship *.wit
things as well (e.g. the Rust standard library will use something for WASI)*.wit
toolingI may be forgetting a few things here, too. Otherwise though I don't know if it's possible to solve this issue of exponential size and additionally keep these benefits I'm thinking of.
Personally I don't think we should cast this aside just yet, but I do think it's more urgent than other issues to figure this out. If there's no great solution to this then I think it should be removed from the component model -- but this requires recalibrating plans for other tooling such as wasm-componentize
.
I should perhaps amend my statement that this iteration of name mangling should not be included in the proposal.
I don't think it's valuable to implement tooling around something we know is very far from a reasonable scheme; it's just wasting effort.
The bottom line is I don't think we can adequately and efficiently encode the type information in import and export name strings alone. We need to come up with a better solution before moving forward on an implementation of a tool that consumes this information.
Just to give a bit more concrete details of what I was thinking about above, let's say we make these changes to the Component Model (semantics and Canonical ABI):
sort
's index space, exports could do likewise. For validation and compilation purposes, this export
-created index would work the same way as an alias
(new index referring to the same definition). However, you'd get this one extra bit of name information (that did not impact internal validation, but did bindings and Canonical ABI).(global (import/export "<name> = <mangled deftype>") i32 (i32.const 0))
(instead of failing, as it does now).import
- or export
-created type-index, the Canonical ABI would use the name encoded by (2) instead of expanding the definition inline as it does now.With those changes to the Component Model, then Wit-to-Component compilation could preserve name information by emitting uses of imported/exported type indices.
I still can't claim to have thought and talked through this in detail, but does that make sense?
Honestly, that sounds much more complicated for the tooling than sticking everything in a custom section (or multiple) with an expressive format — one not inherently limited to core wasm semantics — that can ultimately be discarded by the componentizer when it's done.
One important problem that I think this issue highlights is that this isn't really a detail of the Canonical ABI but, rather, a lack of expressivity in component-level types as well. In particular, if we have a goal of being able to recover a Wit file from a component binary, then I think we need step 1 to "wire up" the names as they were originally in the Wit. Given that, steps 2 and 3 seem like "business as usual" with the Canonical ABI (such that, if we wanted to use custom sections, we should do it for everything, not just types).
Can you elaborate on what you think is missing from the binary format today that prevents us from synthesizing a wit definition from a component file?
We have tooling that does exactly this already implemented, including emitting type aliases based on type export names.
And apologies if I'm coming off a bit agitated here, but currently wit-component
is broken because of the this name mangling scheme, redundancy and inefficiencies in encoding aside.
The tool isn't currently validating the expected mangled imports/exports names nor does it flow the managed names through in instantiation arguments or aliasing (lots o' potentially long strings being repeated). While I'll get it fixed, it's rather annoying given the tool doesn't even need this information to operate.
I do agree that ultimately componetization shouldn't depend on the wit definitions, but I would like the proposal, and by extension the tooling, to encode this intermediary information as efficiently as possible both as in the intermediary core module, but also in the final component (ideally being completely absent; i.e. stripped from the inner core module).
The challenge I think is if I have a component:
(component
(type $T (list string))
(export "a" (type $T))
(export "b" (type $T))
(func (export "f") (param "x" $T) ... )
)
since there is no explicit relation between the type-index $T
and the assigned type-names a
and b
, then when synthesizing a Wit file from this component, I can't tell whether to render f : func(x: a) -> ()
or f: func(x: b) -> ()
; both Wits would map to the same component.
The tooling is expecting a unique type index for an aliased type in the wit, for example:
(component
(type $T (list string))
(alias outer 0 $T (type $A1))
(alias outer 0 $T (type $A2))
(export "a" (type $A1))
(export "b" (type $A2))
(core module $m
(memory (export "mem") 0)
(func (export "x") (param i32 i32) unreachable)
(func (export "realloc") (param i32 i32 i32 i32) (result i32) unreachable)
)
(core instance $i (instantiate $m))
(alias core export $i "x" (core func $x))
(alias core export $i "realloc" (core func $realloc))
(alias core export $i "mem" (core memory $mem))
(func (export "foo") (param "x" $T) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
(func (export "bar") (param "x" $A1) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
(func (export "baz") (param "x" $A2) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
)
That should be enough for the tooling to print:
type a = list<string>
type b = list<string>
foo: func(x: list<string>) -> ()
bar: func(x: a) -> ()
baz: func(x: b) -> ()
Is that not sufficient to disambiguate? The tooling doesn't do so right now, but I suspect that's a bug (it's also out of date with the wit it prints).
That's a good way to encode Wit into component types, but if the goal is for Wit to be isomorphic to component types (such that we can ask for the .wit
for any valid component .wasm
), then .wasm
-to-.wit
can't rely on a type index to only be exported once.
It seems to me that there's a myriad of problems of having full isomorphism with respect to wit <-> component types
as the former is name-based and the latter is index-based with entirely optional naming.
For example, you could define a record type in a component and refer to the record type by type index in a function type without ever exporting the record type with a name. As records (any many other wit
types) must have a name in a wit
definition (otherwise, how would you refer to it?), it's not possible to emit a wit
record definition for such a component without giving it a synthesized name.
One might argue that one could not have produced such a component with a wit
definition to begin with (i.e. with an "unnamed" record), so using a synthesized name remains isomorphic...in a sense.
Currently our tooling errors when it encounters this as the wit-based componentization tool should have exported any record types in the output component.
Personally, I think wit
and wit
-based tooling should work with a subset of components (i.e. those that are produced originally with wit
-based tooling) with the only expectation of isomorphism being that one can get the same wit
definition out that was originally given to the wit
-based tooling; thus the tooling can encode and enforce restrictions to make that happen, not the component model itself.
To me, at least, the text and binary formats are the isomorphic representations of types in the component model; wit
is just an entirely optional tool for making authoring interfaces more palpable to humans (vs. using the text format directly).
At any rate, this discussion is a bit off in the weeds now and probably belongs in another forum, sorry about that 😞.
I think it's a valid question of whether having an isomorphism is a goal and a good one to discuss here so no worries about that. From my perspective, I think it would be pretty valuable to the ecosystem, since it would mean that Wit "covered" the whole surface area of the component model and vice versa. It would mean that there would never be a need to drag along a separate .wit
file which can simplify the toolchain, as Alex was pointing out above. Wit would effectively become the language for looking at a component "from the outside"; you'd never need to dip down to the see the underlying signature as an S-Expression.
For the cases you brought up, it seems like they could be addressed by expanding Wit to be a bit more uniform in its treatment of the different kinds of types:
list
). This is what gets rendered in-place for function params/results that use a type index of a type that isn't exported or imported.record
, variant
, etc) with the understanding that all named top-level types get exported (not just type
definitions, making the only reason to have a type
definition to give a second name to a type). Thus, noone should ever have to ask "is this name Wit-internal or does it show up in bindings?"; all Wit names become available for bindings, which seems a bit simpler to learn.This was discussed today in the wit-bindgen meeting and to write down the conclusions (but correct me if I'm misremembering):
One possibility was to have core wasm modules embed a custom section which is, itself, the component type section. This solves the size problem and it could be correlated to imports/exports via a combination of the name of the section and mangling the import/export names. The problem with this is that not all toolchains may necessarily have the ability to generate a custom section. Namely C has no source-level ability to generate such a custom section, so this idea was deemed not feasible.
The primary solution that was settled on was one where mangled strings could declare named types in a way which means further references to the name don't have to fully expand the type. This would solve the size issue but would require a resolution pass over the set of mangled names within a component. AFAIK this was as far as we got on the design of this.
After discussing this more 1:1 with @pchickey, he pointed out a problem we'll have with the approach of "declaring reusable types as part of mangled name" around imports: the linker may discard an import if it isn't called.
Thus we won't know which linker attribute applied to an import's declaration in the original source will survive through to the core module and this makes it impossible to know where a type declaration should go at the point we're expecting to be mangling these names.
We again discussed the custom section approach, where the custom section would basically be an "interface-describing" component containing a component type section (for all the component model representation of the wit-declared types), an alias section (for type aliases), and an export section (for naming the types). Essentially a full-fidelity representation of the original wit definition, but using the component model binary format. The "mangled" name could then be as simple as <function name>: <custom-section-name>: <type index>
, where <type index>
is a function type index for the type sort from the component parsed from custom section <custom-section-name>
. The componentizer could then strip out these custom sections and rename the imports and exports to shorter strings in the final component.
As we previously discussed, this felt like a non-starter because we would not be able to include this information, at least initially, for languages like C where it isn't possible to annotate the source with a splat of bytes intended for a custom section.
However, what if we had a componetizer that initially supported both ways of componetizing a core module: either by ingesting a fully self-describing core module or by being explicitly informed of the interface definitions via parsing wit files?
The intention would be to eventually support only the self-describing core module as this would be what the component model proposal itself would describe, but for languages that can't yet, there's tooling around using the external wit definitions for componetization.
We could then perhaps work on getting support for custom section annotations in more compiler front-ends (e.g. clang?), eventually deprecating the need to support componentization via side-channel wit definitions.
I realize there's no obvious stand-out solution to this "exploding name mangling" problem, so I do want to keep this discussion going until we reach a consensus.
@alexcrichton @peterhuene and I talked a bunch about how to solve this today:
We decided that mangled strings with declared types are not viable for import functions, because a library like wasi-libc would have to declare the iovec
type in the func name for either the read
or pread
import, and reuse that name in the other. Later the linker could choose to include only one of those imports in a wasm binary. So, we have no way of guaranteeing a type name which is used in an import would actually be available in the full set of import/export names found in a linked wasm.
From there, we decided that the best possible path is to use a custom section in core wasm. Peter has already created a tool which gives a wit file an isomorphic binary encoding, over for the wit-component
crate's wit2wasm and wasm2wit tools. The binary format is a Wasm Component containing type, alias, and export sections.
Wit-bindgen can take that binary format and put it directly into Rust source using #[link_section="wit:$wit_name"] pub static FOO: [u8] = $contents;
, and wasm-ld will put those contents into that custom section in the core wasm. Alex is going to follow up with a wasm object file solution for wasi-libc. Mangled core wasm function names can then be of the form $func_name:$wit_name:$type_space_index
- the final element is an index into the type space of the component in the custom section, which gives the function type.
Then, wit-component
gets a new mode where, instead of expecting to be passed in a .wit
file, it extracts the custom wit sections from the core wasm binary. It can then de-mangle each function name knowing precisely (as it did when passed the wit file) the component type information, and it can finally encode all of the required type information into the emitted component. This ends up being pretty close to how wit-component behaves today! The provenance of the wit ASTs changes from reading-off-disk to decoding from the custom section, and the name mangling determines how we match core function names up to their component types.
That sounds good to me. Reusing the binary format (and supporting machinery) of the component model to encode the destination type directly does seem like a more direct solution to the immediate problem than hacking more information into the Canonical ABI.
If bindings are ultimately being generated from a single target world
, could the type section just contain the complete component type derived from this world
? That component type has the name
→type
mapping for all imports and exports, so there wouldn't be any function name-mangling required in the core imports/exports; they could just have name
. That wouldn't work if we were attempting to assemble the world piecemeal from many independent interfaces, but I'm not aware of any remaining use cases for that?
Independently of solving the immediate type-explosion problem, I still think we should discuss the goal of maintaining an isomorphism between Wit and component types, but that's less pressing and I expect is going to come up out of necessity as part of nailing down resource type imports and exports.
A world is more specific than a library (like wasi-libc) or crate (like std
, or fastly
) that uses interfaces available in many different worlds. So, just encoding a wit rather than a whole world gives us separate compilation.
Alex is going to follow up with a wasm object file solution for wasi-libc
I experimented with this and I believe that we can support C. While clang doesn't have suppot for a source-level definition that becomes a custom section in the final wasm binary, what we can do is something that looks like:
wit-bindgen
tool has a new output, wit.o
, which is a wasm object file that's an input to wasm-ld
wit.o
has the necessary custom section with the binary encoding of the component for the input *.wit
fileswit-bindgen
for C is to include this file in the final compilation which will cause the custom section to show up.This is an example of using wasm-encoder
to create wit.o
with a custom section called my-custom-section
. There's some affordances to force wasm-ld
to actually include the object file but that's just an internal implementation detail which won't affect the final binary.
Otherwise wanted to also say I think this is a great idea @peterhuene and @pchickey, I'm glad y'all discovered the but-what-import-is-actually-linked issue!
Assuming this solution sticks and thinking about how to close out this issue: we could simply remove the name mangling from the final section of CanonicalABI.md however, in the same way that the goal of the original name mangling scheme was to define a core-wasm-compilation-target that could be independently targeted across core-wasm-producer toolchains, allowing tools like wit-component
to be reused across them, I was thinking it might be similarly useful to define the custom-section scheme (which seems pretty easy given the reuse of the type binary format) in CanonicalABI.md. Does that make sense to folks?
If so, returning to my last question above ("could the custom section define a single target component type?"):
If the intention is to use wit-bindgen
to generate the custom section and if the high-level bit of the World-refactoring in wit-bindgen
is to have a single up-front World as an input to bindings generation (thereby allowing it to link up all the use
s and address the type-sharing problems), wouldn't we have the right component type up front? That does still leave open some questions regarding separate compilation of libraries (which indeed need to work in a wide variety of worlds), but I think that's a separate problem that has to be solved in any case (given the single-World-wit-bindgen
design) as part of a separate toolchain discussion. Or am I missing some additional constraints here?
Just to follow up on my previous comment to capture the counterargument explained in the last component tooling meeting: while there is a single up-front wit-bindgen
step that is meant to encompass the "whole" component, that step may produce a single .o
that is linked by ld
with other pre-compiled .o
s, and ld
will be blindly appending custom sections, and so you might end up with multiple in the final .wasm
handed to wit-component
.
Another point that came up that sparked some enlightenment for me was the idea that each of those .o
s can contain a single (binary-encoded) componenttype
(not lists of interfaces), and so what wit-component
is doing is unioning those componenttype
s which, with #109 and the addition of URLs to import/export names, is an explicitly-intended operation. So, when I compile my library into a .o
, I choose some world
which locally implies some kebab-names that I use in my library source code. The compiled .o
contains both my local name choice and the "global" URL name it uniquely corresponds to (within my .o
). If my library now gets linked into a component along with a different library that uses the same interface (identified by URL) but compiled against a different world that makes a different kebab-naming assignment (maybe to resolve some name conflict not present in the first world), there is no problem: First, ld
, oblivious to the special semantics of the URL, will simply include both imports and both componenttype
s, because the pair <kebab-name>:<URL>
is unique. But then wit-component
, as part of the world-unioning, will be aware of the URL
-uniqueness criteria and unify these two imports into a single import of the URL in the output component, rewriting indices in both libraries to refer to the same import index. (The selection of the kebab-name of this unified import is arbitrary, but wit-component would need some default-with-override mechanism.)
Is there anything left keeping this issue open at this point?
Ah no, I think this is all handled.
I haven't been following the name mangling discussions much but was reviewing https://github.com/bytecodealliance/wit-bindgen/pull/309 when I saw a change that got me thinking about the name mangling given a diff like this:
Two consequences that come to mind for this name mangling scheme are:
Primarily the stringify-the-whole-type expansion means that the output string can have an exponential size with respect to the input type. This has come up quite a bit when fuzzing the component model in Wasmtime where any tree-based (in a sense) representation like this quickly runs out of memory while fuzzing, even for reasonable (ish) inputs. For example in the diff above the
enum
is printed twice, once on each use. Especially with variants it's been seen to get reasonably large while reusing types a lot. To be clear though I don't have a specific use case in mind when I'm worried about there being exponential size, this is more of a future-compatibility concern where anything using the name mangling, especially coupled with fuzzing, will have to severly limit its input size to ensure that the string doesn't become too large.Otherwise more practically this means that for WASI I'd worry about first-impressions and the size of import strings and modules. In WASI today basically all functions return
result<T, errno>
whereerrno
is a large enumeration (~75 I think as of now) which means that every WASI function would be imported with a nearly 1kb long string. While not really a technical limitation per se this seems like it's going to raise a lot of eyebrows when a "minimal component" will have so many large strings embedded within it.The first issue could hypothetically be fixed with an improved mangling scheme that implicitly defines type references. For example the above function could be mangled to something like
roundtrip-enum: func(a: enum { a, b, c }) -> t0
where. This be a significant increase in complexity, however. The second issue I'm not sure how to fix other than trying to maintain state across demangling an entire module. Although perhaps one idea would be to not use import/export names to contain name mangling for the core module -> component translation but instead have a separate custom section which looks more like the component type section which is used to assign component model types to the core wasm imports/exports. (the translation would then largely "just" remove the custom section and wrap it in a component with the same contents)