String size of mangled function names could be exponentially large

alexcrichton commented 2 years ago

I haven't been following the name mangling discussions much but was reviewing https://github.com/bytecodealliance/wit-bindgen/pull/309 when I saw a change that got me thinking about the name mangling given a diff like this:

-   #[link_name = "roundtrip-enum"]
+   #[link_name = "roundtrip-enum: func(a: enum { a, b, c }) -> enum { a, b, c }"]
    fn roundtrip_enum(a: i32) -> i32;

Two consequences that come to mind for this name mangling scheme are:

Primarily the stringify-the-whole-type expansion means that the output string can have an exponential size with respect to the input type. This has come up quite a bit when fuzzing the component model in Wasmtime where any tree-based (in a sense) representation like this quickly runs out of memory while fuzzing, even for reasonable (ish) inputs. For example in the diff above the enum is printed twice, once on each use. Especially with variants it's been seen to get reasonably large while reusing types a lot. To be clear though I don't have a specific use case in mind when I'm worried about there being exponential size, this is more of a future-compatibility concern where anything using the name mangling, especially coupled with fuzzing, will have to severly limit its input size to ensure that the string doesn't become too large.
Otherwise more practically this means that for WASI I'd worry about first-impressions and the size of import strings and modules. In WASI today basically all functions return result<T, errno> where errno is a large enumeration (~75 I think as of now) which means that every WASI function would be imported with a nearly 1kb long string. While not really a technical limitation per se this seems like it's going to raise a lot of eyebrows when a "minimal component" will have so many large strings embedded within it.

The first issue could hypothetically be fixed with an improved mangling scheme that implicitly defines type references. For example the above function could be mangled to something like roundtrip-enum: func(a: enum { a, b, c }) -> t0 where. This be a significant increase in complexity, however. The second issue I'm not sure how to fix other than trying to maintain state across demangling an entire module. Although perhaps one idea would be to not use import/export names to contain name mangling for the core module -> component translation but instead have a separate custom section which looks more like the component type section which is used to assign component model types to the core wasm imports/exports. (the translation would then largely "just" remove the custom section and wrap it in a component with the same contents)

kulakowski-wasm commented 2 years ago

Is there any concern about DoS? I'm in particular imagining something like a component which parses user-controlled JSON, expecting it to correlate to one of a set of wit types. And in the case of a mismatch, logs the generated type name. Or some other pathway to user-controlled generation of these, if they can be exponentially large.

peterhuene commented 2 years ago

I'd also like to echo Alex's concerns with this mangling scheme. Per my limited understanding, these strings exist only to facilitate the "componentizer" (today wit-component, tomorrow lld?) by conveying the interface definitions and have no need to be in the output component. Additionally, with all the repeated identical types throughout the signatures, there's going to be a lot of wasted effort to dedupe them in the output component.

While certainly we can rename the imports and exports in the inner module to reduce the size of these strings in the output component (since they are really an implementation detail), I'd really prefer not having to touch the original module that's being wrapped as a component, other than perhaps stripping a custom section or two.

Unless there's another use case for the mangling other than conveying information to the "next step" in producing a component, I'd also favor sticking type definitions (or just the original wit to be parsed again?) in a custom section that can be stripped by wit-component in the output component. The import/export names could then simply be a function name paired with a function type index (e.g. foo: 0), perhaps.

alexcrichton commented 2 years ago

I think whether or not DoS is a concern depends on what's being implemented and where. For example I suspect a lot of the interface-to-mangled-name translation and back is probably developer-tooling-related where DoS is less of a security issue and more of a "well my builds are just slow" issue - which while still a problem is a bit less so. That being said as you mention it's tough I think to predict when/where this will be used and it's definitely a fact that any tool trying to performing the mapping will need to have limits baked into it which limits the size of the input that it can process or output it can produce (and from fuzzing experience it's very easy to forget these limits)

One of the major contraints for the overall operation here is that the output of LLVM/LLD as-is today needs to be able to produce a core module that can be "componentized". For example one idea I had was to possibly just have an actual comopnent type section as a custom section along with a custom import/export section mapping the core wasm's imports/exports into the component types, but I don't think that this format would be easy to produce with LLVM/LLD as-is. Even including the *.wit might be a bit tricky depending how we're going to map the metadata of "the import module X is mapping to this wit file" or similar.

lukewagner commented 2 years ago

Agreed that, since the canonical ABI is a mostly toolchain-internal detail (that can be rewritten to short names in the final .wasm output), we don't need to worry too much about pathological scenarios. But the errno point is a really good one: that definitely feels "wrong" and it seems like we should be able to say errno instead.

Independently, we had talked about finding a way to faithfully represent Wit type aliases in component binaries so that they can be reproduced from just the binary; if we had that ability, then it would be fairly natural to reference type aliases by name, reusing types the same way as Wit. But the problem is that there would be no direct way to map from a use of a type index (in a function signature) to its associated import/export name (that wouldn't be heuristic). This might be related to an alternative solution to #102 (although #102 is more concerned with resource types); I'll look into whether there might be a joint fix.

kulakowski-wasm commented 2 years ago

Yeah, I feel some tension between

making these smaller with this sort of compression/encoding scheme
defining a unique representation of names
compatibility in the face of innocuous-feeling refactors

lukewagner commented 2 years ago

If we do get the direct mapping between export names and type indices (which, based after some initial discussions today, does seem like a tentatively coherent idea), then there should be a unique representation, such that you wouldn't even call this a "compression scheme" but, rather, just a "faithful representation of the Wit". 3 is an important concern, and I think it would be addressed by subtyping, which would say that it's always compatible to (1) export a new type alias and (2) go back and forth between uses of type aliases and equivalent inline structural types. It would be an incompatible change to remove an exported type alias, but that sounds right, since it could break code using the type alias.

peterhuene commented 2 years ago

This is additionally compounded by the fact that mangled name strings are repeated (potentially many times) internally in components, both in instantiation argument names and in export aliasing.

This mangling scheme really doesn't sit well with me and is unnecessarily complicating the tools all to avoid having to feed wit files into a tool like wit-component at build time (which wrapping tools like cargo component have no problems doing so today).

I personally think name mangling should not be part of the component model proposal.

alexcrichton commented 2 years ago

Personally I think that having a translation from a core wasm to component, in isolation, can be useful for a few reasons:

Sheperding around *.wit files is not always trivial, in particular with WASI where standard libraries built with WASI won't want to ship *.wit things as well (e.g. the Rust standard library will use something for WASI)
Having "just target this name" provides an easier story for other languages that aren't as integrated with tooling as Rust is. For example importing component-model-related things into wasi-libc is easier if you just target a name rather than having to pulling in *.wit tooling
In some sense I feel like it provides a more clear migration path of sorts from core wasm modules to components where without this each language would sort of need its own system for creating a component.

I may be forgetting a few things here, too. Otherwise though I don't know if it's possible to solve this issue of exponential size and additionally keep these benefits I'm thinking of.

Personally I don't think we should cast this aside just yet, but I do think it's more urgent than other issues to figure this out. If there's no great solution to this then I think it should be removed from the component model -- but this requires recalibrating plans for other tooling such as wasm-componentize.

peterhuene commented 2 years ago

I should perhaps amend my statement that this iteration of name mangling should not be included in the proposal.

I don't think it's valuable to implement tooling around something we know is very far from a reasonable scheme; it's just wasting effort.

The bottom line is I don't think we can adequately and efficiently encode the type information in import and export name strings alone. We need to come up with a better solution before moving forward on an implementation of a tool that consumes this information.

lukewagner commented 2 years ago

Just to give a bit more concrete details of what I was thinking about above, let's say we make these changes to the Component Model (semantics and Canonical ABI):

Just like how each import appends a new index to its sort's index space, exports could do likewise. For validation and compilation purposes, this export-created index would work the same way as an alias (new index referring to the same definition). However, you'd get this one extra bit of name information (that did not impact internal validation, but did bindings and Canonical ABI).
For each type import/export, the Canonical ABI would produce a (global (import/export "<name> = <mangled deftype>") i32 (i32.const 0)) (instead of failing, as it does now).
For each use of an import- or export-created type-index, the Canonical ABI would use the name encoded by (2) instead of expanding the definition inline as it does now.

With those changes to the Component Model, then Wit-to-Component compilation could preserve name information by emitting uses of imported/exported type indices.

I still can't claim to have thought and talked through this in detail, but does that make sense?

peterhuene commented 2 years ago

Honestly, that sounds much more complicated for the tooling than sticking everything in a custom section (or multiple) with an expressive format — one not inherently limited to core wasm semantics — that can ultimately be discarded by the componentizer when it's done.

lukewagner commented 2 years ago

One important problem that I think this issue highlights is that this isn't really a detail of the Canonical ABI but, rather, a lack of expressivity in component-level types as well. In particular, if we have a goal of being able to recover a Wit file from a component binary, then I think we need step 1 to "wire up" the names as they were originally in the Wit. Given that, steps 2 and 3 seem like "business as usual" with the Canonical ABI (such that, if we wanted to use custom sections, we should do it for everything, not just types).

peterhuene commented 2 years ago

Can you elaborate on what you think is missing from the binary format today that prevents us from synthesizing a wit definition from a component file?

We have tooling that does exactly this already implemented, including emitting type aliases based on type export names.

And apologies if I'm coming off a bit agitated here, but currently wit-component is broken because of the this name mangling scheme, redundancy and inefficiencies in encoding aside.

The tool isn't currently validating the expected mangled imports/exports names nor does it flow the managed names through in instantiation arguments or aliasing (lots o' potentially long strings being repeated). While I'll get it fixed, it's rather annoying given the tool doesn't even need this information to operate.

I do agree that ultimately componetization shouldn't depend on the wit definitions, but I would like the proposal, and by extension the tooling, to encode this intermediary information as efficiently as possible both as in the intermediary core module, but also in the final component (ideally being completely absent; i.e. stripped from the inner core module).

lukewagner commented 2 years ago

The challenge I think is if I have a component:

(component
  (type $T (list string))
  (export "a" (type $T))
  (export "b" (type $T))
  (func (export "f") (param "x" $T) ... )
)

since there is no explicit relation between the type-index $T and the assigned type-names a and b, then when synthesizing a Wit file from this component, I can't tell whether to render f : func(x: a) -> () or f: func(x: b) -> (); both Wits would map to the same component.

peterhuene commented 2 years ago

The tooling is expecting a unique type index for an aliased type in the wit, for example:

(component
  (type $T (list string))
  (alias outer 0 $T (type $A1))
  (alias outer 0 $T (type $A2))
  (export "a" (type $A1))
  (export "b" (type $A2))
  (core module $m
    (memory (export "mem") 0)
    (func (export "x") (param i32 i32) unreachable)
    (func (export "realloc") (param i32 i32 i32 i32) (result i32) unreachable)
  )
  (core instance $i (instantiate $m))
  (alias core export $i "x" (core func $x))
  (alias core export $i "realloc" (core func $realloc))
  (alias core export $i "mem" (core memory $mem))
  (func (export "foo") (param "x" $T) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
  (func (export "bar") (param "x" $A1) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
  (func (export "baz") (param "x" $A2) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
)

That should be enough for the tooling to print:

type a = list<string>
type b = list<string>

foo: func(x: list<string>) -> ()

bar: func(x: a) -> ()

baz: func(x: b) -> ()

Is that not sufficient to disambiguate? The tooling doesn't do so right now, but I suspect that's a bug (it's also out of date with the wit it prints).

lukewagner commented 2 years ago

That's a good way to encode Wit into component types, but if the goal is for Wit to be isomorphic to component types (such that we can ask for the .wit for any valid component .wasm), then .wasm-to-.wit can't rely on a type index to only be exported once.

peterhuene commented 2 years ago

It seems to me that there's a myriad of problems of having full isomorphism with respect to wit <-> component types as the former is name-based and the latter is index-based with entirely optional naming.

For example, you could define a record type in a component and refer to the record type by type index in a function type without ever exporting the record type with a name. As records (any many other wit types) must have a name in a wit definition (otherwise, how would you refer to it?), it's not possible to emit a wit record definition for such a component without giving it a synthesized name.

One might argue that one could not have produced such a component with a wit definition to begin with (i.e. with an "unnamed" record), so using a synthesized name remains isomorphic...in a sense.

Currently our tooling errors when it encounters this as the wit-based componentization tool should have exported any record types in the output component.

Personally, I think wit and wit-based tooling should work with a subset of components (i.e. those that are produced originally with wit-based tooling) with the only expectation of isomorphism being that one can get the same wit definition out that was originally given to the wit-based tooling; thus the tooling can encode and enforce restrictions to make that happen, not the component model itself.

To me, at least, the text and binary formats are the isomorphic representations of types in the component model; wit is just an entirely optional tool for making authoring interfaces more palpable to humans (vs. using the text format directly).

At any rate, this discussion is a bit off in the weeds now and probably belongs in another forum, sorry about that 😞.

lukewagner commented 2 years ago

I think it's a valid question of whether having an isomorphism is a goal and a good one to discuss here so no worries about that. From my perspective, I think it would be pretty valuable to the ecosystem, since it would mean that Wit "covered" the whole surface area of the component model and vice versa. It would mean that there would never be a need to drag along a separate .wit file which can simplify the toolchain, as Alex was pointing out above. Wit would effectively become the language for looking at a component "from the outside"; you'd never need to dip down to the see the underlying signature as an S-Expression.

For the cases you brought up, it seems like they could be addressed by expanding Wit to be a bit more uniform in its treatment of the different kinds of types:

Allow all types to be written anonymously inline (not just list). This is what gets rendered in-place for function params/results that use a type index of a type that isn't exported or imported.
Allow all types to be named top-level (not just record, variant, etc) with the understanding that all named top-level types get exported (not just type definitions, making the only reason to have a type definition to give a second name to a type). Thus, noone should ever have to ask "is this name Wit-internal or does it show up in bindings?"; all Wit names become available for bindings, which seems a bit simpler to learn.

alexcrichton commented 2 years ago

This was discussed today in the wit-bindgen meeting and to write down the conclusions (but correct me if I'm misremembering):

One possibility was to have core wasm modules embed a custom section which is, itself, the component type section. This solves the size problem and it could be correlated to imports/exports via a combination of the name of the section and mangling the import/export names. The problem with this is that not all toolchains may necessarily have the ability to generate a custom section. Namely C has no source-level ability to generate such a custom section, so this idea was deemed not feasible.
The primary solution that was settled on was one where mangled strings could declare named types in a way which means further references to the name don't have to fully expand the type. This would solve the size issue but would require a resolution pass over the set of mangled names within a component. AFAIK this was as far as we got on the design of this.

peterhuene commented 2 years ago

After discussing this more 1:1 with @pchickey, he pointed out a problem we'll have with the approach of "declaring reusable types as part of mangled name" around imports: the linker may discard an import if it isn't called.

Thus we won't know which linker attribute applied to an import's declaration in the original source will survive through to the core module and this makes it impossible to know where a type declaration should go at the point we're expecting to be mangling these names.

We again discussed the custom section approach, where the custom section would basically be an "interface-describing" component containing a component type section (for all the component model representation of the wit-declared types), an alias section (for type aliases), and an export section (for naming the types). Essentially a full-fidelity representation of the original wit definition, but using the component model binary format. The "mangled" name could then be as simple as <function name>: <custom-section-name>: <type index>, where <type index> is a function type index for the type sort from the component parsed from custom section <custom-section-name>. The componentizer could then strip out these custom sections and rename the imports and exports to shorter strings in the final component.

As we previously discussed, this felt like a non-starter because we would not be able to include this information, at least initially, for languages like C where it isn't possible to annotate the source with a splat of bytes intended for a custom section.

However, what if we had a componetizer that initially supported both ways of componetizing a core module: either by ingesting a fully self-describing core module or by being explicitly informed of the interface definitions via parsing wit files?

The intention would be to eventually support only the self-describing core module as this would be what the component model proposal itself would describe, but for languages that can't yet, there's tooling around using the external wit definitions for componetization.

We could then perhaps work on getting support for custom section annotations in more compiler front-ends (e.g. clang?), eventually deprecating the need to support componentization via side-channel wit definitions.

I realize there's no obvious stand-out solution to this "exploding name mangling" problem, so I do want to keep this discussion going until we reach a consensus.

pchickey commented 2 years ago

@alexcrichton @peterhuene and I talked a bunch about how to solve this today:

We decided that mangled strings with declared types are not viable for import functions, because a library like wasi-libc would have to declare the iovec type in the func name for either the read or pread import, and reuse that name in the other. Later the linker could choose to include only one of those imports in a wasm binary. So, we have no way of guaranteeing a type name which is used in an import would actually be available in the full set of import/export names found in a linked wasm.

From there, we decided that the best possible path is to use a custom section in core wasm. Peter has already created a tool which gives a wit file an isomorphic binary encoding, over for the wit-component crate's wit2wasm and wasm2wit tools. The binary format is a Wasm Component containing type, alias, and export sections.

Wit-bindgen can take that binary format and put it directly into Rust source using #[link_section="wit:$wit_name"] pub static FOO: [u8] = $contents;, and wasm-ld will put those contents into that custom section in the core wasm. Alex is going to follow up with a wasm object file solution for wasi-libc. Mangled core wasm function names can then be of the form $func_name:$wit_name:$type_space_index - the final element is an index into the type space of the component in the custom section, which gives the function type.

Then, wit-component gets a new mode where, instead of expecting to be passed in a .wit file, it extracts the custom wit sections from the core wasm binary. It can then de-mangle each function name knowing precisely (as it did when passed the wit file) the component type information, and it can finally encode all of the required type information into the emitted component. This ends up being pretty close to how wit-component behaves today! The provenance of the wit ASTs changes from reading-off-disk to decoding from the custom section, and the name mangling determines how we match core function names up to their component types.

lukewagner commented 2 years ago

That sounds good to me. Reusing the binary format (and supporting machinery) of the component model to encode the destination type directly does seem like a more direct solution to the immediate problem than hacking more information into the Canonical ABI.

If bindings are ultimately being generated from a single target world, could the type section just contain the complete component type derived from this world? That component type has the name→type mapping for all imports and exports, so there wouldn't be any function name-mangling required in the core imports/exports; they could just have name. That wouldn't work if we were attempting to assemble the world piecemeal from many independent interfaces, but I'm not aware of any remaining use cases for that?

Independently of solving the immediate type-explosion problem, I still think we should discuss the goal of maintaining an isomorphism between Wit and component types, but that's less pressing and I expect is going to come up out of necessity as part of nailing down resource type imports and exports.

pchickey commented 2 years ago

A world is more specific than a library (like wasi-libc) or crate (like std, or fastly) that uses interfaces available in many different worlds. So, just encoding a wit rather than a whole world gives us separate compilation.

alexcrichton commented 2 years ago

Alex is going to follow up with a wasm object file solution for wasi-libc

I experimented with this and I believe that we can support C. While clang doesn't have suppot for a source-level definition that becomes a custom section in the final wasm binary, what we can do is something that looks like:

The wit-bindgen tool has a new output, wit.o, which is a wasm object file that's an input to wasm-ld
This wit.o has the necessary custom section with the binary encoding of the component for the input *.wit files
The "interface" of wit-bindgen for C is to include this file in the final compilation which will cause the custom section to show up.

This is an example of using wasm-encoder to create wit.o with a custom section called my-custom-section. There's some affordances to force wasm-ld to actually include the object file but that's just an internal implementation detail which won't affect the final binary.

Otherwise wanted to also say I think this is a great idea @peterhuene and @pchickey, I'm glad y'all discovered the but-what-import-is-actually-linked issue!

lukewagner commented 2 years ago

Assuming this solution sticks and thinking about how to close out this issue: we could simply remove the name mangling from the final section of CanonicalABI.md however, in the same way that the goal of the original name mangling scheme was to define a core-wasm-compilation-target that could be independently targeted across core-wasm-producer toolchains, allowing tools like wit-component to be reused across them, I was thinking it might be similarly useful to define the custom-section scheme (which seems pretty easy given the reuse of the type binary format) in CanonicalABI.md. Does that make sense to folks?

If so, returning to my last question above ("could the custom section define a single target component type?"):

If the intention is to use wit-bindgen to generate the custom section and if the high-level bit of the World-refactoring in wit-bindgen is to have a single up-front World as an input to bindings generation (thereby allowing it to link up all the uses and address the type-sharing problems), wouldn't we have the right component type up front? That does still leave open some questions regarding separate compilation of libraries (which indeed need to work in a wide variety of worlds), but I think that's a separate problem that has to be solved in any case (given the single-World-wit-bindgen design) as part of a separate toolchain discussion. Or am I missing some additional constraints here?

lukewagner commented 2 years ago

Just to follow up on my previous comment to capture the counterargument explained in the last component tooling meeting: while there is a single up-front wit-bindgen step that is meant to encompass the "whole" component, that step may produce a single .o that is linked by ld with other pre-compiled .os, and ld will be blindly appending custom sections, and so you might end up with multiple in the final .wasm handed to wit-component.

Another point that came up that sparked some enlightenment for me was the idea that each of those .os can contain a single (binary-encoded) componenttype (not lists of interfaces), and so what wit-component is doing is unioning those componenttypes which, with #109 and the addition of URLs to import/export names, is an explicitly-intended operation. So, when I compile my library into a .o, I choose some world which locally implies some kebab-names that I use in my library source code. The compiled .o contains both my local name choice and the "global" URL name it uniquely corresponds to (within my .o). If my library now gets linked into a component along with a different library that uses the same interface (identified by URL) but compiled against a different world that makes a different kebab-naming assignment (maybe to resolve some name conflict not present in the first world), there is no problem: First, ld, oblivious to the special semantics of the URL, will simply include both imports and both componenttypes, because the pair <kebab-name>:<URL> is unique. But then wit-component, as part of the world-unioning, will be aware of the URL-uniqueness criteria and unify these two imports into a single import of the URL in the output component, rewriting indices in both libraries to refer to the same import index. (The selection of the kebab-name of this unified import is arbitrary, but wit-component would need some default-with-override mechanism.)

peterhuene commented 2 years ago

Is there anything left keeping this issue open at this point?

alexcrichton commented 2 years ago

Ah no, I think this is all handled.

WebAssembly / component-model

String size of mangled function names could be exponentially large #106