Inferring import kind from string syntax

alexcrichton commented 10 months ago

I've been reviewing https://github.com/bytecodealliance/wasm-tools/pull/1146 which is the start of the implementation of implementation imports for the component model and it's raising questions about internal details which I wanted to raise to the design level. Before this PR only two forms of imports were supported for components:

(import "foo" (func ...))
(import (interface "foo:bar/baz") (func ...))

With the recently specified implementation imports the above PR is adding support for new forms of imports:

;; from before
(import "foo" (func ...))
(import (interface "foo:bar/baz") (func ...))

;; new
(import "foo" (integrity "xx") (func ...))
(import "foo" (url "xx") (func ...))
(import "foo" (relative-url "xx") (func ...))
(import (locked-dep "foo:bar/baz") (func ...))
(import (unlocked-dep "foo:bar/baz") (func ...))

Throughout these refactorings, and previously when (interface ...) imports were added, the internal data structures of much of the tooling around the component model ignore this metadata and instead think of imports as a map of "string to thing". This is additionally done for instantiation, where instantiation arguments are provided as a list of "string to thing". Each import form then has a canonical string associated with it that is used internally. This canonical string is what disallows overlap between imports, but it additionally loses context like (url ...) and (integrity ...), which I believe is ok for the current use cases of the tooling (e.g. the url or integrity doesn't affect validation):

(import "foo" (func ...))                         ;; name = "foo"
(import (interface "foo:bar/baz") (func ...))     ;; name = "foo:bar/baz"
(import "foo" (integrity "xx") (func ...))        ;; name = "foo"
(import "foo" (url "xx") (func ...))              ;; name = "foo"
(import "foo" (relative-url "xx") (func ...))     ;; name = "foo"
(import (locked-dep "foo:bar/baz") (func ...))    ;; name = "foo:bar/baz"
(import (unlocked-dep "foo:bar/baz") (func ...))  ;; name = "foo:bar/baz"
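
To make the loss of context concrete, here is a minimal Rust sketch of the mapping in the list above. The `ImportName` enum and its `canonical` method are hypothetical names made up for illustration; they are not wasmparser's actual types.

```rust
// Hypothetical model of the seven import forms; not wasmparser's real types.
#[derive(Debug, Clone, PartialEq)]
enum ImportName {
    Kebab { name: String },
    Interface { id: String },
    Integrity { name: String, hash: String },
    Url { name: String, url: String },
    RelativeUrl { name: String, url: String },
    LockedDep { id: String },
    UnlockedDep { id: String },
}

impl ImportName {
    /// The canonical string used as the internal map key.
    /// The import kind, url, and integrity hash are all dropped here.
    fn canonical(&self) -> &str {
        match self {
            ImportName::Kebab { name }
            | ImportName::Integrity { name, .. }
            | ImportName::Url { name, .. }
            | ImportName::RelativeUrl { name, .. } => name.as_str(),
            ImportName::Interface { id }
            | ImportName::LockedDep { id }
            | ImportName::UnlockedDep { id } => id.as_str(),
        }
    }
}
```

Under this sketch an `Interface` and a `LockedDep` with the same id are different values but produce the same canonical string, so nothing that only sees the string can tell them apart.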

So far so good, but a problem is starting to arise at the next step of integrating this change into tooling. There are a number of locations where this intermediate representation of "string to thing" is then reencoded as a component. For example wasm-compose uses the results of wasmparser validation to create a new component. This walks over the imports of one component and generates new imports in an outer component based on the union of subcomponents (e.g. you import foo, I import bar, when we're composed the outer component imports foo and bar). With implementation imports this is starting to break down because the results of validation don't have all the metadata for imports like urls/integrity or even a differentiator for the kind of import (e.g. interface vs locked-dep).

Previously this sort of worked where the structure of the name could be used to infer the import. For example if the name had a / or : then it previously was required to be an interface import where otherwise it was a kebab-name import. Now though there are many more fields to infer and additionally some that are not syntactically distinguished by their string (e.g. (interface "a:b/c") and (locked-dep "a:b/c")).
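
As a rough illustration of the old inference rule, the following Rust sketch classifies a name purely by its characters. `InferredKind` and `infer_kind` are hypothetical names, and the rule is only the simplified "contains / or :" check described above, not the real validation logic.

```rust
// Hypothetical classification by string syntax alone.
#[derive(Debug, PartialEq)]
enum InferredKind {
    Kebab,
    Interface,
}

/// Guess the import kind from the canonical string.
/// Anything containing ':' or '/' is assumed to be an interface import.
fn infer_kind(name: &str) -> InferredKind {
    if name.contains(':') || name.contains('/') {
        InferredKind::Interface
    } else {
        InferredKind::Kebab
    }
}
```

With only two forms this guess was always right. Once (locked-dep "a:b/c") shares the string "a:b/c" with (interface "a:b/c"), the same function still answers `Interface` for both, which is the ambiguity at issue.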

So far I believe we've been roughly trying to keep an equivalence where "map of strings" is a valid way to view the imports and exports of a component. The binary encoding is stricter to provide more semantic meaning and enumerate the various accepted forms. Currently, however, the change with implementation imports is feeling like it's pushing in the direction of "map of string to thing" is no longer a valid representation for component imports.

Thus, I'm opening up this issue for some further discussion. I'm curious whether others think I'm approaching this completely the wrong way. Or are we trying to stuff too much into imports? Or is "map of string to thing" no longer desired and implementations should all be refactored?

I originally started typing all this up to solve an ambiguity between (interface "a:b/c") and (locked-dep "a:b/c") by perhaps having their string representation be syntactically different, or something like that. I realize though that this still doesn't take into account integrity which wasm-compose otherwise wouldn't be able to preserve today either. I'm not actually sure how best to support that myself, which is why I'm thinking a bit broader here at the end of typing this.

lukewagner commented 10 months ago

Thanks for clearly articulating the issue! One high-level design choice we can discuss is: does all the importname metadata go inside the quoted string or not? E.g., instead of what you wrote above, we could alternatively have (strawperson syntax here):

(import "foo" (func ...))
(import "interface(foo:bar/baz)" (func ...))
(import "contents(integrity=xx)" (func ...))
(import "url(xx, integrity=xx)" (func ...))
(import "relative-url(foo, integrity=xx)" (func ...))
(import "locked-dep(foo:bar/baz, integrity=xx)" (func ...))
(import "unlocked-dep(foo:bar/baz)" (func ...))

With this, the string would be all you need. One downside I had been imagining that motivated me towards the current design is: when a component is instantiated with explicit arguments (e.g., via the import object of WebAssembly.instantiate() or wasm-compose's input language), it's a bit gnarly to write out a full URL or integrity hash, hence wanting to separate out "the unique key string" vs. "the full name". That being said, taking a fresh look at this concern with a better understanding of the emerging tooling workflows, this might not be too much of a problem in practice:

With wasm-compose and hopefully-future ESM-integration alternatives to WA.instantiate(), you'll get "default propagation" of imports, which means that you won't have to actually write the full string unless you want to explicitly supply something non-default (which should be much more common for interface imports than for implementation imports).
URLs and content hashes will often only be inserted at the end of a compiler pipeline.
The call to WA.instantiate() is usually auto-generated.

Another possible downside is that, by encoding this structured info into a string, the string might devolve into an ad hoc complex mini-language in the future. It's hard to know how much of a problem this will be. Based on the above example syntax, though, the pattern <label>(...) would give us a ton of room in the future to backwards-compatibly add more <label>s and whatever we want inside the parens, so maybe we're fine here too.
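
For a feel of how small the parsing side of the <label>(...) pattern could be, here is a Rust sketch against the strawperson strings above. The helper names `split_label` and `split_args` are made up, and the comma and `=` handling is an assumption of this sketch: it would misparse a URL that itself contains a comma or `=`, which a real grammar would have to address.

```rust
/// Split a strawperson import string into an optional label and its body.
/// A string without a trailing `)` is treated as a plain name.
fn split_label(s: &str) -> (Option<&str>, &str) {
    if let Some(inner) = s.strip_suffix(')') {
        if let Some((label, body)) = inner.split_once('(') {
            return (Some(label), body);
        }
    }
    (None, s)
}

/// Split a body into one optional positional argument and `key=value` pairs.
fn split_args(body: &str) -> (Option<&str>, Vec<(&str, &str)>) {
    let mut positional = None;
    let mut attrs = Vec::new();
    for part in body.split(',').map(str::trim) {
        match part.split_once('=') {
            Some((key, value)) => attrs.push((key, value)),
            None if !part.is_empty() => positional = Some(part),
            None => {}
        }
    }
    (positional, attrs)
}
```

A new label only needs a new branch in whatever consumes the `(label, body)` pair, which is the backwards-compatibility room mentioned above.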

WDYT?

alexcrichton commented 10 months ago

Personally I was also a fan originally of keeping things structured, but I agree that as this has emerged over time it may be best to go back to using strings for everything. I do still think there's a case to be made for it, for example if each structured import clearly mapped to an unambiguous string, e.g. your strawperson syntax, then the structured form could perhaps be considered easier to validate or something like that.

With the above syntax, though, are you imagining that to satisfy contents(integrity=xx) as part of an instantiation argument you'd have to specify (with "contents(integrity=xx)" (...))?

lukewagner commented 10 months ago

Yes, if everything goes in the string, then that's what with would have to say as well. That does hypothetically make a size argument against, but I think the general solution here is defining a "strings" section/index-space for factoring out common strings (which would help more than just this one case of duplication).

But yeah, another option to consider is to define a lossless mapping from externname (as it currently stands) to a string so that, if you did just want a string, you could have that. I initially started thinking in this direction when writing my first reply, but then I started to worry about having 3 different concepts of import name (the AST, the "full string encoding", and the shorter "unique key"). But maybe "the full string encoding" is just an impl detail that doesn't surface to most devs?
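
A sketch of what those three concepts might look like side by side, in Rust. Everything here is hypothetical: `ExternName` covers only a subset of the forms, the url forms are left out, and the "full string" simply reuses the strawperson <label>(...) shape rather than any agreed format.

```rust
// Hypothetical subset of import names; not the spec's actual externname grammar.
#[derive(Debug, Clone, PartialEq)]
enum ExternName {
    Kebab(String),
    Interface(String),
    LockedDep { id: String, integrity: Option<String> },
    UnlockedDep(String),
}

impl ExternName {
    /// Short "unique key": enough to detect overlap, not enough to re-encode.
    fn key(&self) -> &str {
        match self {
            ExternName::Kebab(s)
            | ExternName::Interface(s)
            | ExternName::UnlockedDep(s) => s.as_str(),
            ExternName::LockedDep { id, .. } => id.as_str(),
        }
    }

    /// "Full string encoding" that keeps the kind and the integrity hash.
    fn full_string(&self) -> String {
        match self {
            ExternName::Kebab(s) => s.clone(),
            ExternName::Interface(id) => format!("interface({})", id),
            ExternName::UnlockedDep(id) => format!("unlocked-dep({})", id),
            ExternName::LockedDep { id, integrity: None } => format!("locked-dep({})", id),
            ExternName::LockedDep { id, integrity: Some(h) } => {
                format!("locked-dep({}, integrity={})", id, h)
            }
        }
    }

    /// Inverse of `full_string`; None for strings this sketch does not know.
    fn from_full_string(s: &str) -> Option<ExternName> {
        let inner = match s.strip_suffix(')') {
            Some(inner) => inner,
            None => return Some(ExternName::Kebab(s.to_string())),
        };
        let (label, body) = inner.split_once('(')?;
        let (id, integrity) = match body.split_once(", integrity=") {
            Some((id, h)) => (id.to_string(), Some(h.to_string())),
            None => (body.to_string(), None),
        };
        match (label, integrity) {
            ("interface", None) => Some(ExternName::Interface(id)),
            ("unlocked-dep", None) => Some(ExternName::UnlockedDep(id)),
            ("locked-dep", integrity) => Some(ExternName::LockedDep { id, integrity }),
            _ => None,
        }
    }
}
```

The AST is the enum, `key` is the short unique key, and `full_string` with `from_full_string` is the lossless encoding that a tool could use as a map key when it needs to recreate the import later.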

alexcrichton commented 10 months ago

Ok makes sense. I think I basically don't feel that there's a slam dunk in any direction. The downsides of various approaches I think are:

As-is today - tooling can't use a "map of string to thing" approach to represent imports/exports since imports can't be recreated from this representation
Everything is a string - long strings and repetition. I'm also worried that if you import something with an integrity hash then a runtime providing that same import without an integrity hash would then be incompatible. If the instantiation is auto-generated there's no worries, but I'm thinking of a wasmtime-like "here's a binary that runs wasm" environment.
One-to-one mapping from structure to strings - as you mention a high number of concepts to keep around.
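
The integrity concern under "everything is a string" can be sketched in a few lines of Rust. This assumes a host that resolves imports by exact string match against a `HashMap`; `resolve` and `without_integrity` are hypothetical helpers, and the strings follow the strawperson syntax from earlier in the thread.

```rust
use std::collections::HashMap;

/// Exact-match lookup, as a plain "map of string to thing" would do it.
fn resolve(host: &HashMap<String, u32>, import: &str) -> Option<u32> {
    host.get(import).copied()
}

/// Drop a trailing ", integrity=..." argument from a strawperson import string.
fn without_integrity(full: &str) -> String {
    match full.split_once(", integrity=") {
        Some((head, _)) => format!("{})", head),
        None => full.to_string(),
    }
}
```

If the host registers "locked-dep(foo:bar/baz)" and the component imports "locked-dep(foo:bar/baz, integrity=xx)", `resolve` on the raw import string finds nothing, while `resolve` on `without_integrity(...)` of it succeeds. Whether a runtime should do that kind of stripping is exactly the open question.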

I feel like I would lean a bit towards your most recent suggestion though. That way tools can continue to use short strings where possible to identify imports/exports if it's not necessary to recreate the import/export. Tools can then use long strings to have string-to-thing maps work well if recreation of an import is necessary. And finally the binary format is simpler as it would still encode structure.

Not exactly a simple solution but then again it seems like a complex space so not overly complex of a solution either.

guybedford commented 10 months ago

Very interesting discussion, reimagining packaging conventions is surprisingly hard, as it becomes clear how much really is just convention over specification and how much cross-interaction there is to consider.

I like the sentiment of simplifying on strings. One problem with overloading the strings too much is that normalization starts to become a little more ill-defined. The unstructured nature of the string still requires parsing, resolution and normalization operations in tooling, and so structure and convention are very quickly needed again.

If seeking to reduce structure, a middle ground might be tagged strings + arbitrary structured key / value metadata:

Starting with the existing example:

(import "foo" (func ...))
(import "interface(foo:bar/baz)" (func ...))
(import "contents(integrity=xx)" (func ...))
(import "url(xx, integrity=xx)" (func ...))
(import "relative-url(foo, integrity=xx)" (func ...))
(import "locked-dep(foo:bar/baz, integrity=xx)" (func ...))
(import "unlocked-dep(foo:bar/baz)" (func ...))

Then expressing that in the spec primitives of tagged strings (name | id | url) and arbitrary attributes:

(import (name "foo") (func ...)) ; kebab name
(import (id "foo:bar/baz") (func ...)) ; IDs as distinct from names and URLs
(import (url "sha256:xx") (func ...)) ; content addressing integrity via URL-like conventions
(import (url "xx") (attr integrity "xx") (func ...)) ; integrity is an attribute if not content-addressing
(import (url "foo") (attr integrity "xx") (func ...)) ; relative url is still a url
(import (id "foo:bar/baz@1.2.3") (attr integrity "xx") (func ...)) ; locked deps as fully constrained ids with exact versions
(import (id "foo:bar/baz") (attr constraint "^1.2") (func ...)) ; unlocked deps as non-exact ids with constraints

The important point is the spec doesn't need to specify the conventions and details, just that there are tagged strings and attributes of the forms:

name | id | url string: the kebab name, interface name, absolute or relative url depending on which case applies. Structure could possibly even be simplified further to just be strings without tagging via conventional string parsing rules, where ids are a subset of URLs of sorts.
Non-identifying key / value metadata attributes for imports only: these do not form part of the instantiation argument or of the identity, but can be used to authoritatively drive linking and resolution information. The encoding is just a list of arbitrary key / value string pairs.

While metadata keys can be arbitrary, we already have a bunch defined, so as conventions emerge they can be explicitly specified and reserved for very specific scenarios. The benefit of metadata attributes is then being able to balance having some structure with evolving conventions over time as needed.
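
One way to read "non-identifying" attributes is that equality and hashing look only at the tagged string. The Rust sketch below shows that reading; `Tagged`, `Import`, and the builder methods are hypothetical names, and the attribute keys are just the ones from the examples above.

```rust
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Hypothetical tagged string: name | id | url.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Tagged {
    Name(String),
    Id(String),
    Url(String),
}

// An import is a tagged string plus free-form key/value attributes.
#[derive(Debug, Clone)]
struct Import {
    target: Tagged,
    attrs: BTreeMap<String, String>,
}

impl Import {
    fn new(target: Tagged) -> Self {
        Import { target, attrs: BTreeMap::new() }
    }

    /// Attach one attribute, e.g. ("integrity", "xx") or ("constraint", "^1.2").
    fn attr(mut self, key: &str, value: &str) -> Self {
        self.attrs.insert(key.to_string(), value.to_string());
        self
    }
}

// Identity deliberately ignores `attrs`: only the tagged string counts.
impl PartialEq for Import {
    fn eq(&self, other: &Self) -> bool {
        self.target == other.target
    }
}
impl Eq for Import {}
impl Hash for Import {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.target.hash(state);
    }
}
```

With this shape, an import of `Id("foo:bar/baz@1.2.3")` carrying an integrity attribute and one without it collide in a `HashSet`, while the attribute stays available to whatever performs resolution.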

lukewagner commented 10 months ago

@alexcrichton Ok, if you're leaning towards keeping things the same (iiuc, the lossless mapping would be a toolchain-internal detail?), I'm happy to do that, at least until we collect more experience to suggest otherwise. (But this discussion has left me feeling like 40% in favor of the single-string approach.)

@guybedford Having generic key/value metadata seems orthogonal to the root question (of single-string vs. component-AST-level separation), since you could do generic key/value metadata either way. In general, I worry that a generic key/value metadata would end up not providing the semantics necessary for the myriad of tools we need to build to interoperate with arbitrary components. Also, it's just a matter of time before attribute naming conflicts cause someone to re-propose XML namespaces ;-)

alexcrichton commented 10 months ago

Now that you say that Luke, I'm also more in favor of single strings (I'm waffling a lot here). It feels weird to expect tooling to do one thing where the binary format and producers do something completely different (e.g. tooling strings, producers/binary structured).

To confirm though, the idea is that we remove all structure in the binary format and at the binary level we simply say "this is a string". We then basically have a set of regexes/requirements saying that the string must match one of various forms? Requiring structured strings seems like it would handle what @guybedford was mentioning too because we wouldn't run the risk of completely unstructured strings just yet.

lukewagner commented 10 months ago

To confirm though, [...]

Yep, it would be just as structured and validated as it is now, it's just that that structure would be inside a quoted string (symmetric to how we currently do <name>). I've been waffling on this too, but symmetry with <name> (where we also could've gone with the "do it in the AST" approach, but chose not to) is attractive.

alexcrichton commented 10 months ago

Ok I think I agree then that yeah we should probably go with that (lest I waffle again in another direction)

lukewagner commented 10 months ago

Cool, I'll work up a PR to discuss the concrete string format.

WebAssembly / component-model

Inferring import kind from string syntax #253