WebAssembly / annotations

Proposal for Custom Annotation Syntax in the Text Format
https://WebAssembly.github.io/annotations/
Other
19 stars 10 forks source link

Should we allow more identifiers instead of having @name? #21

Open tlively opened 10 months ago

tlively commented 10 months ago

The utility of the @name annotation comes from being able to use more than the restricted set of ASCII characters allowed in the identifiers already present in the text format. But it's kind of awkward to have both an identifier and a name annotation in the text format when only one can appear in the binary name section. The status quo where you can have arbitrary strings in the name section but cannot represent them as identifiers is also awkward. Neither allow full bidirectional round-tripping of identifiers and names.

I propose that instead of adding a @name annotation, we expand the grammar for identifiers to allow arbitrary strings. With the extended name section proposal, this would allow full bidirectional round tripping of identifiers/names.

The specific grammar for identifiers could be:

id ::= '$' idchar+ | '$' string

There would be no ambiguity in this grammar because strings always start with ", which is not included in idchar.

If this is the direction we decide to go in, I would suggest that the extension of the grammar be part of the extended name section proposal. cc @ashleynh

yuri91 commented 10 months ago

Just thinking out loud about the implementation of this in the new custom section support in the interpreter:

We can still use the current custom section handler for the name section, and for text parsing, instead of getting the names from @name annotations, get them from the identifiers in the text.

It is a bit annoying that the AST does not keep the names, so they need to be re-parsed from text, but it's doable.

rossberg commented 10 months ago

This seems fine, as long as there is no expectation that we are getting into any of the Unicode messes surrounding normalisation etc., i.e., for determining id equality, we are comparing just raw code points (which may or may not be what users might expect).

rossberg commented 10 months ago

PS: Also, a reminder that Unicode spoofing has become a real security concern. Allowing completely arbitrary Unicode strings in something as central and sensitive as identifiers is elevating that risk. It's hard to predict how relevant that is for Wasm, but I could imagine security-sensitive hand-written Wasm in the future, which makes me feel uncomfortable about this. Perhaps we should try to get the opinion of some knowledgeable security experts?

dzfrias commented 9 months ago

Just to jump into this, I've been working on an implementation of the @name annotation at WebAssembly/wabt#2297, and I think this new syntax is a good idea from both a grammar standpoint and I believe is just more intuitive in general.

Having a "binary-specific" name (from the @name annotation) detached from the actual identifier posed a few of questions while I was been working on the implementation:

Given these problems, I don't think @name helps very much when trying to roundtrip invalid names. I think the $"name" syntax is a much better idea, both for implementing a tool like wasm2wat and wat2wasm and for when writing WAT files in general.

alexcrichton commented 9 months ago

Currently in wasm-tools when a wasm file is printed any names of functions which are not valid identifiers cause a (@name ...) annotation to get printed along with a synthesized $... identifer that is as well valid. I personally find the idea here, being able to print $"...", is a great idea as it would help simplify this and make it easier to read as well.

One gotcha though is that the custom name section has no requirements on uniqueness of identifiers but the text format does have requirements on this. For example this module:

(module
  (func (@name "foo"))
  (func (@name "foo"))
)

is not valid to print with $foo (or $"foo") for both identifiers. This means that if the (@name ...) annotation were removed then I don't think there would be a remaining means to print a module like this in the text representation.

For comparison, currently wasm-tools print over the above module prints:

(module
  (type (;0;) (func))
  (func $foo (;0;) (type 0))
  (func $#func1<foo> (@name "foo") (;1;) (type 0))
)

where the second function gets @name to indicate its name in addition to a synthesized $-name.

tlively commented 9 months ago

Ahh, good point that we so far would not be able to round-trip binaries with repeated names in the name section. We could solve that by allowing repeated identifiers in the text format and using indices (with comments) rather than identifiers to disambiguate at the use sites.

tlively commented 3 months ago

@rossberg, what do we need to make progress here? Given that the names section is used only as a debugging aid, I haven't heard anyone else voice concerns about unicode spoofing risks.

rossberg commented 3 months ago

I'm okay doing this change, but it requires some spec work. The easy part is to extend the definition of identifiers to

id ::= '$' idchar+ | '$' name

But that's far from enough. The text format spec currently relies on equality of identifiers being completely syntactic. If we allow strings, then syntax and denotation can differ, so we'll need to introduce equivalence classes, and need to define that elaboration. For example $"\41" is equal not just to itself but also to $"A" (and presumably $A).

bvisness commented 1 month ago

After today's CG discussion, I think it makes more sense to keep the @name annotation, since it doesn't seem like identifiers will ever be able to express the full range of what the name section allows.

I would also suggest that we shouldn't worry about round-tripping identifiers. It's not a bad thing to put identifiers in the name section, or to pull identifiers from the name section, but identifiers don't really correspond to anything in the source language and hence I don't think it's that critical to preserve them. (To that end, I don't think I see the value of quoted identifiers if also keeping @name.)

rossberg commented 1 month ago

I was about to merge #23, but I don't mind either way. They do add some complexity to the spec, o.t.o.h. it's handy to have an escape mechanism for richer Wasm identifiers.

How strongly does everybody feel?

tlively commented 1 month ago

I very much want the extended identifiers, even though we're also keeping the name annotation. For the common case where there are not multiple or repeated names, the names can match 1:1, so being able to use the arbitrary names as identifiers is useful. If we don't have the extended identifier syntax, then tools would have to do their best to synthesize meaningful identifiers for names that are not also valid identifiers.

alexcrichton commented 1 month ago

I would also cast my vote in favor of extended identifiers. Synthesized identifiers are (subjectively to me) much harder to read than $"foo" and not having to synthesize an identifier at all would be useful.

rossberg commented 1 month ago

Okay, thanks, I'll move ahead with merging the PR then.