Open tlively opened 10 months ago
Just thinking out loud about the implementation of this in the new custom section support in the interpreter:
We can still use the current custom section handler for the name section, and for text parsing, instead of getting the names from @name
annotations, get them from the identifiers in the text.
It is a bit annoying that the AST does not keep the names, so they need to be re-parsed from text, but it's doable.
This seems fine, as long as there is no expectation that we are getting into any of the Unicode messes surrounding normalisation etc., i.e., for determining id equality, we are comparing just raw code points (which may or may not be what users might expect).
PS: Also, a reminder that Unicode spoofing has become a real security concern. Allowing completely arbitrary Unicode strings in something as central and sensitive as identifiers is elevating that risk. It's hard to predict how relevant that is for Wasm, but I could imagine security-sensitive hand-written Wasm in the future, which makes me feel uncomfortable about this. Perhaps we should try to get the opinion of some knowledgeable security experts?
Just to jump into this, I've been working on an implementation of the @name
annotation at WebAssembly/wabt#2297, and I think this new syntax is a good idea from both a grammar standpoint and I believe is just more intuitive in general.
Having a "binary-specific" name (from the @name
annotation) detached from the actual identifier posed a few of questions while I was been working on the implementation:
@name
be valid when there's no corresponding $name
when writing an identifier for a function, for example? I'd assume not, or else the grammar would have to change to support @name
as a substitute for $name
$name
form if both $name
and @name
are required?Given these problems, I don't think @name
helps very much when trying to roundtrip invalid names. I think the $"name"
syntax is a much better idea, both for implementing a tool like wasm2wat
and wat2wasm
and for when writing WAT files in general.
Currently in wasm-tools when a wasm file is printed any names of functions which are not valid identifiers cause a (@name ...)
annotation to get printed along with a synthesized $...
identifer that is as well valid. I personally find the idea here, being able to print $"..."
, is a great idea as it would help simplify this and make it easier to read as well.
One gotcha though is that the custom name section has no requirements on uniqueness of identifiers but the text format does have requirements on this. For example this module:
(module
(func (@name "foo"))
(func (@name "foo"))
)
is not valid to print with $foo
(or $"foo"
) for both identifiers. This means that if the (@name ...)
annotation were removed then I don't think there would be a remaining means to print a module like this in the text representation.
For comparison, currently wasm-tools print
over the above module prints:
(module
(type (;0;) (func))
(func $foo (;0;) (type 0))
(func $#func1<foo> (@name "foo") (;1;) (type 0))
)
where the second function gets @name
to indicate its name in addition to a synthesized $
-name.
Ahh, good point that we so far would not be able to round-trip binaries with repeated names in the name section. We could solve that by allowing repeated identifiers in the text format and using indices (with comments) rather than identifiers to disambiguate at the use sites.
@rossberg, what do we need to make progress here? Given that the names section is used only as a debugging aid, I haven't heard anyone else voice concerns about unicode spoofing risks.
I'm okay doing this change, but it requires some spec work. The easy part is to extend the definition of identifiers to
id ::= '$' idchar+ | '$' name
But that's far from enough. The text format spec currently relies on equality of identifiers being completely syntactic. If we allow strings, then syntax and denotation can differ, so we'll need to introduce equivalence classes, and need to define that elaboration. For example $"\41"
is equal not just to itself but also to $"A"
(and presumably $A
).
After today's CG discussion, I think it makes more sense to keep the @name
annotation, since it doesn't seem like identifiers will ever be able to express the full range of what the name section allows.
I would also suggest that we shouldn't worry about round-tripping identifiers. It's not a bad thing to put identifiers in the name section, or to pull identifiers from the name section, but identifiers don't really correspond to anything in the source language and hence I don't think it's that critical to preserve them. (To that end, I don't think I see the value of quoted identifiers if also keeping @name
.)
I was about to merge #23, but I don't mind either way. They do add some complexity to the spec, o.t.o.h. it's handy to have an escape mechanism for richer Wasm identifiers.
How strongly does everybody feel?
I very much want the extended identifiers, even though we're also keeping the name annotation. For the common case where there are not multiple or repeated names, the names can match 1:1, so being able to use the arbitrary names as identifiers is useful. If we don't have the extended identifier syntax, then tools would have to do their best to synthesize meaningful identifiers for names that are not also valid identifiers.
I would also cast my vote in favor of extended identifiers. Synthesized identifiers are (subjectively to me) much harder to read than $"foo"
and not having to synthesize an identifier at all would be useful.
Okay, thanks, I'll move ahead with merging the PR then.
The utility of the
@name
annotation comes from being able to use more than the restricted set of ASCII characters allowed in the identifiers already present in the text format. But it's kind of awkward to have both an identifier and a name annotation in the text format when only one can appear in the binary name section. The status quo where you can have arbitrary strings in the name section but cannot represent them as identifiers is also awkward. Neither allow full bidirectional round-tripping of identifiers and names.I propose that instead of adding a @name annotation, we expand the grammar for identifiers to allow arbitrary strings. With the extended name section proposal, this would allow full bidirectional round tripping of identifiers/names.
The specific grammar for identifiers could be:
There would be no ambiguity in this grammar because strings always start with
"
, which is not included inidchar
.If this is the direction we decide to go in, I would suggest that the extension of the grammar be part of the extended name section proposal. cc @ashleynh