Disambiguating the type of the integral token in `f32 0` and `i32 0`.

carlsmith commented 3 years ago

The Wasm spec allows float literals to be expressed as a sequence of ASCII digits (ignoring single underscores between the digits and any sign prefix). The dot is not required, so `0` can be a float or an unsigned integer, while `+5` and `-123` can be floats or signed integers.
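To make the overlap concrete, here is a minimal Python sketch. The two regular expressions are simplified approximations of the decimal integer and float grammars (hex literals, `inf` and `nan` are left out), and the names `INT_RE`, `FLOAT_RE` and `classifications` are made up for illustration, not taken from the spec or any tool.

```python
import re

# Simplified, decimal-only approximations of the WAT literal grammars.
# NUM: digits with optional single underscores between them.
NUM = r"[0-9](?:_?[0-9])*"
SIGN = r"[+-]?"
INT_RE = re.compile(rf"{SIGN}{NUM}")
# A float is a NUM with an optional fraction and an optional exponent,
# so every string that matches INT_RE also matches FLOAT_RE.
FLOAT_RE = re.compile(rf"{SIGN}{NUM}(?:\.(?:{NUM})?)?(?:[Ee]{SIGN}{NUM})?")


def classifications(text):
    """Return every literal kind that the whole of `text` can be read as."""
    kinds = []
    if INT_RE.fullmatch(text):
        kinds.append("int")
    if FLOAT_RE.fullmatch(text):
        kinds.append("float")
    return kinds


print(classifications("0"))     # matches both grammars
print(classifications("-123"))  # matches both grammars
print(classifications("0.0"))   # only the float grammar
```

Under these simplified patterns, only a dot or an exponent makes a decimal literal unambiguously a float.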

I learnt from this issue that the reference interpreter just tokenizes the ambiguous cases as integer types, and relies on the parser (that has the context) looking for misclassified integrals when parsing stuff like f64 50.
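A rough Python sketch of that division of labour follows. It is not the reference interpreter's code, only an illustration under simplifying assumptions (whitespace-separated tokens, decimal literals only); `lex` and `parse_operand` are hypothetical names.

```python
import re

NUM = r"[0-9](?:_?[0-9])*"
INT_RE = re.compile(rf"[+-]?{NUM}")
FLOAT_RE = re.compile(rf"[+-]?{NUM}(?:\.(?:{NUM})?)?(?:[Ee][+-]?{NUM})?")


def lex(source):
    """Context-free lexer: an ambiguous literal is always tagged 'int'."""
    tokens = []
    for word in source.split():
        if INT_RE.fullmatch(word):
            tokens.append(("int", word))
        elif FLOAT_RE.fullmatch(word):
            tokens.append(("float", word))
        else:
            tokens.append(("keyword", word))
    return tokens


def parse_operand(type_token, literal_token):
    """Parser step: the type keyword supplies the missing context."""
    _, type_name = type_token
    kind, text = literal_token
    digits = text.replace("_", "")
    if type_name in ("f32", "f64"):
        # An 'int' token after a float type is a misclassified float literal.
        if kind not in ("int", "float"):
            raise SyntaxError(f"expected a float literal, got {text!r}")
        return ("float", float(digits))
    if type_name in ("i32", "i64"):
        if kind != "int":
            raise SyntaxError(f"expected an integer literal, got {text!r}")
        return ("int", int(digits))
    raise SyntaxError(f"unknown type {type_name!r}")


tokens = lex("f64 50")
print(tokens)                  # the literal is tagged 'int' by the lexer
print(parse_operand(*tokens))  # the parser reads it as a float
```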

Regarding the annotations proposal, what happens to code like this?

(@id f32 0 i32 0)

Annotations cannot be parsed to contextualize and reclassify the first zero as a float literal, as annotations allow arbitrary sequences of tokens, without context.

By removing the context that the parser (and by extension, the tokenizer) depend on to disambiguate tokens, it becomes impossible to disambiguate certain tokens that are legal in an annotation. Anyone processing an annotation as an array of tokens will not be able to classify them unambiguously (if they rely on the WAT abstract grammar).

Until now, users have not generally needed to deal with WAT token streams directly, but the Annotations proposal makes it important that they can. They should be able to define a rule that says an f64 token must be followed by a float literal, and not need edge cases for reclassifying signed and unsigned integer tokens.

Ideally, ambiguous literals could be phased out of WAT altogether. If that cannot be done, then the Annotation spec should at least state that ambiguous literals are strictly interpreted as integers inside of an annotation. Having to write 0.0 inside of annotations (when 0 is fine everywhere else) is not ideal, but at least every token could then be correctly classified (based only on the spec).

Edit: I originally suggested predicating support for annotations on not using ambiguous literals, so f64 0 would be illegal anywhere within a module that contains an annotation. On reflection, I'm much less sure about that approach, but still think WAT has a design flaw that is worth trying to fix, and is especially problematic for the Annotations proposal.

carlsmith commented 3 years ago

The issue that's linked to above discusses the number literal thing at some length. The conclusion was that this is all fundamentally to do with abstract definitions in the grammar, and should not create any real issues for annotations (or working with them).

I've suggested they close that issue, and keep this one open, just to allow people working on this proposal to weigh in, or also close, if they agree it's a non-issue.

conrad-watt commented 3 years ago

Since this is only a (minor conceptual) issue with lexing, I suggested a small change to the top-level definition of token and annotelem in https://github.com/WebAssembly/design/issues/1376#issuecomment-691675967, so that their shallow definitions correspond to a less ambiguous interface of lex-time token types. This is just a minor matter of "soft communication" though, since the spec isn't attempting to explicitly standardise a standalone lexer interface.

As @carlsmith and I discussed in the linked issue, the parser already has to disambiguate numeric tokens via contextual information beyond just distinguishing int vs float (for example, when enforcing i32 range limits). Moreover, even int vs float disambiguation is not required for annotations (which only need to be coarsely lexed to check for invalid characters/nesting before the tool-specific handling can take over).
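As a hedged illustration of what "coarsely lexed" could mean, the Python sketch below accepts an annotation if it starts with `(@`, its parentheses balance outside string literals, and every character outside strings is printable ASCII or whitespace. The function name `coarse_check` and the exact checks are assumptions for this sketch; comments and the spec's full character rules are not handled.

```python
def coarse_check(annotation):
    """Coarse well-formedness check; numeric tokens are never classified."""
    if not annotation.startswith("(@"):
        return False
    depth = 0
    in_string = False
    i = 0
    while i < len(annotation):
        ch = annotation[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
            # The annotation must end at its own closing parenthesis.
            if depth == 0 and i != len(annotation) - 1:
                return False
        elif ch not in " \t\n\r" and not (33 <= ord(ch) <= 126):
            return False  # character not allowed outside a string
        i += 1
    return depth == 0 and not in_string


print(coarse_check("(@id f32 0 i32 0)"))  # balanced, so accepted
print(coarse_check("(@id f32 0"))         # unbalanced, so rejected
```

Nothing in this check needs to know whether either `0` is an integer or a float.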

carlsmith commented 3 years ago

Just to be clear, I didn't understand @conrad-watt's position at first, on a purely technical level, but I fully agree now.

rossberg commented 3 years ago

Hm, I don't quite understand the concern. This is just lexical syntax, defined by regular expressions (where | has no requirement to be disjoint, but otoh, any regexp has to be substitutable). The spec never assigns any meaning to tokens per se, nor any form of classification, and intentionally so -- because annotations are even more flexible than you seem to assume.

In particular, it's a totally plausible scenario that a tool interpreting specific custom annotations will use a completely different lexer for their contents. All a custom definition for an annotation syntax has to ensure is that its overall grammar is covered by the generic grammar in the Wasm spec. But it might use completely different tokenization! That is fine and an intended use case.

For example, imagine I want to embed arithmetic expressions in an annotation. Then I would want to define tokens like +, -, *, etc. Moreover, I would want to lex x*-2 as four separate tokens inside the annotation, whereas it's only considered one by a generic Wasm lexer.
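A small Python sketch of the contrast, under stated assumptions: `GENERIC_TOKEN` approximates a generic Wasm lexer by grouping any run of the spec's idchar characters into one token, and `ARITH_TOKEN` is a hypothetical tokenization that a tool might define for its own arithmetic annotation. Neither pattern comes from an existing tool.

```python
import re

# Approximation of the generic lexer: a maximal run of idchar characters
# (digits, letters and the listed punctuation) forms a single token.
IDCHAR = r"[0-9A-Za-z!#$%&'*+\-./:<=>?@\\^_`|~]"
GENERIC_TOKEN = re.compile(rf"{IDCHAR}+")

# Hypothetical tool-specific lexer for arithmetic inside an annotation:
# identifiers, decimal numbers, and single-character operators or parens.
ARITH_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[-+*/()]")


def generic_lex(text):
    return GENERIC_TOKEN.findall(text)


def arith_lex(text):
    return ARITH_TOKEN.findall(text)


print(generic_lex("x*-2"))  # one token
print(arith_lex("x*-2"))    # four tokens: x, *, -, 2
```

Both tokenizations accept the same characters, so the tool-specific one stays within what the generic grammar admits for this input.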

In general, a parser either understands a specific annotation syntax, but then it's not unlikely to switch to a different tokenization for it anyway, or it doesn't, in which case it doesn't care about the specific tokenization. Consequently, it wouldn't even seem particularly useful for the Wasm spec to suggest a specific token classification for annotations.

conrad-watt commented 3 years ago

I think @carlsmith's perspective comes from caring about the engineering of a standalone lexer which offers a standard API surface to separately-designed tools. In this situation the lexer can't tailor its tokenisation to the requirements of subsequent stages, and it makes a kind of sense for such a lexer to want to align its token types with the token and annotelem definitions in the official spec.

I agree this was never something the spec explicitly cared about standardising - so it's legitimate to resolve this as being beyond our scope. That being said, in this case it is easy to make the (shallow) cases of token and annotelem "more disjoint" as a "helpful push".

carlsmith commented 3 years ago

@conrad-watt pretty much said it already. My perspective has changed since I opened the issue, and I now agree that this is a conceptual confusion that revolves around what you expect from the spec. There is perhaps an opportunity to improve the spec, but nothing really needs to change.

WebAssembly / annotations, issue #12