Binary encoding of annotations

binji commented 5 years ago

In the May 14 CG meeting, there was some discussion about how best to roundtrip an annotation through the binary format (i.e. text -> binary -> text), and how to associate it with a particular node in the text source. (Did I understand correctly?)

cc @titzer @fgmccabe @rossberg @jgravelle-google

fgmccabe commented 5 years ago

That is part of it from my pov. The other part is whether to formalize how annotations are attached to different elements of a wasm module. E.g., to functions, modules, imports, parameters etc.

The model for this is the JVM, where annotation processing is common.

rossberg commented 5 years ago

To clarify the goals of the annotation syntax, they are the following:

1. Have a user-friendly way to represent certain custom sections in text format.
2. Added bonus: allow round-tripping binary-text-binary.

Non-goals are:
A. Changing the notion of custom sections in the binary format.
B. Round-tripping text-binary-text in the presence of annotations that a tool does not understand.

In short, annotations are intended as a way to represent custom sections in the text format, not a new way of providing custom information.

Almost the entire discussion has been about the latter, which is out of scope for this proposal.

Wrt that discussion, I think we are talking about an intractable problem. There was the suggestion of associating annotations in custom sections with specific elements of a module in a generic fashion that all tools would understand and that would reflect in-place textual annotations 1-to-1. But AFAICS, that has serious problems:

- Wasm binaries are not just sequences of byte codes, they represent non-trivial ASTs. A generic format for referencing all kinds of AST nodes would likely be verbose or inconvenient to use.
- Would we force all custom sections to either choose this complicated format or forbid them to define in-place annotation syntax?
- Existing custom sections, like the name section, already do not follow this format, yet would benefit from convenient in-place text representation.
- We'd need a backwards-compatible way to distinguish this new kind of structured custom sections.
- Even if we ignored all that, any tool that needs to perform even the slightest modification/transformation of a module still has no way of knowing how that ought to affect custom annotations that it does not understand.

The last point in particular is a fundamental problem that no amount of design sophistication can overcome. It's simply impossible.

To me this smells of over-engineering. We deliberately made custom sections as simple and generic as they are. Imposing something way more complicated now would likely be counter-productive.

fgmccabe commented 5 years ago

If the round-tripping is the primary concern, then I suggest removing annotations from other sections. I.e., no @name annotations. The issue about 'active comments' is serious. This is a serious issue for JS today: should a tool preserve JS comments? If one does not, then enough JS processors rely on 'comments' to break this. (Not all processing of wasm will be by the authors of the module.)

jgravelle-google commented 5 years ago

I'm less pessimistic that the custom information description problem is completely intractable, but I'm beyond certain that it is entirely orthogonal to this proposal. In my head it's doable via some form of DSL that can describe the semantics that need preserving for all forms of transformation a tool might perform. It's also made more tractable by subsetting the list of allowable transformations (e.g. you can move an opcode, but changing the instruction isn't allowed). The minimal use case here is to preserve debug information through a tool that does not model debug information, and performs destructive updates to the binary (e.g., optimization). This is very handwavey because I haven't thought about this at all, because it is wildly out of scope for this proposal.

The issue about 'active comments' is serious. This is a serious issue for JS today: should a tool preserve JS comments? If one does not, then enough JS processors rely on 'comments' to break this.

That is the wrong question. Preserving the comments through the tool is not the intention, but allowing the tool to read the comments is. Meaning, the tool should not (indeed, cannot) be expected to preserve arbitrary comments. But a tool may interpret a subset of the comments in a module.

Concretely, the WebIDL Bindings proposal uses @webidl comments to represent the custom section. Tools that understand those annotations can then synthesize the correct bytes of the custom section. Specifically, a custom parser I wrote for this purpose is able to interpret that syntax and emit a custom section. Meanwhile, wabt's parser is able to read the entire module, including that custom syntax, but because it doesn't understand that section it discards those tokens and produces a valid .wasm, absent the custom section. I then combine the output with another tool. If I then use wabt's disassembler on the final .wasm module, it does not synthesize the @webidl annotations back out, because it continues to not understand them.

That is a really verbose way to say "different tools do different things, and care about different information."
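
The "read it, but discard what you don't understand" behaviour described above can be sketched in a few lines. This is only an illustration, not wabt's actual implementation: the function names are hypothetical, and the sketch assumes the input contains no `;;` or `(; ;)` comments and that an annotation is any parenthesized form starting with `(@id`.

```python
def _skip_string(src: str, start: int) -> int:
    """Return the index just past the string literal that opens at `start`."""
    i = start + 1
    while i < len(src):
        if src[i] == "\\":
            i += 2  # skip the escaped character
        elif src[i] == '"':
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated string")


def _find_matching_paren(src: str, start: int) -> int:
    """Return the index just past the ')' that closes the '(' at `start`."""
    depth = 0
    i = start
    while i < len(src):
        c = src[i]
        if c == '"':
            i = _skip_string(src, i)  # parentheses inside strings do not count
            continue
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced parentheses")


def _read_id(src: str, start: int) -> str:
    """Read the annotation id that follows '(@'."""
    i = start
    while i < len(src) and not src[i].isspace() and src[i] not in '()"':
        i += 1
    return src[start:i]


def strip_unknown_annotations(source: str, known: frozenset = frozenset()) -> str:
    """Drop every (@id ...) form whose id is not in `known`; keep the rest."""
    out = []
    i = 0
    while i < len(source):
        if source.startswith("(@", i):
            end = _find_matching_paren(source, i)
            if _read_id(source, i + 2) in known:
                out.append(source[i:end])
            i = end
        elif source[i] == '"':
            end = _skip_string(source, i)
            out.append(source[i:end])
            i = end
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

A tool that models `@name` but not `@webidl` would call `strip_unknown_annotations(text, frozenset({"name"}))`; a tool that models neither would pass no `known` set and see the plain module.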

I'm thinking that "this allows custom sections to round-trip" has become confused with "this mandates that annotations round-trip". The latter is the JVM model, the former is this proposal. This simply provides a primitive to extend the text format in tool-specified ways. And nothing more.

rossberg commented 5 years ago

I kind of regret ever having mentioned round-tripping. Probably my mistake to give the impression that that was an important motivation for this proposal. :)

rossberg commented 5 years ago

To see why the robust annotation problem is impossible to solve in any interesting generality, it may be necessary to abstract a little.

Wasm is a programming language. A language consists of two parts: syntax and semantics. A given definition of custom section or annotations essentially extends Wasm the language with both syntax and semantics of some form.

Any of the ideas we have been discussing can only ever hope to make custom syntax (partially) understood by tools. There is no way a tool can second-guess an unknown semantics.

However, transforming syntax has to be done in a way that maintains semantics. If you don't know what that semantics is, you cannot maintain it. It is the exception rather than the rule that a semantics is so trivial that any syntactically correct program also is semantically correct (and moreover, equivalent to the original).

To give a concrete example: one application for custom sections that I have been discussing with various folks is typing. You could refine Wasm's type system by overlaying it with more precise or rigid rules, ensuring additional properties, e.g., security ones like information flow isolation or the absence of out-of-bounds errors. That would require encoding additional type annotations in various places of a program that a custom module manager would check beforehand. We cannot possibly expect a tool to be able to transform such a program while maintaining well-typedness (a semantic condition) under this custom type system.

That may be an advanced use case, but the basic observation applies universally: in the presence of any non-trivial semantics it is insufficient to just maintain syntactic coherence (and thus, IMO, pointless to go to great lengths to try).

titzer commented 5 years ago

I generally agree that the general problem of understanding and transforming arbitrary annotations is intractable and shouldn't be solved by this proposal. Interpreting annotations is indeed a matter left to tools. However, tools have to play nice together, and "just drop if you don't understand" fundamentally makes all tools non-interoperable. That's the wrong default, IMHO. Instead, I think the default should be "preserve if you don't understand". I also think that designing an annotation mechanism around the text format won't scale to large modules because the text format is so much larger than binary; eventually we would want binary annotations too.

The tractable part of preservation is maintaining the mapping of annotations to their locations in the syntax, which necessitates a binary encoding that can reproduce the exact location and contents of annotations. This isn't as hard as it seems at first. Java does this. There are plenty of ways to accomplish this, e.g. by having a binary "annotations" section that refers to other sections and lists which annotations to insert where, keyed by location: a byte offset within a function body, a parameter to a function, a function start, a section start, etc. It's essentially an index of annotations, and could be organized by either syntax location, or annotation type, or otherwise. It can be densely encoded yet be inflated to match the original textual annotations.
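
As a rough illustration of the "index of annotations" idea, here is a minimal in-memory sketch. Nothing in it is part of any proposal: the `Target` kinds, the `Entry` fields, and the function names are all hypothetical, chosen only to mirror the locations listed above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    # Hypothetical location kinds, taken from the examples in the comment.
    SECTION_START = "section start"
    FUNCTION_START = "function start"
    FUNCTION_PARAM = "function parameter"
    BODY_OFFSET = "byte offset in function body"


@dataclass(frozen=True)
class Entry:
    target: Target
    index: int    # section id or function index, depending on target
    offset: int   # parameter position or byte offset; 0 when unused
    name: str     # annotation id, e.g. "name"
    payload: str  # annotation contents as they appeared in the text


def by_location(entries):
    """Organize the index by syntax location."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[(entry.target, entry.index, entry.offset)].append(entry)
    return dict(grouped)


def by_name(entries):
    """Organize the same index by annotation type instead."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.name].append(entry)
    return dict(grouped)


def inflate(entry):
    """Reproduce the textual annotation for one index entry."""
    return f"(@{entry.name} {entry.payload})"
```

The sketch only records where an annotation sat and what it said; it does nothing to keep those locations valid once a tool rewrites the module.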

It seems weird to me that we would define a text format for a syntax tree and then modify that syntax tree with additional syntactic nodes that are both discarded by default and not preserved in the binary format. Especially if that defines tokens and syntactic elements that must be at least parsed by a text parser. That's not really syntax then; it's comments, but more restrictive than comments in that it enforces a syntactic structure that comments don't.

In short, I think full roundtripping of annotations to binary and text is the only reason to standardize an annotation proposal at all.

rossberg commented 5 years ago

@titzer, the purpose of this proposal is to provide a generic way to represent custom sections in the text format in human-readable form. What alternative do you propose?

A few more comments:

- Round-tripping text-binary-text was never a design goal of the text format AFAICT, and has never fully worked, since various sugarings get lost on the way.
- Round-tripping binary-text-binary does not currently work either, because custom sections cannot be represented in text. With this proposal, at least that direction would work.
- Custom annotations do not "modify the syntax tree" any more than custom sections "modify" the binary byte stream.
- You cannot "preserve if you don't understand". Because you cannot know what "preserve" means if you don't understand.
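
To make the binary-text-binary direction concrete, here is a small sketch. The binary side follows the Wasm binary format for custom sections (section id 0, a LEB128 size, then a length-prefixed UTF-8 name followed by the payload). The text side is an assumption for illustration only: it spells the section as `(@custom "name" "bytes")` with every payload byte written as a `\hh` escape, and it assumes the name contains no quotes or backslashes.

```python
import re


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_uleb128(data: bytes, pos: int):
    """Decode unsigned LEB128 at `pos`; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_custom_section(name: str, payload: bytes) -> bytes:
    raw_name = name.encode("utf-8")
    body = uleb128(len(raw_name)) + raw_name + payload
    return b"\x00" + uleb128(len(body)) + body


def decode_custom_section(data: bytes):
    if data[0] != 0:
        raise ValueError("not a custom section")
    size, pos = read_uleb128(data, 1)
    end = pos + size
    name_len, pos = read_uleb128(data, pos)
    name = data[pos:pos + name_len].decode("utf-8")
    return name, data[pos + name_len:end]


def to_text(name: str, payload: bytes) -> str:
    # Illustrative text spelling; every byte becomes a \hh escape.
    escaped = "".join(f"\\{b:02x}" for b in payload)
    return f'(@custom "{name}" "{escaped}")'


_TEXT = re.compile(r'\(@custom "([^"\\]*)" "((?:\\[0-9a-f]{2})*)"\)')


def from_text(text: str):
    match = _TEXT.fullmatch(text)
    if match is None:
        raise ValueError("not a recognized @custom annotation")
    name, escaped = match.groups()
    return name, bytes.fromhex(escaped.replace("\\", ""))
```

Round-tripping here means `encode_custom_section(*from_text(to_text(*decode_custom_section(section))))` yields the original section bytes. The tool never has to interpret the payload for this direction to work.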

titzer commented 5 years ago

I understand the highest priority item is to roundtrip custom sections. What is the role of annotations that are attached to expressions within function bodies and elsewhere? Perhaps we can split that part out?

rossberg commented 5 years ago

Ah, annotations aren't syntactically attached to anything per this proposal. They can be lexically inserted anywhere in a source file, just like comments. There are no a priori rules or semantics regarding placement or interpretation whatsoever. Just as with custom sections. But tools may impose certain requirements on those they want to interpret. Again, just as with custom sections.

So I'm not sure what can be split out. It's already as minimal as it can possibly get.

fgmccabe commented 5 years ago

I do not see the problem with “preserve” if you do not understand. It can be modeled as “this entity has an annotation”. If a tool purports to transform any entity, it must understand that entity, including the presence of annotations. If it cannot handle the annotation (if, for example, the tool materially modifies the semantics of the same entity), then it is as though the tool is not recognizing the entirety.

rossberg commented 5 years ago

@fgmccabe, how do you know that the contents of an annotation do not depend on other entities? For example, it refers to some definition, local, block, type; assumes some value, type, offset, size? How do you know that a local change to one entity does not affect annotations on other entities?

fgmccabe commented 5 years ago

Based on my admittedly limited experience with comments in JS, and annotations in Java, this is part of the process. When a third party designs an annotation scheme, he/she needs to be aware of the potential impact on tools. I see no reason why annotations are special here: your arguments apply to wasm itself too. (to take an example, a tool that processes wasm to remove common sub-expressions had better understand the full implications of that).

rossberg commented 5 years ago

Well, the whole purpose of custom sections was that some parties can add additional stuff to their binaries without having to coordinate with all tool writers in the universe. That would require collaboration to the degree of quasi standardisation, which defeats the purpose.

jgravelle-google commented 5 years ago

However, tools have to play nice together, and "just drop if you don't understand" fundamentally makes all tools non-interoperable. That's the wrong default, IMHO. Instead, I think the default should be "preserve if you don't understand".

Preserving unmodeled annotations or custom sections is an incredibly dangerous game, if your tool makes any transformations.

To me, this + roundtripping can be solved with two extremely simple conventions.

(@custom bytes "string of bytes to roundtrip, drop if modified")
(@custom metadata "string of bytes to preserve no matter what")

A tool doesn't need to understand the contents of an @custom annotation in order to preserve it; it just needs to understand the @custom. This should be less work than any more-complex proposal, so it's believable that all tools will implement that minimal convention.
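
A tool honoring these two conventions might look roughly like the following sketch. The `CustomAnnotation` type and `carry_over` function are hypothetical names, and dropping any kind other than `bytes` or `metadata` is an assumption of the sketch, not something the convention specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomAnnotation:
    kind: str  # "bytes" or "metadata", as in the two conventions above
    data: str  # opaque contents; the tool never looks inside


def carry_over(annotations, module_modified: bool):
    """Decide which @custom annotations survive a pass over the module."""
    kept = []
    for annotation in annotations:
        if annotation.kind == "metadata":
            # Preserved no matter what the tool did to the module.
            kept.append(annotation)
        elif annotation.kind == "bytes" and not module_modified:
            # Round-tripped only while the module is untouched.
            kept.append(annotation)
        # Anything else is dropped (assumption: unknown kinds are not kept).
    return kept
```

The tool only inspects `kind`, which is the point of the convention: the decision needs no knowledge of what the bytes mean.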

It's important to remember that tools want to be as interoperable as is reasonable. This proposal gives tools an additional primitive by which to coordinate.

I also think that designing an annotation mechanism around the text format won't scale to large modules because the text format is so much larger than binary; eventually we would want binary annotations too.

To me this is a non-sequitur, which makes me think we have very different understandings of the problem this proposal is attempting to solve, so I want to dig a little deeper here. To me, this proposal is fundamentally, and strictly, about adding additional, non-standardized notation to the text format. The only reflection of that in the binary format is dependent on how a given tool interprets said notation.

Also, we already do have such a mechanism for the binary format. Custom sections. Tools drop custom sections they don't understand as well. Or preserve them, depending on what the tool does. This is already a consideration we make.

That's not really syntax then; it's comments, but more restrictive than comments in that it enforces a syntactic structure that comments don't.

Yes. That is the point. The restricted comment structure allows the lexer to produce tokens that the parser can reason about, or not. It was more trivial to implement than to argue for.

rossberg commented 5 years ago

@jgravelle-google:

To me, this + roundtripping can be solved with two extremely simple conventions.

(@custom bytes "string of bytes to roundtrip, drop if modified")
(@custom metadata "string of bytes to preserve no matter what")

Even that would already go beyond what the binary format currently offers, since it cannot represent that distinction.

jgravelle-google commented 5 years ago

Even that would already go beyond what the binary format currently offers, since it cannot represent that distinction.

Which is a useful property to have.

Showerthought: a custom section that is itself an index of other custom sections, saying whether to drop or preserve by default. Or extend that to "preserve section X under conditions Y".

tlively commented 6 months ago

Let's close this issue as out-of-scope. There are a bunch of use cases (e.g. branch hinting, compilation hints, various ideas in Binaryen) that depend on this proposal and none of them require a generic binary format for arbitrary annotations.

tlively commented 5 months ago

ping @rossberg. I don't have permissions to close this myself.

