WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

Custom sections in the text format #1153

Open copy opened 6 years ago

copy commented 6 years ago

The design document on the text format says:

WebAssembly will define a standardized text format that encodes a WebAssembly module with all its contained definitions in a way that is equivalent to the binary format.

However, the specification doesn't say how to encode custom sections (https://webassembly.github.io/spec/text/modules.html#text-module), and wasm2wat ignores them.

binji commented 6 years ago

Good point, there should be a way to specify these sections in the text format. It seems like this was probably discussed in the past, but I can't remember where that may have been.

@rossberg, any thoughts on this?

rossberg commented 6 years ago

Well, a couple of issues with expressing custom sections directly:

In general, the custom section format is more like a detail of the binary format. The assumption was that relevant custom sections are not written verbatim but rather synthesised from the text format, like the name section or the binding section.

What we should think about, then, is a generic syntax for annotations that can be put anywhere in the syntax tree. That would already be needed by the binding proposal. My suggestion would be to use nodes of the form (@id ...) that would be allowed anywhere in an S-expr and are uninterpreted by the core spec. The id would roughly correspond to a certain custom section, so that it is generic and extensible in a similar manner. A given tool may choose to interpret certain annotations and turn them into custom sections according to some separate spec. With global annotations in the module body such a spec could even enable spelling custom sections almost verbatim (up to their position in the binary).

WDYT?

lukewagner commented 6 years ago

Since the core spec does have a defined notion of a custom section, I think it makes sense to give a fully-specified representation in the text format. While it's true that we'd have a hard time expressing the precise placement of the custom section, I expect it's fine to just say that (custom ...) sections are just appended to the end of the module in the order they are encountered.

(Honestly, I wonder about the utility of allowing custom sections anywhere but at the end; I bet we could remove that "feature" and nothing would break.)

binji commented 6 years ago

The text format does not reflect the section order of the binary format. Hence, it is not clear how it could express where custom sections are supposed to go.

All known sections have to be ordered, so you could just use a number to specify which known section it comes after. Something like (custom 0 ...) would come before the type section (1). (custom 3 ...) would go after the function section (3) and before the table section (4).
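The numeric scheme above can be sketched as follows. This is only an illustration, assuming the MVP section ids 1 through 11 and reading `(custom N ...)` as "place after known section N"; the `layout` function and the `custom:` labels are hypothetical names, not part of any tool.

```python
# Known section ids of the MVP binary format, in their required order.
KNOWN_SECTIONS = {
    1: "type", 2: "import", 3: "function", 4: "table", 5: "memory",
    6: "global", 7: "export", 8: "start", 9: "element", 10: "code", 11: "data",
}

def layout(present_ids, customs):
    """Return section labels in binary order.

    present_ids: set of known section ids that occur in the module.
    customs: list of (after_id, name) pairs in source order, where
             after_id 0 means "before the type section".
    """
    out = []
    for sid in range(0, 12):
        if sid in present_ids:
            out.append(KNOWN_SECTIONS[sid])
        # Custom sections anchored at this id keep their source order.
        out.extend(f"custom:{name}" for after, name in customs if after == sid)
    return out

print(layout({1, 3, 10}, [(3, "b"), (0, "a")]))
```

Note that the anchor still resolves when the named section is absent: the custom section simply lands before the next section that is present.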

Since the structure of custom sections is custom, they could only be given as raw bytes in the text format. That makes them only mildly useful for anything but low-level tests.

I agree if we assume the purpose of the text format is just to generate tests for the spec. But we're already using the text format as a way to express the contents of the binary, and AFAICT it doesn't lose much information currently. The only thing I can think of right now is the length of varint values and custom section data. Are there others?
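To illustrate the varint point: the binary format allows LEB128 integers to be padded, so several byte sequences decode to the same value, and a text format that records only the value cannot reproduce the original bytes. A minimal sketch (the function name `decode_varuint32` is chosen here for illustration, and no range checking is done):

```python
def decode_varuint32(data, offset=0):
    """Decode an unsigned LEB128 integer; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, offset
        shift += 7

# Three different encodings of the value 1.
for encoding in (b"\x01", b"\x81\x00", b"\x81\x80\x80\x80\x00"):
    value, length = decode_varuint32(encoding)
    print(encoding.hex(), "->", value, f"({length} bytes)")
```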

Also, we could make it slightly nicer than raw bytes by having a structured data format. The name section and the reloc section follow the same basic structure of other sections, using varints, strings and vectors. If we provided those primitives we could make it pretty easy to generate. They wouldn't roundtrip very nicely of course. Something like this, maybe:

(custom 12 "foo"
  (string "hello")
  (vector
    (group (varuint32 1) (f32 3.4))
    (group (varuint32 2) (bytes "12345"))
  )
)
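As a rough sketch of what such primitives might lower to, the following encodes the payload of the example above. The meaning assigned to each primitive is an assumption (`string` as a length-prefixed UTF-8 string, `vector` as a count-prefixed sequence, `f32` as little-endian IEEE 754, `group` as plain concatenation); the thread does not define them.

```python
import struct

def varuint32(n):
    """Minimal-length unsigned LEB128 encoding."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def string(s):
    raw = s.encode("utf-8")
    return varuint32(len(raw)) + raw       # length prefix, then bytes

def f32(x):
    return struct.pack("<f", x)            # 4 bytes, little-endian

def group(*fields):
    return b"".join(fields)                # fields laid out back to back

def vector(items):
    return varuint32(len(items)) + b"".join(items)  # count prefix, then items

payload = string("hello") + vector([
    group(varuint32(1), f32(3.4)),
    group(varuint32(2), b"12345"),
])
print(payload.hex())
```

The round-trip concern is visible here: decoding `payload` back into the structured form would require knowing the schema, since the bytes alone do not say where a group ends.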

RE: annotations

Agreed, annotations would be useful. I believe @yurydelendik was suggesting something like this before, maybe he has some thoughts about it. And you're right, I think we could handle custom sections in a structured way doing this. But I'd like to see a way to handle a custom section that has unstructured data, or one that is unknown to the parser too.

rossberg commented 6 years ago

@binji

All known sections have to be ordered, so you could just use a number to specify which known section it comes after. Something like (custom 0 ...) would come before the type section (1). (custom 3 ...) would go after the function section (3) and before the table section (4).

That would be rather brittle and expose low-level details of the binary encoding. In particular, we have assumed that we may insert new sections anywhere in future extensions of the binary format, so a numeric scheme is not future-proof.

If we could adopt @lukewagner's suggestion of eliminating free placement of custom sections then I'd feel more comfortable, but I'm not sure how realistic that is.

The only thing I can think of right now is the length of varint values and custom section data. Are there others?

No, none that I'm aware of.

Also, we could make it slightly nicer than raw bytes by having a structured data format. The name section and the reloc section follow the same basic structure of other sections, using varints, strings and vectors. If we provided those primitives we could make it pretty easy to generate. They wouldn't roundtrip very nicely of course. Something like this, maybe:

(custom 12 "foo"
  (string "hello")
  (vector
    (group (varuint32 1) (f32 3.4))
    (group (varuint32 2) (bytes "12345"))
  )
)

That would be cute. But I immediately worry about this becoming an open-ended DSL without ever eliminating bias towards known custom sections.

Agreed, annotations would be useful. I believe @yurydelendik was suggesting something like this before, maybe he has some thoughts about it. And you're right, I think we could handle custom sections in a structured way doing this. But I'd like to see a way to handle a custom section that has unstructured data, or one that is unknown to the parser too.

Sure thing, we can simply support (@custom "name" "contents") etc as a generic fallback. AFAICS, that could subsume the suggestion above.
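For reference, a generic `(@custom "name" "contents")` fallback would need very little from a tool, because the binary layout of a custom section is fixed: section id 0, the section size, the name as a length-prefixed UTF-8 string, then uninterpreted bytes. A minimal sketch (the function names are illustrative):

```python
def varuint32(n):
    """Minimal-length unsigned LEB128 encoding."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def custom_section(name, contents):
    """Encode a custom section: id 0, size, name, raw contents."""
    name_bytes = name.encode("utf-8")
    body = varuint32(len(name_bytes)) + name_bytes + contents
    return b"\x00" + varuint32(len(body)) + body

print(custom_section("foo", b"bar").hex())
```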

binji commented 6 years ago

That would be rather brittle and expose low-level details of the binary encoding. In particular, we have assumed that we may insert new sections anywhere in future extensions of the binary format, so a numeric scheme is not future-proof.

Right, I forgot that new known sections may not be ordered. I think it will still work, though. If we assume that all known sections can occur only 0 or 1 times, as is currently true, then it doesn't seem like this is a problem. The number can just mean which section the custom section is before in the given module. If the section doesn't occur in the module, we could say that the text for that section is invalid. If we decide later that a known section can occur more than once, we can extend the text format at the same time to indicate which section we mean. And if using a number is gross/ugly, we can always use the names given in the spec:

(@custom "foo" (after import) "...")
(@custom "bar" (before data) "...")

If we did this, we'd probably also want to require that you can't specify sections out of order. Not so sure about the before/after thing either, but it's easy to understand and allows all placements.
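One way to picture the before/after scheme and the out-of-order check is to map each placement to a "slot" between known sections. The sketch below assumes the MVP section names and ids; `slot` and `check_order` are hypothetical helpers, not an existing API.

```python
SECTION_IDS = {
    "type": 1, "import": 2, "function": 3, "table": 4, "memory": 5,
    "global": 6, "export": 7, "start": 8, "element": 9, "code": 10, "data": 11,
}

def slot(place):
    """Slot k is the gap after known section k and before section k + 1."""
    kind, section = place
    sid = SECTION_IDS[section]
    return sid if kind == "after" else sid - 1

def check_order(placements):
    """Reject custom sections whose placements go backwards in source order."""
    slots = [slot(p) for p in placements]
    if any(a > b for a, b in zip(slots, slots[1:])):
        raise ValueError("custom sections specified out of order")
    return slots

print(check_order([("after", "import"), ("before", "data")]))
```

Under this reading `(after code)` and `(before data)` name the same slot, which is one of the redundancies the scheme accepts in exchange for being easy to read.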

If we could adopt @lukewagner's suggestion of eliminating free placement of custom sections then I'd feel more comfortable, but I'm not sure how realistic that is.

It probably isn't used much, but I would prefer not to break compatibility over it.

Sure thing, we can simply support (@custom "name" "contents") etc as a generic fallback. AFAICS, that could subsume the suggestion above.

Right, this covers everything, it just is inconvenient.

lukewagner commented 6 years ago

It probably isn't used much, but I would prefer not to break compatibility over it.

In addition to text-format motivations, there's also the fact that if it's an infrequently used feature, it will be undertested and likely to have problems in practice. I know we've had specific bugs about custom sections in weird places.

Maybe worth putting discussion/poll on CG agenda?

eholk commented 6 years ago

At the most recent CG meeting, we had some opposition to the idea of requiring custom sections to be at the end. The reason is that some use cases for custom sections involve informing later stages of the compilation pipeline. For example, tools might want to provide extra hints (which functions should be compiled first, which locals should get registers, etc.) that VMs could optionally consume. In this case, we'd want to read the hints before we start streaming compilation of the code.
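The streaming argument can be made concrete with a small section scanner: a consumer that reads sections in order sees a custom section placed before the code section first, and can act on it before compiling. This is a sketch only; the section name "hints" and its one-byte payload are made up, and the tiny module below is not validated (it has a code section but no function section).

```python
def read_varuint32(data, pos):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7

def iter_sections(module):
    """Yield ("custom", name) or ("known", id) in the order they appear."""
    if module[:4] != b"\x00asm":
        raise ValueError("not a WebAssembly binary")
    pos = 8  # skip the 4-byte magic and the 4-byte version
    while pos < len(module):
        section_id = module[pos]
        size, pos = read_varuint32(module, pos + 1)
        body = module[pos:pos + size]
        pos += size
        if section_id == 0:
            name_len, name_pos = read_varuint32(body, 0)
            yield ("custom", body[name_pos:name_pos + name_len].decode("utf-8"))
        else:
            yield ("known", section_id)

HEADER = b"\x00asm\x01\x00\x00\x00"
HINTS = b"\x00\x07\x05hints\x01"   # hypothetical custom section "hints"
CODE = b"\x0a\x01\x00"             # code section with zero function bodies

print(list(iter_sections(HEADER + HINTS + CODE)))
```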

binji commented 6 years ago

First pass proposal overview for custom sections in text format: https://gist.github.com/binji/d1cfff7faaebb2aa4f8b1c995234e5a0

binji commented 6 years ago

I've updated the gist after some feedback. Sorry I didn't notice this earlier, it seems that gist comments don't show up in my notifications (or I missed them).

Pauan commented 6 years ago

@binji GitHub doesn't send notifications for Gist comments, it's very annoying.

AndrewScheidecker commented 5 years ago

I prototyped something similar to @binji's proposed syntax, but an issue I ran into is that it can express more information about the section ordering than the binary format. For example, a binary module with no data segments cannot distinctly encode (@custom (after code)) and (@custom (after data)). I can't think of a nice way to solve that problem without adding explicit order information to binary custom sections.
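The ambiguity described here can be reproduced with a toy model of the binary layout. The sketch assumes the MVP section order and a hypothetical `binary_layout` helper that inserts one custom section relative to a named anchor; absent sections simply produce no output.

```python
ORDER = ["type", "import", "function", "table", "memory", "global",
         "export", "start", "element", "code", "data"]

def binary_layout(present, place, name="x"):
    """Return section labels in binary order for one custom section."""
    kind, anchor = place
    out = []
    for section in ORDER:
        if kind == "before" and section == anchor:
            out.append(f"custom:{name}")
        if section in present:
            out.append(section)
        if kind == "after" and section == anchor:
            out.append(f"custom:{name}")
    return out

no_data = {"type", "function", "code"}
with_data = no_data | {"data"}

# Without a data section the two placements give the same binary order.
print(binary_layout(no_data, ("after", "code")) == binary_layout(no_data, ("after", "data")))
# With a data section they differ.
print(binary_layout(with_data, ("after", "code")) == binary_layout(with_data, ("after", "data")))
```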

rossberg commented 5 years ago

@AndrewScheidecker, you might want to discuss this over at the annotations proposal, which contains a more up-to-date and complete definition of custom section annotations.

To reply to your comment, though, I am not sure why you consider this a problem. There are many examples of the text format being able to express the same binary in multiple ways. How is this different?

The goal of these multiple forms is not to provide a unique way of describing placement, but to let you place something reliably, in a fashion that is agnostic to the actual absence or presence of particular sections. So this is working as intended. You pick the placement that is correct in the presence of all sections, but it will also work fine if a respective section happens to be absent. You don't have to worry about which case you're in.

AndrewScheidecker commented 5 years ago

If it is useful to express ordering constraints relative to virtual sections that may or may not be present in the binary module, then it must be worthwhile to encode those constraints in the binary module somehow.

Imagine that some compiler produces a WASM object file with a custom section that needs to be ordered between the code and data sections, but that module does not contain a code section. If you want to link that object file with another that does have a code section, then you need some additional metadata (or knowledge of that particular custom section) to ensure that the custom section ends up after that code section and not before it in the linked WASM module.

There's no text format involved here, but this scenario would benefit from being able to express the ordering constraints relative to virtual sections that are proposed here for the text format only.

It's true that there's other information in the text format that is not present in the abstract syntax and binary format, but the stuff I can think of is all trivia: the interleaving of definitions of different kinds, function types that aren't explicitly declared up front, comments, whitespace, expression vs instruction syntax, etc.

rossberg commented 5 years ago

If it is useful to express ordering constraints relative to virtual sections that may or may not be present in the binary module, then it must be worthwhile to encode those constraints in the binary module somehow.

I don't think that follows. You shouldn't think of placements as a restrictive mechanism but a descriptive one.

But more importantly, as you say, this has nothing to do with the text format. Your complaint is about the design of the binary format itself.

But that is an inherent and unsolvable (and known) problem with the notion of custom data. It is true that a generic tool dealing with unfamiliar custom sections cannot know how to handle them correctly. But that is a much more general problem. To be correct, a linker might need to combine or modify certain custom sections, but by their nature of being custom, it generally has no way of knowing if or how. Their placement probably is the smallest problem such a tool faces. There is no solution to this.

It's true that there's other information in the text format that is not present in the abstract syntax and binary format, but the stuff I can think of is all trivia: the interleaving of definitions of different kinds, function types that aren't explicitly declared up front, comments, whitespace, expression vs instruction syntax, etc.

Function type desugaring in particular is way more complicated. ;)

AndrewScheidecker commented 5 years ago

I don't think that follows. You shouldn't think of placements as a restrictive mechanism but a descriptive one.

But more importantly, as you say, this has nothing to do with the text format. Your complaint is about the design of the binary format itself.

My complaint is not about only the binary format, or only the text format, but about a mismatch between them. :)

What I'm doing for now is to restrict the text format to prohibit specifying ordering relative to virtual sections that are not present according to some predicate defined on the abstract syntax. When decoding a binary module, empty sections (or sections that may not be present according to the abstract syntax predicate) are ignored for purposes of inferring the custom section order.

With those changes, I can round-trip custom sections ast->text->ast and ast->binary->ast.
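A rough sketch of that restriction, under stated assumptions: the abstract module is modelled as a dict from section name to a list of items, the presence predicate is "the list is non-empty", and a decoded binary is a list of (name, item_count) pairs. All names here are hypothetical and do not describe the actual prototype.

```python
def is_present(module, section):
    """Assumed predicate: a section exists iff it has at least one item."""
    return len(module.get(section, [])) > 0

def check_placement(module, place):
    """Text side: reject placement relative to a section that is absent."""
    kind, anchor = place
    if not is_present(module, anchor):
        raise ValueError(f"cannot place custom section {kind} absent section '{anchor}'")
    return place

def infer_placement(layout, index):
    """Binary side: anchor the custom section at layout[index] to the
    nearest preceding non-empty known section, ignoring empty ones.
    Returns None when no such section precedes it."""
    for name, count in reversed(layout[:index]):
        if name != "custom" and count > 0:
            return ("after", name)
    return None

module = {"type": ["t0"], "code": ["f0", "f1"], "data": []}
print(check_placement(module, ("after", "code")))

decoded = [("type", 1), ("code", 2), ("data", 0), ("custom", 0)]
print(infer_placement(decoded, 3))
```

With both rules in place, the empty data section never becomes an anchor, so text and binary agree on a single placement for the custom section.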

Function type desugaring in particular is way more complicated. ;)

The desugaring is non-trivial, but the additional information in the text format is "trivia" in the sense that it doesn't affect the meaning of the program.

binji commented 4 years ago

I think I see what you're saying @AndrewScheidecker. I agree it would be better to continue the discussion on the annotations proposal, however. Would you mind opening a new issue there instead? We haven't done much work on that recently, but if someone picked it up, I wouldn't want this concern to fall through the cracks.