WebAssembly / tool-conventions

Conventions supporting interoperatibility between tools working with WebAssembly.

Artistic License 2.0

306 stars 66 forks source link

Code Annotations Framework #167

Closed yuri91 closed 3 years ago

yuri91 commented 3 years ago

As discussed in CG-08-17 I am opening this issue to present a draft of a Code Annotations Framework.

Link to the slides of my presentation: https://drive.google.com/file/d/17z97_p_TNkrJt0p-ffvHqLHjOlj2FgNt/view?usp=sharing

I am opening an issue and not directly a PR because there is a decision to make, and I would like to have some suggestions on which option to pick.

The decision is about which binary format to use. In the presentation I described only one of two options, with different pros and cons. Here are both of them:

Option 1: Flat absolute offsets (offsets are relative to the start of the code section)


codeannotationsec ::= vec(codeannotation)

codeannotation ::= codepos: u32
                   kind: u32
                   data: vec(byte)

Option 2: Nested relative offset (offsets are relative to the start of the function body)

codeannotationsec ::= vec(funcannotations)

funcannotations ::= idx: funcidx
                    vec(annotation)

annotation ::= funcpos: u32
               kind: u32
               data: vec(byte)

Pros option 1:

simpler
same absolute offsets as used in debug traces

Pros options 2:

less things to edit when modifying the modules
- example: a linker will need to relocate each annotation entry instead of just each function subsection.
easier to produce in some cases
- real example: In CheerpX a jitted function is produced alongside its branch hints, and later "assembled" in a module. With option 2 assembling the annotation section is just a matter of writing the final index of the function and append the pre-produced list of hints. With option 2 the list of hints must be updated by adding the code offset of the start of the function to all the entries.

Both options can be supported by linkers with existing relocation kinds.

The following is a draft of the document assuming option 1 (again, for semplicity):

Code Annotations Framework

This document describe a convention for encoding a number of WebAssembly features (here generically called "code annotations") that don't affect the module semantics. The goal of this convention is to make it easier for tools that work on WebAssembly modules to handle and preserve such code annotations.

Note: This is a work in progress

Code Annotation

A code annotation is a piece of metadata attached to a position in the code section.

Its meaning depends on the particular type of annotation.

Discarding a code annotation does not change the semantics of the module.

A tool that transform the module must drop any code annotation that it does not know or know it can’t preserve.

Binary Representation

Each type of code annotation is encoded in a custom section named code_annotation.<type>.

Such sections have the following binary format:

codeannotationsec ::= vec(codeannotation)

codeannotation ::= codepos: u32
                   kind: u32
                   data: vec(byte)

Where codepos is an offset in the code section, kind is an enumeration value whose meaning depends on the section type, and data is a vector with a further payload, whose content depends on the section type.

codeannotation entries must appear in order of increasing codepos, and duplicate codepos values are not allowed.

Text Representation

Code annotations are representend in the .wat format using custom annotations, as follows:

(@code_annotation.<type> kind:i32 data:str)

The kind and data fields correspond to the fields with the same name in the binary representation. The code position is implicit and it is derived by the position of the annotation:

Custom annotations can appear anywhere , but code annotations are allowed only before function definitions (in which case the code position is the offset of the start of the function body in the code section), or before an instruction (in which case the code position is the offset of the instruction in the code section).

Example:

(module
  (type (;0;) (func (param i32 result i32)))
  (func (@code_annotation.hotness 1) $test (type 0)
    (@code_annotation.branch_hints 1) if
      i32.const 0
      local.set 0
    end
    local.get 1
    (@code_annotation.with_payload 10 "aaa\13bb") return
  )
)

Code annotation types

Currently the following type of code annotations are defined:

Branch Hints

section name: code_annotation.branch_hints
kind values:
- 0: branch likely not taken
- 1: branch likely taken
optional data: none

Branch hints can appear only before a if or br_if instruction, and are considered attached to it. Code transformations that remove the instruction should remove the associated annotation, and transformations that flip the direction of the branch should preserve the hint but flip the kind.

yuri91 commented 3 years ago

@rrwinterton @jacob-abraham

dschuff commented 3 years ago

Overall I like this.

It does not affect the semantics of the module, and can be discarded.

A stronger (better?) way to say this is that the module must have the same semantics if the section is dropped. LLVM's spec for metadata also says that a transform must drop metadata that it doesn't know or can't preserve (because as you mentioned in the presentation, tools that modify the code can't safely propagate it). Probably this doc should also say something to that effect.

Where codepos is an offset in the code section, kind is an enumeration value whose meaning depends on the section kind, and data is a vector with a further payload, whose content depends on the section type.

I think "section kind" and "section type" in this sentence mean the same thing, right? Both the meaning of the annotation kind field and the data contents depend on the section kind/type?

The text format looks fine to me. Wrt the binary format: Option 1 is certainly more consistent with other similar uses such as debug info. It would probably be simpler for LLVM to generate (e.g. currently in LLVM function and code section symbol values are code section offsets). e.g. for option 2, I'd think LLVM would have to generate some kind of a symbol difference expression (i.e. the current instruction offset minus the function offset). I think in principle that could be resolved at assembly time rather than leaving behind a relocation, but I'm not sure: perhaps @sbc100 or @aardappel (or you) would know. But it does seem that option 2 could have a benefit for linking time, if we could avoid having a relocation for every branch.

I expect in Binaryen we could maybe use a similar technique to what we do for .debug_info (i.e. track instruction movements pre-and post-optimization, and then rewrite the metadata). Or if we were going to actually optimize based on the annotations (or create new annotations) we could do what we do with .debug_line and have some kind of data attached directly to the instruction in the IR, and generate the binary annotations afresh when writing the binary. That seems like it would be about the same either way (/cc @kripken)

jabraham17 commented 3 years ago

I really like this approach.

As far as Option 1 vs Option 2, I think that ultimately Option 2 has more pros. Option 1's main pro is that it is simpler to implement. I am not sure how much extra effort it would take to implement the necessary framework to allow function offsets in something like LLVM. Originally, for the instruction tracing prototype, we looked at doing function index and offset, but actually changed to code section offset because of this limitation in the tools we were using. This was a concession we made to get a working protoype quicker. Ultimately, I feel the benefits of Option 2, using function indexes, will prove better overall. This way if a function is not modified, we don't need to adjust all offsets for it.

I agree with @dschuff, I like how the text format is laid out.

yuri91 commented 3 years ago

Thanks @dschuff for looking at this!

A stronger (better?) way to say this is that the module must have the same semantics if the section is dropped. LLVM's spec for metadata also says that a transform must drop metadata that it doesn't know or can't preserve (because as you mentioned in the presentation, tools that modify the code can't safely propagate it). Probably this doc should also say something to that effect.

Right, I didn't think about it but actually llvm metadata and code annotations are similar (even though code annotations are more limited). I changed the text to include your suggestion here.

I think "section kind" and "section type" in this sentence mean the same thing, right? Both the meaning of the annotation kind field and the data contents depend on the section kind/type?

Yes, it was a typo. I am also totally open to change terminology for both "section type" and "kind" inside the annotation to something more clear, if somebody has ideas.

Option 1 is certainly more consistent with other similar uses such as debug info. It would probably be simpler for LLVM to generate (e.g. currently in LLVM function and code section symbol values are code section offsets). e.g. for option 2, I'd think LLVM would have to generate some kind of a symbol difference expression (i.e. the current instruction offset minus the function offset). I think in principle that could be resolved at assembly time rather than leaving behind a relocation, but I'm not sure: perhaps @sbc100 or @aardappel (or you) would know. But it does seem that option 2 could have a benefit for linking time, if we could avoid having a relocation for every branch.

In Cheerp we have a different backend for emitting Wasm code, so I don't know how easy it would be to use relative offsets in LLVM upstream.

I also see an additional benefit of option 2 that I didn't think about before:

Option 1 would use the relocation kind R_WASM_SECTION_OFFSET_I32 , which is not in LEB format. Option 2 would use R_WASM_FUNCTION_INDEX_LEB for the function index, and nothing for the offset. So the offset can be in LEB format. This makes Option 2 quite more compact in size on average. Of course we could add a new R_WASM_SECTION_OFFSET_LEB for option 1 potentially.

@tlively was against a variation of option 2 in another discussion if I recall correctly. Could you give us your thoughts on this?

I expect in Binaryen we could maybe use a similar technique to what we do for .debug_info (i.e. track instruction movements pre-and post-optimization, and then rewrite the metadata). Or if we were going to actually optimize based on the annotations (or create new annotations) we could do what we do with .debug_line and have some kind of data attached directly to the instruction in the IR, and generate the binary annotations afresh when writing the binary. That seems like it would be about the same either way (/cc @kripken)

In Cheerp we use the second strategy, and I think it is in general the easiest way for a tool to update annotation offsets (but I don't know much about Binaryen internals).

@jacob-abraham

Ultimately, I feel the benefits of Option 2, using function indexes, will prove better overall. This way if a function is not modified, we don't need to adjust all offsets for it.

That is my feeling too. I prefer Option 2, unless it proves really annoying to implement in LLVM.

tlively commented 3 years ago

If folks are willing to put in the effort to implement option 2 in LLVM and the implementation is not too complicated, then that sounds fine to me. My previous argument in favor of option 1 was only for the sake of reusing existing LLVM machinery.

dschuff commented 3 years ago

I also see an additional benefit of option 2 that I didn't think about before:

Option 1 would use the relocation kind R_WASM_SECTION_OFFSET_I32 , which is not in LEB format. Option 2 would use R_WASM_FUNCTION_INDEX_LEB for the function index, and nothing for the offset. So the offset can be in LEB format. This makes Option 2 quite more compact in size on average. Of course we could add a new R_WASM_SECTION_OFFSET_LEB for option 1 potentially.

I think this isn't necessarily true just because of LEBs. When the compiler encodes relocatable LEBs in an object file, there's a potential problem that the encoded number after linking can be larger, and therefore require more bytes to encode (e.g. if the LEB represents a function, and there are < 256 functions in the object file but more in the final binary). So in order to avoid this problem the compiler usually pads these LEBs out to the max of 5 bytes, which makes them larger than I32s. IIUC the linker has the ability to compress these LEBs to the appropriate size after link time (although come to think of it, that can change the section offsets further and could break non-relocatable relative offsets; I'm not sure how or whether we handle that). But you are right that it should be pretty straightforward to add a reloc type that's the same as another one but LEB instead of I32.

yuri91 commented 3 years ago

I think this isn't necessarily true just because of LEBs. When the compiler encodes relocatable LEBs in an object file, there's a potential problem that the encoded number after linking can be larger, and therefore require more bytes to encode (e.g. if the LEB represents a function, and there are < 256 functions in the object file but more in the final binary). So in order to avoid this problem the compiler usually pads these LEBs out to the max of 5 bytes, which makes them larger than I32s

Yes, but in general LLVM doesn't optimize Wasm for size, and people are expected to run wasm-opt before production, which will be able to take advantage of the small LEBs. Other toolchains can also directly produce the smallest LEBs directly.

IIUC the linker has the ability to compress these LEBs to the appropriate size after link time (although come to think of it, that can change the section offsets further and could break non-relocatable relative offsets; I'm not sure how or whether we handle that).

This is definitely something to keep in mind when adding support to lld, for both options. Does it work for offsets in debug info?

sbc100 commented 3 years ago

I think this isn't necessarily true just because of LEBs. When the compiler encodes relocatable LEBs in an object file, there's a potential problem that the encoded number after linking can be larger, and therefore require more bytes to encode (e.g. if the LEB represents a function, and there are < 256 functions in the object file but more in the final binary). So in order to avoid this problem the compiler usually pads these LEBs out to the max of 5 bytes, which makes them larger than I32s

Yes, but in general LLVM doesn't optimize Wasm for size, and people are expected to run wasm-opt before production, which will be able to take advantage of the small LEBs. Other toolchains can also directly produce the smallest LEBs directly.

IIUC the linker has the ability to compress these LEBs to the appropriate size after link time (although come to think of it, that can change the section offsets further and could break non-relocatable relative offsets; I'm not sure how or whether we handle that).

This is definitely something to keep in mind when adding support to lld, for both options. Does it work for offsets in debug info?

No, this optimization breaks debug info:

https://lld.llvm.org/WebAssembly.html#cmdoption-compress-relocations

--compress-relocations
Relocation targets in the code section are 5-bytes wide in order to potentially accommodate
the largest LEB128value. This option will cause the linker to shrink the code section to remove
any padding from the final output. However because it affects code offset, this option is not
compatible with outputting debug information.

aardappel commented 3 years ago

Maybe this isn't the appropriate place for this discussion, but I am really not a fan of a feature like this in the first place. Having to update locations in a secondary representation when modifying/optimizing code is very painful, and makes many tools harder to write correctly, and many chains of tools potentially lossy.

Once a feature like this exist, I bet we'll see many future proposals using it, as it is just easier than modifying the core instructions, meaning we'll rely on it more heavily, and it won't be so optional anymore, meaning tools not preserving it correctly will be more of a problem. And the more features use it, the more space-inefficient it will be, since encoding a single flag this way takes many more bytes than a single byte in the instruction stream.

So I wish people would instead try harder to make these features into extensions of existing instructions, or new instructions.

That said, if we're going to have this feature:

I don't think we need the kind field at all, since the section name uniquely identifies the feature, so the kind data may as well be simply the first byte of the data field? This is more generic. What if an annotation doesn't have a kind, but just wants to store a single float? etc.

As for 1 vs 2.. I'd lean towards 2 simply because it be nice to not have to update functions that aren't modified in some tools, and tools may find it more natural to deal with function offsets. The big downside of 2 may well be that if on average an annotation appears about once per function you have just increased the binary size once more with all these funcannotation records.

tlively commented 3 years ago

Oh good point about removing kind. I believe my originally proposed design had kind because I wanted all the annotations to go into a single section, but if we'd rather have separate sections for different types of annotations, then I agree that kind is not useful.

And FWIW, I think it's likely that many annotations will appear exactly once per function because they apply to entire functions.

jabraham17 commented 3 years ago

So I wish people would instead try harder to make these features into extensions of existing instructions, or new instructions.

This has been discussed in the CG, many of the extensions that are being considered for the annotation scheme were originally proposed as new instructions, and the consensus from the CG was they should be encoded as custom sections. Although it might add a larger overhead to tool creation, I think it provides a great way to extend features of WebAssembly that may not warrant new instructions but provide wanted/needed functionality.

I also agree about kind, It can easily be encoded in the data vector. I don't think that this will result in major space savings, but it does simplify the binary format. This also pushes more of the responsibility of adding meaning to these annotations to the end user of the feature, whether that is some kind of debugging tool or a runtime, which I see as an additional bonus.

yuri91 commented 3 years ago

Having to update locations in a secondary representation when modifying/optimizing code is very painful, and makes many tools harder to write correctly, and many chains of tools potentially lossy.

The point of this document is indeed to minimize this risk by having a common format.

Oh good point about removing kind. I believe my originally proposed design had kind because I wanted all the annotations to go into a single section, but if we'd rather have separate sections for different types of annotations, then I agree that kind is not useful.

I have a reason for keeping kind, but I think I didn't explain it well, and probably there is a better way. Anyway:

All the features that came up that could use code annotations, real (branch hinting, instrument&tracing) or hypotetical (compilation hints, inlining hint, ...?) are perfectly fine with just an integer value as payload. The data in my mind is just for future-proofing in case a more complex annotation will need more space. Using the data binary blob for the simple features makes them more opaque to tools, and more annoying for the text format.

Example: take instrument and tracing. With the current format using kind, a text annotation could be:

(@code_annotation.instrument 312) nop

Withoutkind, assuming that the argument is encoded as uleb, we get:

(@code_annotation.instrument "\xb8\x02") nop

Which is pretty obscure and ugly.

Moreover, if an annotation does not use data, a tool can infer that it does not affect / is not affected by non-local stuff (example: an annotation using data may encode a function index in it, which must be updated accordingly. One with empty data is only referencing its surrounding code).

Maybe a better way of doing the same thing would be a format like this:

codeannotationsec ::= vec(funcannotations)

funcannotations ::= idx: funcidx
                    vec(annotation)

annotation ::= funcpos: u32
               payload: subannotation_N 

subannotation_N(B) ::= N:byte
                      size:u32
                      B

with initially only a subsection_0(u32) defined. (Sorry if the formal notation is incorrect, I am trying to convey something similar to the subsections of the name section).

The previous text format example would become the following:

(@code_annotation.instrument 0 312) nop

tlively commented 3 years ago

Moreover, if an annotation does not use data, a tool can infer that it does not affect / is not affected by non-local stuff (example: an annotation using data may encode a function index in it, which must be updated accordingly. One with empty data is only referencing its surrounding code).

I don't think it would be safe to assume anything about unknown annotations, even if they have empty data fields. If we agree that tools can only safely update code annotations they know about, I don't see any benefit in pulling a single byte of their data out into a separate field.

yuri91 commented 3 years ago

I don't think it would be safe to assume anything about unknown annotations, even if they have empty data fields. If we agree that tools can only safely update code annotations they know about, I don't see any benefit in pulling a single byte of their data out into a separate field.

Well, we are here to decide what code annotations are, so we could just mandate this behavior.

But I am also ok with a lighter approach, and just attaching a bunch of bytes to a position in the code. Any annotation will then decide on its own how the bytes are interpreted.

One thing that I still don't like much is that the text representation will have to present the raw bytes, so a branch hint will look like:

(@code_annotation.branch_hints "\01") if

instead of

(@code_annotation.branch_hints 1) if

but I can live with that.

tlively commented 3 years ago

I propose that individual uses of this framework should feel free to define more descriptive alternative text formats for their data and that the raw byte string text format should primarily be used as a fallback for encoding unknown annotations.

So for branch hints, I would want both of those formats to be available, but for the more readable one to be preferred by tools that understand the branch hints feature.

yuri91 commented 3 years ago

I was looking at how llvm produces DWARF sections for wasm, to have an idea of what it would take to implement relative offsets as in option 2.

Am I wrong or are there already function-relative code offsets used? In the .debug_loc section I see entries like these in the .s files:

.Ldebug_loc0:
    .int32  .Lfunc_begin0-.Lfunc_begin0
    .int32  .Ltmp3-.Lfunc_begin0
    .int16  4                               # Loc expr size
    .int8   237                             # DW_OP_WASM_location
    .int8   0                               # 0
    .int8   0                               # 0
    .int8   159                             # DW_OP_stack_value
    .int32  0
    .int32  0

.Ltmp3-.Lfunc_begin0 seems to be a relative offset, expressed as a symbol difference, like @dschuff proposed.

I can't say I really grasp how the pipeline for generating these works, but I suppose that something similar could be done for code annotations.

If that is the case, I think that option 2 is the way to go.

aardappel commented 3 years ago

DW_OP_WASM_location so far indicate variable locations (locals and globals) not code locations. The relocation that may happen on that data is so far only for globals, see WebAssembly.h:

enum TargetIndex {
  // Followed by a local index (ULEB).
  TI_LOCAL,
  // Followed by an absolute global index (ULEB). DEPRECATED.
  TI_GLOBAL_FIXED,
  // Followed by the index from the bottom of the Wasm stack.
  TI_OPERAND_STACK,
  // Followed by a compilation unit relative global index (uint32_t)
  // that will have an associated relocation.
  TI_GLOBAL_RELOC,
  // Like TI_LOCAL, but indicates an indirect value (e.g. byval arg
  // passed by pointer).
  TI_LOCAL_INDIRECT
};

yuri91 commented 3 years ago

I dug deeper into how to implement option 2 in llvm, and I have a (very basic and incomplete) proof of concept: https://github.com/yuri91/llvm-project/tree/annotations

It just adds a dummy code annotation section with an entry for every br_if instruction, but it correctly produces relative offsets for the code offset and relocations for the function indexes, without adding any extra machinery.

I also updated the document with the suggestion of removing the kind field, and with the option 2 format. The updated version is in a PR to this repo (but we can continue discussing about it here) : https://github.com/WebAssembly/tool-conventions/pull/173

yuri91 commented 3 years ago

For people that may be interested in experimenting with code annotations (@jacob-abraham , @tlively?):

I added support for generic code annotations in Wabt: https://github.com/WebAssembly/wabt/pull/1724.

It can be used both for reading binary modules containing code annotations, and for producing them from the text format.

I find it useful for debugging stuff that produces or consumes them.

yuri91 commented 3 years ago

Since https://github.com/WebAssembly/tool-conventions/pull/173 was merged, I think we can close this. Further iterations on code annotations can live in their own issues.