sunfishcode commented 6 years ago

Funclets: Flexible Intraprocedural Control Flow

Background

Tail calls can be used to create arbitrary control flow patterns, however tail calls are difficult to implement as efficiently as intraprocedural control flow. The key difference is that in intraprocedural control flow, implementations can easily determine all the entry points into a region of code. Statically determining the callees of a function requires full interprocedural knowledge in the best case, and is undecidable in the worst.

Also, one of WebAssembly's goals is to be easy to target from a wide variety of compilers, and to avoid biasing towards one style of compiler over another. Currently WebAssembly makes one very popular style of compiler awkward, and it sometimes leads to the use of inefficient code.

And, common dynamic-language compilation techniques involve dynamically patching control flow paths within a function, which is something that WebAssembly may want to support in the future. Such functionality isn't included here, but the proposal here would be a natural base on which to design such a feature.

The following is a proposal for a solution, which carefully preserves key properties of wasm: linear-time decoding and validation, fast SSA construction, and efficient baseline compilation. The issue here is the first step in phase 0 of the CG process.

Overview

This adds a new control-flow construct called a funclet region. A funclet region contains a sequence of funclets, which are essentially small functions that can tail call each other, passing arguments on the operand stack.

Funclet regions nest within and can be nested within wasm's existing control-flow operators, so producers can use the existing control-flow constructs when they make sense, and funclet regions when they need extra flexibility.

New Opcodes

funclet_region

Field	Type	Description
`signature`	signature	the funclet region's signature, similar to a `block` signature
`num_funclets`	varuint32	the number of funclets belonging to the funclet region (and must be non-zero)

This starts a funclet region with num_funclets funclets, and starts the first funclet.

Similar to block, funclet_region adds a control flow nesting level, with branches (br, br_if, etc.) to this level branching to the end of the funclet region. The signature immediate has the same role as block's signature.

funclet_sig

This declares an explicit type signature for a funclet, which is needed if the funclet isn't branched to from above. This opcode may only appear at the beginning of a funclet.

Field Type Description

signature signature the signature for the funclet

num_preds varuint32 the number of backward funclet callers of the funclet

See the SSA construction section below for an explanation of the num_preds field.
funclet_call, funclet_call_if

Field Type Description

delta varint32 the relative index of a funclet to tail call

These are the unconditional and conditional instructions which allow one funclet in to tail-call another in the same funclet region. funclet_call is to funclet_call_if as br is to br_if.

These are only valid within a funclet region.

Field	Type	Description
`signature`	signature	the signature for the funclet
`num_preds`	varuint32	the number of backward funclet callers of the funclet

Field	Type	Description
`delta`	varint32	the relative index of a funclet to tail call

funclet_call_table

Field	Type	Description
target_count	varuint32	number of entries in the target_table
target_table	varint32*	target entries that indicate the relative index of a funclet to tail call
default_target	varint32	the relative index of a funclet to tail call in the default case

funclet_call_table is to funclet_call_if as br_table is to br_if.

Funclets

A funclet is ended by an end or by any unconditional control transfer (br, br_table, return, funclet_call, funclet_call_table, or unreachable), after which the next funclet begins. A funclet region ends when the last of its funclets ends.

When a funclet performs a funclet call or an end to another funclet, all values on the stack above the point where the funclet region was entered are the arguments in the tail call to the callee. In the case of an end in the last funclet in a funclet region, the funclet region exits in the same manner as block.

Funclets don't have local variables of their own, but they can access the local variables of the containing function.

Linear-time Validation

The main requirement for any binary code format to have linear-time type checking is to ensure that when decoding reaches a control-flow merge point, the types of all variables entering the merge point are known.

Every funclet has a signature, which contains just argument types, and no results, because control transfers between funclets are always tail calls.

For now, if the first funclet in a funclet region has no funclet_sig, it has no arguments. This may be generalized to match the proposal to make blocks accept arguments (which is part of the multi-value proposal).

When validation reaches a funclet called only from above, and there's at least one call from above, then all the incoming funclet calls will have been seen, so the type can be inferred.

When validation reaches a funclet that can be called from below, or is not reachable from any call, it must begin with a funclet_sig instruction, which provides a signature as an immediate value.

SSA Construction

SSA construction within funclet regions can be performed using the well-known "Simple and Efficient Construction of Static Single Assignment Form" algorithm by Braun et al.

funclet_sigs contain an immediate value of the number of backwards callers, to allow funclets to be "sealed" (using terminology from the paper) as soon as the last caller of a funclet is processed. This allows for greater on-the-fly optimization.

Implementation Options

While some implementations may implement flexible intraprocedural control flow directly, it would also be feasible to translate funclets as individual functions using regular interprocedural tail calls, as an implementation detail. Funclets need to be able to access the local variables of the enclosing function, however this can be arranged either by passing them as arguments, or by passing a closure pointer as a hidden argument.

Examples

A very simple loop:

    funclet_region 1  # start a funclet region with 1 funclet
      funclet_sig     # the top of the loop can be called from below
      ...
      funclet_call_if 0 # loop backedge, calls to funclet 0 (the same funclet, so the delta is 0)
      end

A very simple control flow diamond (if-else):

    funclet_region 2    # start a funclet region with 2 funclets
      ...               # define the if condition value
      funclet_call_if 1 # test the condition, call to the false arm or fall through to the true arm
      ...               # the true arm (still in the first funclet)
      br 0              # branch to the `end`. This ends the first funclet, the second follows.
      ...               # the false arm, starting the second funclet
      end

A wasm if-else nested inside a funclet region. Note that nested constructs nest inside of funclets, so this only uses a single funclet:

    funclet_region 1  # start a funclet region with 1 funclet
      ...
      if
      ...
      else
      ...
      end             # end the if-else
      ...
      end             # end the funclet region

An example with an interesting funclet region signature:

    funclet_region 2 (result i32) (result f32) # start a funclet region with 2 funclets, and two results
      ...
      funclet_call_if 1 # start a CFG diamond
      ...               # the true arm
      i32.const 2       # arguments to pass to the tail call
      f32.const 3.0
      br 0              # exit the funclet region
      ...               # funclet 1, the false arm
      i32.const 4       # validation can now check the argument types
      f32.const 5.0
      end

An example with an interesting funclet signature called from above:

    funclet_region 3 (result i32) (result f32) # start a funclet region with 2 funclets, and two results
      ...
      funclet_call_if 1 # start a CFG diamond
      ...               # the true arm
      i32.const 2       # arguments to pass to the tail call
      f32.const 3.0
      funclet_call 1    # call funclet 2
      ...               # funclet 1, the false arm
      i32.const 4       # validation can now check the argument types
      f32.const 5.0
      end               # call funclet 2
      ...               # funclet 2, takes two arguments
      end

An example with an interesting funclet signature called from below:

    funclet_region 3  # start a funclet region with 3 funclets
      funclet_call 2  # call funclet 2
      funclet_sig (param i32) (param f32)   # funclet signature
      ...
      br 0            # exit the funclet region
      i32.const 2     # arguments to pass to the tail call
      f32.const 3.0
      funclet_call -1  # call funclet 1, passing it two arguments

aardappel commented 6 years ago

I love how elegantly this integrates with the existing wasm structured control flow. I think this would be a pretty important addition to support a wider range of languages and wasm producers.

kumpera commented 6 years ago

@sunfishcode Out of curiosity, could you explain how you envision this proposal being later extended to the following?

And, common dynamic-language compilation techniques involve dynamically patching control flow paths within a function, which is something that WebAssembly may want to support in the future. Such functionality isn't included here, but the proposal here would be a natural base on which to design such a feature.

lukewagner commented 6 years ago

@kumpera I prodded @sunfishcode into including this one addendum (since I think it should be a consideration for any new "lower-level control flow" construct), so I'll take a quick stab, of course welcoming additional comments from @sunfishcode.

The idea is that there'd be some "patchable" versions of wasm control flow ops that would only declare the signature of their target funclet, allowing the actual target be patched at runtime (to a signature-compatible funclet). Combined with a separate mechanism for efficiently compiling individual or small batches of funclets at runtime, this could allow classic dynamic language Inline Caching techniques. The default target of an un-patched control flow op would box the values passed via the signature and call a generic handler function which could then compile and patch in new funclets based on those values. Lastly, if the dynamically-compiled funclets themselves could contain patchable control flow, then this could allow a wide spectrum of dynamic compilation techniques (polymorphic ICs, tracing, versioning, ...). Obviously a lot of hand-waving going on here, and I don't mean to derail the immediate proposal with this farther-out idea.

binji commented 6 years ago

A funclet is ended by an end or by any unconditional control transfer (br, br_table, return, funclet_call, funclet_call_table, or unreachable), after which the next funclet begins. A funclet region ends when the last of its funclets ends.

Is there a reason for this behavior? This seems like it is reverting to a previous design for handling unreachable code than the one we ultimately went with.

When validation reaches a funclet called only from above, then all the incoming funclet calls will have been seen, so the type can be inferred.

This seems odd to me too -- we don't currently infer types anywhere, so why here?

delta - the relative index of a funclet to tail call

Why are these values relative rather than absolute?

num_preds - the number of backward funclet callers of the funclet

Can you explain a bit more why this is required? How would SSA construction be more difficult if it were not included?

I'm also curious about what would happen if you nest funclet_regions. Is it possible to reference an outer region?

sunfishcode commented 6 years ago

A funclet is ended by an end or by any unconditional control transfer (br, br_table, return, funclet_call, funclet_call_table, or unreachable), after which the next funclet begins. A funclet region ends when the last of its funclets ends.

Is there a reason for this behavior? This seems like it is reverting to a previous design for handling unreachable code than the one we ultimately went with.

When a block is exited early via a br or similar, there's no way for a front-end to declare the type state it would want following the br. With funclets, frontends can just declare the signature they want after a funclet_call or similar with a funclet_sig, right there.

That said, this does prompt me to notice that I didn't mention that unreachable funclets need a funclet_sig. I'll fix that.

When validation reaches a funclet called only from above, then all the incoming funclet calls will have been seen, so the type can be inferred.

This seems odd to me too -- we don't currently infer types anywhere, so why here?

This isn't essential, but it significantly reduces code size in common cases. This way, most funclets won't need explicit signatures.

delta - the relative index of a funclet to tail call

Why are these values relative rather than absolute?

The idea is to promote compression, in two ways:

Common localized control-flow patterns will typically use the same deltas each time they appear.
Short jumps can always use small relative immediates, regardless of their absolute position in a funclet region.

num_preds - the number of backward funclet callers of the funclet

Can you explain a bit more why this is required? How would SSA construction be more difficult if it were not included?

The linked paper has the details, but in brief, knowing how many predecessors there are for a block means you can know when you've seen its last predecessor, at which point it can be "sealed", meaning you have all the inputs to its phis and you can do more on-the-fly optimizations.

I'm also curious about what would happen if you nest funclet_regions. Is it possible to reference an outer region?

No, funclets wouldn't be able to call into outer regions -- that would upset wasm's basic nesting structure. This is what "in the same funclet region" above means to convey, though I clearly need to make that more precise :-).

AndrewScheidecker commented 6 years ago

Is it necessary for funclets to be completely distinct from the ordinary control flow operators? It seems like this proposal could be reduced to just the funclet_region and funclet_sig opcodes, with the funclet_call, funclet_call_if, funclet_call_table operators mapping to br, br_if, and br_table.

rossberg commented 6 years ago

Which in turn is equivalent to just a multi_loop statement defining multiple mutually recursive loops that can branch to each other. That can express every CFG.

The main functionality this adds is irreducible control flow. It is perhaps a bit odd that the proposal doesn't mention this, given that it represents a significant departure from the current philosophy.

sunfishcode commented 6 years ago

@AndrewScheidecker That's right; it's not strictly necessary. br etc. have unsigned immediates, and we could choose to interpret the immediates as absolute funclet indices, for example. The main diffference of what I've described above is that it allows us to have signed relative immediates, which I expect will be more compressible. However, we can make changes as the proposal progresses and we learn more.

@rossberg Functions that tail call each other can also express every CFG, including irreducible CFGs. The proposal here just proposes a form of tail calling that supports different implementation techniques. Concerns I'm aware of are that we want to preserve linear-time validation, support fast SSA construction, and provide options for implementations that aren't able to represent this kind of control flow directly. I've addressed these concerns in the proposal. Are there other concerns?

rossberg commented 6 years ago

VMs don't typically build inter-function CFGs (or only in limited circumstances), so I don't view tail calls as being in the same category.

IIRC, last time this was discussed the biggest concern was that some compiler algorithms or their implementations in existing VMs cannot handle such control flow as efficiently, or at all, making jits both more complex and slower with such a feature. And if they fall back to complex transformations instead then you just made matters worse by pushing expensive work that could be performed offline into the jits.

aardappel commented 6 years ago

@rossberg the difference with this proposal is that any inefficiency an engine may introduce for this feature would be local to just the use of this feature, and not affect any existing control flow current backends may emit. This allows engines to gradually address it (and backends to choose adopting it or not). As opposed to all control flow being unstructured, which creates an "all or nothing" implementation problem for engines.

rossberg commented 6 years ago

@aardappel, I don't know if that is true, unless you are willing to duplicate the compilation pipeline. Generalising an existing algorithm can well increase the cost for all cases.

aardappel commented 6 years ago

@rossberg I also don't know that it is not true, since we currently have very little information what the exact impact this would have on engines. It would be good to hear from implementors, since this is different from having only unstructured control flow. If indeed the mere presence of this feature would slow everything else down irreparably, I'd concur that we probably don't want it.

titzer commented 6 years ago

I agree with @rossberg in that this proposal's primary delta in expressiveness is only in allowing irreducible control flow (i.e. the ability to construct loops with multiple entry points). The "funclets" name and "tail_call" names obscure that this proposal is a blocklist+gotos. As such it introduces a second (redundant IMO) mechanism for control flow. It also introduces type inference for blocks, a potential source of complication (e.g. with unification/merging in the future).

If we decide we care about irreducible control flow enough to introduce a mechanism into WASM to express it explicitly, I think we should design a mechanism specifically for it--one that makes it explicit and contained, with an eye on making it easy for engines to identify and handle. E.g. a nice paper with some formalism around irreducible graphs generalizes multi-entry loops here (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.4084).

I also do not understand how this proposal is applicable to future patching of code, or what in WASM now would make that hard.

binji commented 6 years ago

Added a funclets repo here: https://github.com/WebAssembly/funclets, as we agreed in the Oct 2 CG meeting.

sunfishcode commented 6 years ago

Thanks! The repo is now populated with this proposal.

To address @titzer's comments: If there's a way to extend WebAssembly's existing mechanisms to meet the goals here, I'd be interested.

If future WebAssembly type system changes create tricky cases for the modest type inference happening here, one option would be to require funclet_sig instructions in those cases, declaring explicit signatures.

The proposal is just at stage0, and we can certainly consider other representations. From my reading of that paper, DJ graphs:

have at least as many edges as corresponding CFGs, and may have more, so they could increase code size
can be invalid in interesting ways, so they seem to be more work to validate
would be more complex to convert into explicit control transfers, which would hamper baseline compilation

According to the paper, DJ graphs can be constructed from CFGs in linear time, so implementations which want DJ graphs could quickly build them. Consequently, it appears DJ graphs aren't an advantageous choice for the binary encoding. I'd be happy to be corrected, or to learn about other systems worth investigating.

Concerning code patching, many designs are of course possible. The idea of a funclet region which could be later extended with additional funclets, and of branches which could be rerouted to different funclets, is relatively straightforward. And it's somewhat close to the level of abstraction that users of such features often work at, and somewhat close to the hardware features it's abstracting over.

AndrewScheidecker commented 6 years ago

I'm in favor of adding a way to explicitly express irreducible control flow, but not by adding new concepts (funclets/funclet calls) that parallel the existing control flow concepts (labels/branching and functions/calls). Why does this need to anything more than one new control structure operator? (e.g. multi_loop, as suggested by @rossberg).

Maybe "funclets" are a good way to expose code patching, but this proposal doesn't include code patching, and absent code patching seems unnecessarily complicated.

sunfishcode commented 6 years ago

@AndrewScheidecker I'd like to understand your comment better. Are you concerned about the number of new opcodes here? About the terminology? Both :-)?

AndrewScheidecker commented 6 years ago

Are you concerned about the number of new opcodes here? About the terminology? Both :-)?

My objection is that it seems unnecessary to add the concept of funclets in order to produce the incremental functionality that is added by this proposal: irreducible control flow. I think you have an idea of how to extend funclets to enable efficient code patching that may motivate the complexity, but given that there seems to be a simpler way to add irreducible control flow, I think it only makes sense to add funclets in combination with enough functionality to justify the additional complexity.

conrad-watt commented 6 years ago

I also feel that, if we need irreducible control flow, something like multi_loop would be a more local proposal that doesn't introduce a disjoint family of opcodes. One could also break out of a multi_loop to another enclosing multi_loop, something which, if I'm interpreting your previous comment correctly, isn't possible with funclets.

I also want to echo @binji's comments that there seem to be decisions in this precise proposal that go against existing conventions of the spec. But I can imagine a version of the proposal without them, so I don't think this factors into its existential merit for now.

sunfishcode commented 6 years ago

Code patching is not what gives this proposal its shape; it's just something that could sit on top.

Most of what you see here comes from working within the existing language, and optimizing code size. For example, except for code size, it might make sense to just use br etc. instead of funclet_call etc., as I mentioned above. And except for code size, we could choose to forego the type inference, or maybe even eliminate the ability to pass arguments on the operand stack altogether; either of those could obviate the funclet_sig opcode.

Perhaps I should write a version of this proposal that starts with just the minimal feature set, with just one opcode, and then describes the optimizations separately. It's always been the plan to evaluate various tradeoffs once we start prototyping. Having a default set folded into the proposal has helped me get useful feedback, but I can split things out too.

conrad-watt commented 6 years ago

My personal feeling is that this is over-optimizing on the size of immediates and type signatures at the expense of consistency with the rest of the spec. I might be finding it hard to properly appreciate the kind of large programs wth lots of goto this would matter for, though.

For br vs funclet_call, a clever compiler that strongly cared about the size of the immediate could order the most branched-to funclet/multiloop bodies earlier in the construct, reducing the overhead of an absolute indexing scheme.

For the type signature/inference, this would severely complicate the addition of any future feature that requires polymorphic types (the GC extension?).

nmsmith commented 5 years ago

Have there been any new thoughts in this space? After thinking about this proposal for a while, it does seem like it's essentially a multi_loop with a bit of metadata (num_preds) to hint at who can call who. That said, I think a multi_loop would be a great way to model a lot of abstractions, such as state machines, in a manner that is amenable to efficient compilation. Indeed, the whole point of the proposal seems to be to allow more efficient optimization strategies than can be implemented for module-scoped function call graphs. I give it my +1 for that.

I do appreciate that existing Wasm JITs don't currently have any logic to optimize something like a multiloop, and so they may have to transform them to function definitions (and possibly closures). Will that unacceptably increase page load times for large browser-based Wasm apps? That's unclear, but if it does there is a simple fix for programmers: benchmark your page load time and don't compile to multiloops if it's unacceptably longer. As for the suggestion that simply having multiloop handling integrated into the JIT might reduce performance for Wasm apps that don't even use multiloops: we'd need a browser implementer to investigate that. They could even add a few stubs in the places where the logic would need to be generalized and run a benchmark. If the result comes back negative, then there's nothing blocking the viability of a multi_loop feature.

sunfishcode commented 1 year ago

The current state of the funclets proposal is that I myself am no longer pushing it forward. If anyone else would like to, I'd be happy to help them get started. That said, I agree with Conrad's observation here that all the code-size optimizations in the funclets proposal aren't that important at this point, and a much simpler proposal based on multiloop would likely be the better path forward.

WebAssembly / design