idea: "Multi-return" functions for faster Rust `Result` and/or simpler exceptions

Ericson2314 commented 6 years ago

From the paper "Multi-return Function Call" (http://www.ccs.neu.edu/home/shivers/papers/mrlc-jfp.pdf). The basic idea from the perspective of compiled code is to include multiple return pointers in a stack frame so functions can return to different places.

Compared to `Result<T,E>`

This is denotationally the same as return a value of a Rust enum with 1 variant according to each of the return pointer slots, with fields according to the returned data associated with that slot (registers, spilled stack slots, etc). But with the naive enum calling convention of adding a tag field, the caller needs to branch on the tag field, even if the enum value was just created before the return so nothing is in principle unpredictable. In the common case of a function "rethrowing" a Err, (Err(e0) => ... return Err(e1) ... math arm), the native way results results on O(n) branches (one per stack frame) one each of the Err tags, while this way allows the error return pointer to point to disjoint control flow for the failure case, catching and rethrowing without additional branches, so the only 1 branch is the original failure condition.

Compared to unwinding

Success and failure control flow is implemented identically, avoiding significant effort on the part of compiler writers in maintaining a completely separate implementation concepts while optimizations can work with both, and don't get stuck on the success failure boundary. At run time, the lack of any DWARF-like interpreters reduces dependencies and simplifies things too.

In short, we have the asymptotic efficiency of unwinding with the implementation (compiler and run-time) efficiency of enum return.

I talked to @eddyb about this once and he said to talk to someone on #cranelift, but alas I am on IRC less these days and I forgot their nick. Opening this to make sure the idea isn't lost completely due to my negligence. Cheers.

sunfishcode commented 6 years ago

This is a cool idea! It's very interesting to think about using Cranelift to explore custom calling conventions for Rust.

(For actual unwinding, Rust's hands are tied in practice by the practical need to interoperate with existing C/C++ ABIs. And the DWARF unwinding logic is already provided, so all we have to do is provide the appropriate tables. But for Result<T, E> we're of course free to do whatever we want.)

One of my ideas in this space is to use the CF flag on x86 (other architectures have other architecture-specific things we could look at) to hold the discriminant of Result<T, E>, sketched out in a little more detail here. One thing that makes it similar to multi-return functions is that it would also benefit from the concept of a tuple of return types. It seems like there'd be slightly different performance characteristics, but it'd be good to experiment to get a better sense of how it plays out in practice.

If anyone is interested in laying the foundations for this kind of exploration in Cranelift, I think the first step here is to generalize Cranelift IR's concept of a function signature to permit a tuple of return types.

Ericson2314 commented 6 years ago

If anyone is interested in laying the foundations for this kind of exploration in Cranelift, I think the first step here is to generalize Cranelift IR's concept of a function signature to permit a tuple of return types.

Glad to here it! That's what i was thinking / hoping wouldn't get in the way of any other things. Sounds like a nice fun refactor.

sunfishcode commented 6 years ago

One further thought here: I think Cranelift IR wants a concept of a primary return type, optionally with secondary return types added. This might look a little unusual from the abstract perspective where it's just a tuple of return types, however I think it makes sense at the level of Cranelift IR.

When we get the the point where we're adding a version of the call instruction which supports multiple return destinations, the primary return can resume the EBB containing the call, and the secondary returns can target other EBBs. And, the lowering for secondary returns will look a little different than for primary returns, because the primary return can just use the hardware's normal return mechanism.

Ericson2314 commented 6 years ago

Well next like to do the empty list for diverging functions, since they don't return at all. Couldn't a convention that the first return type is the normal one suffice? For your CF example, the ABI is the same (flag must be checked either way). For mine, the only loss is short-hands like the ret instruction. The conceptual two steps of IP <- SP[N]; SP += M are the same, just N != M - 1 in general. For yours, the flag must be checked in all cases so the ABIs are the same!

sunfishcode commented 6 years ago

Oh, cool. It hadn't occurred to me that diverging functions would just have an empty tuple. But I see how that makes sense.

Currently in Cranelift IR, the last instruction in an EBB is required to use one of the opcodes which is specially designated as a "terminator". Regular calls aren't terminators, so for a call to a diverging function, maybe we could use a special call_diverging family of instructions. That would get awkward as we gain more kinds of calls, that each need a "diverging" variant.

There might be ways we could fix that, though this may take a bit more design work.

Ericson2314 commented 6 years ago

I went with a call_table terminator, which can deal with any number of return slots. Perhaps regular call can be extended to, but this plays nice with the existing surface syntax and was easy to do.

Ericson2314 commented 6 years ago

I've started implementing things. https://github.com/Ericson2314/cranelift/tree/multi-fun-sig is the signature generalization. https://github.com/Ericson2314/cranelift/tree/call_table is the boilerplate to add call_table and call (which would come later).

For the former, most of the work is boring returns -> multi_returns.get_single_return_unwrap(), in lieu of any instructions supporting multi-return. But one issue popped up already for legalization:

legalize_signature's current param sketches me out. I can't decide whether the regs/stack slots for additional return pointers should always be allocated (like args) or only sometimes allocated (like the link). I'd understand current better if this was just a caller- vs callee- saved distinction, but it isn't.

From the limited amount I know so far, I'd like to make link unconditionally part of the legalized signature:

Even on x86 we can say the first stack slot is the "link", AbiParam should support that.
For no-return functions, and custom calling conventions, we could reuse the link register/stack slot for something else if we wanted. So it strictly speaking isn't an unconditional constraint.

sunfishcode commented 6 years ago

Catching up here: Yes, making the x86 return value on the stack a "link" argument (when it needs one) sounds reasonable to me.

Ericson2314 commented 6 years ago

Thanks. How about the current parameter from https://github.com/CraneStation/cranelift/blob/bb267484ac46a4ba2a2bbd7bc3f9e60aeaf0ff52/lib/codegen/src/isa/mod.rs#L265 ? What makes link regs/stack slots --- caller-visible parts of the ABI for sure---only done in the "current" case?

sunfishcode commented 6 years ago

For the link register, Cranelift certainly does need to be aware of the link register Value passed into a function on the callee side, so that it can do the return. But the outgoing link register to a call is set by the call instruction itself in most ISAs (for traditional single-return calls). It's awkward to model this precisely, because the call would need to be a single instruction that defines the return address Value and also uses that Value itself as an argument to the call. Normally, instructions can't read values they produce.

For stack slots, where are they handled differently for the "current" case?

Ericson2314 commented 6 years ago

For stack slots, where are they handled differently for the "current" case?

I forget in the 2 weeks since I last touched the code :). I'm guess-remembering they may have been just implicit in both cases.

It's awkward to model this precisely...Normally, instructions can't read values they produce.

I think the solution is to get rid of the current and always model the link.

First, the machine's operation: Since the callee consumes the link (register or stack slock)---i.e after return the stack poiner is bumped or the register can contain something else---the call instruction indeed shouldn't produce or consume the return address. But is important that the callee have that register in it's ABI: a no-return function should be "called" with an unconditional jump and not a call instruction if the link stack/reg is to be repurposed for some other use.

If one thinks thinks of ABIs as types like then we have something like:

call :  reg
     -> (a = (pre-constraints + link-with-addr => post-constraints))
     -> ((reg = fun(a)) + pre-constraints => post-constraints + link-clobbered)
jump :  reg
     -> (pre-constraints => truely-empty-constraint-meaning-unreachable) 
     -> (reg = fun(a)) + caller-pre-constraints => truely-empty-constraint-meaning-unreachable)

So it's sort of like a higher order function (-> is the syntactic function type for assembly code operators, => is the "semantic" function type for instructions). The point being is that call's constraints to not precisely correspond to the constraints of the function being called. The finally call instruction has the same rough constraints, but those don't correspond exactly to constraints of the function being called.

My guess is current is sort of hack to recalculate the function's ABI so it more closely matches the call instruction's ABI, but that ignores the fact that call and jump require different things of their callee / destination address as I described first.

sunfishcode commented 6 years ago

I think you're right, and I agree that it would be nice to make caller and callee work the same way here. However I don't yet see how we can do that cleanly.

If we put the link in the callee's signature, we'll need to also add a Value as an actual argument to the call, because basic IR invariants require the number of actual arguments to match the number of formal parameters. The problem is, we don't have such a Value. It won't exist until the call instruction itself produces it. Basic IR invariants require that Value uses be dominated by their definitions, and an instruction's outputs don't dominate its inputs.

Ericson2314 commented 6 years ago

Where are these "Basic IR invariants" in the code? I'm very much willing to try teach them the higher-order nature of this stuff :).

sunfishcode commented 6 years ago

Instructions can't use values they themselves define:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/src/verifier/mod.rs#L868

Call arguments must match signature parameters:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/src/verifier/mod.rs#L1219

Ericson2314 commented 6 years ago

Thanks!

Ericson2314 commented 6 years ago

Sweet! So I think the first bit can stay the same. The second bit I'd change by adding some extra link args to match against the inferred virtual parms, so everything verifies up. Those args wouldn't come from any particular instruction per the normal way, but rather be made up out of thin air since they come from the "first half" of the call instruction. Initially, they'd be the same for every call instruction (for a given ISA), but once we get multi return, different call instructions would yield different virtual args.

[BTW, this also gets us closer to using br for tail calls. In that case the old link value needs to be routed to the right place like a normal argument. br wouldn't create any special arguments, so the currently-current-only link parameter and normally-routed link argument would need to match up just as the spoofed call-provided arguments would.]

Looking at that code reminds me, is there any issue to add argument/returns for br_table? I'd say it would make sense to so generalize br_table and remove current, then clean up and finish what I already started (call_table and multiple ret in sigs).

Ericson2314 commented 6 years ago

[Argument/returns for br_table also gets us closer to your trick for storing the tag in condition flags + scalar replacement of aggregates. (I realize now mine is fundamentally different as it is getting rid of the tag altogether rather than storing it differently.) In that case, there are some opaquely used registers, and based on the branch taken those registers are revealed to be used for something or actually unused, based on the revealed variant.]

sunfishcode commented 6 years ago

Ok, I'll be interested to see what this ends up looking like :-).

Adding arguments to br_table is theoretically doable, though it's not trivial. The big issue is the interaction with critical edges. For example, if we have a CFG like this:

     A     B    C    D
       \  /  \ /  \ /
        E     F    G

Suppose all these edges are branched via br_table, and for whatever reason we decide we need an argument on the branch from A->E. Now E has a parameter, so B needs an argument. But that means F needs a corresponding parameter, then C needs a corresponding argument, then G needs a corresponding parameter, then D needs a corresponding argument. This is pretty obscure, but this kind of situation where what should have been a localized transformation ends up needing us to ping pong around the IR to keep everything in sync is a corner case we'd like to avoid worrying about.

One idea we've tossed around is to change br_table's arguments from being a single sequence of values to be a sequence of values for each destination. For example, instead of:

     br_table v0, ebb7, jt0(v2, v3)

we might have:

    br_table v0, ebb7(v2, v3), jt0(ebb8(v2, v3), ebb9(v2, v3))

or so. Now we can do things between this branch and say ebb8 without affecting the other destinations.

All that said, if you have other ideas here, I'm interested!

sunfishcode commented 6 years ago

A few more notes:

Traditional SSA form with phis rather than branch arguments avoids this problem because the phis hold the actual arguments in the successors. This is why the approach sketched above works, because it provides the same information that phis would provide.
The reason we don't currently have this problem is that our regular conditional branches use a fallthrough when the condition is false, and the fallthrough path doesn't have parameters. br_table and indirect_jump_table_br (which br_table lowers to) are the only two branch instructions which have multiple successors that have parameters.

But also, thinking about this more, I think we may not actually need br_table for here. If you're just picking between a small number of return paths, a conditional branch or two will be much more efficient than a jump table.

gnzlbg commented 5 years ago

For actual unwinding, Rust's hands are tied in practice by the practical need to interoperate with existing C/C++ ABIs.

Note that if a foreign extern "C" { } (or extern { }) function declaration unwinds the behavior is undefined. Also, an uncaught panic in extern "C" fn aborts the process. That is, an implementation can assume that C foreign functions and extern "C" functions do not unwind.

Being able to interoperate with C++ ABIs that unwind would require extern "c++", and handling those ABIs will probably require inserting shims that translate Rust panics to/from C++ exceptions for the particular target (unless the extern "c++" fn or the extern "c++" { } declaration are explicitly specified to never unwind).

Ericson2314 commented 5 years ago

I've been looking at this again. I'm thinking of a roadmap of

"expose" link registers more. Risc-V return can explicitly consume %1 (or any GPR), rather than the function doing so with the current hack.
Remove current from legalize_signature altogether, check_call_signature uses a per-ISA method to provide the link registers so the the arguments + link and parameters line up.
Since only RISC-V leverages current today, I'm not sure whether the other ISAs need to change for the previous steps. If not, then change them now: x86's stack slot and ARM's link register should be exposed.
Expose abstract link values in the Cranelift IR. Now that each ISA understands the links, it can be an explicit value in the ISA. There be only one possible value to pass to return, but that's only for now...
Actually implement multi-return functions. However the call_table stuff shakes out, the function with have one "link parameter" per multi-return for it to use with return return ret0(v16) , return ret1(v16, v5), etc.

I'm wondering whether https://github.com/CraneStation/cranelift/pull/750 can immanently land or not, in which case I'll probably just change the rust "meta" and wait.

hanna-kruppe commented 5 years ago

Current discussion in other contexts has reminded me of this issue and I'd like to note that the "dynamically select between multiple return addresses to jump to" strategy has a serious performance issue that (as far as I can tell) wasn't mentioned in this thread and in the paper linked by @Ericson2314 at the start.

Modern CPUs use a specialized kind of branch prediction named return address stack (RAS) that exploit the usually strict pairing of call and return instructions to cheaply and very accurately predict return addresses. As far as I know, this feature is ubiquituous not just among x86 processors (where it's been everywhere since Pentium 1) but more generally among processors with any sort of dynamic branch prediction.

On any CPU with a RAS, returning to a different address than the one right after the corresponding call will always cause a misprediction (with all the costs that entails), assuming the "unusual return" is recognized as a return instruction by the CPU. If it's not recognized as such, that's even worse, because now there's a superfluous return address on the RAS (pushed by the call and then not popped by the non-standard return). This can cause multiple subsequent returns to be mispredicted.

So while this strategy is elegant at the language and ISA level, I don't think microarchitectures will enjoy it very much. It seems likely to me that these mispredictions will usually dwarf the cost of checking a tag and branching accordingly after each ordinary return, even taking into account that this will be replicated at each level of the call stack instead of jumping straight to the code that actually handles it (which I believe is rarely very far up the stack in practice, among other reasons due to inlining and drop glue that has to run).

rpjohnst commented 5 years ago

That cost might be reduced by replacing the corresponding call with a push; jmp- no effect on the RAS; instead just a normal indirect branch in the callee. The return+branch may be faster still but it's not black and white.

hanna-kruppe commented 5 years ago

Yes, sorry for glossing over the "avoid the RAS entirely" option. It's a possibility that may be competitive, but one has to be aware of the problem in the first place to do that.

Still, depending on what code sequences get interpreted as calls and returns (needs fragile uarch knowledge!) and what alternatives the ISA offers, it can increase code size and cycles needed. Furthermore, you'll probably lose some performance indirectly by leaving all your calls to the indirect branch prediction. It likely won't quite reach the misprediction rate of the RAS, and it can cause mispredictions elsewhere due to limited branch predictor capacity.

Ericson2314 commented 5 years ago

@rkruppe

This is an instance of the classic problem where hardware optimizing for current software and software optimizing for current software stifles innovation.

I think the best thing to do is just try to break the cycle. implement with workarounds like push, and implement in multiple places. (I hope to do this in GHC's native code generator someday.) Then, if webassmbly gets popular enough, one can hope and pray the hardware companies take notice. This isn't really a radical departure from the status quo, so one can hope!

hanna-kruppe commented 5 years ago

I'm not sure what sort of action you're expecting from hardware designers. RASes are used because they are a very effective and economical way to handle matching call/return pairs. Not building them just degrades peformance (at a fixed design point) of code following such a discipline, of which there will still be a ton. Having certain calls and returns exempt from the RAS is already generally possible with workarounds as noted, and even with better architeural support for that (equally short code sequence as for calls/returns that involve the RAS) the disadvantages of having these returns rely on the general indirect branch predictor instead of the the RAS remain.

Some variations of the RAS design that support e.g. two targets (but no more) seem vaguely plausible to me, but I'm not a hardware designer and it's probably not a foregone conclusion how well that would work.

Ericson2314 commented 5 years ago

@rkruppe Remember the alternative to this is a return and then an enum tag branch. It doesn't need to be perfect, just better than that. If it can predict the second branch directly, presumably it is already storing the information necessary so it is just a matter of storing it differently not storing more of it.

Some variations of the RAS design that support e.g. two targets (but no more) seem vaguely plausible to me, but I'm not a hardware designer and it's probably not a foregone conclusion how well that would work.

There's that or even an explicit pop with normal stack, under the assumption that the Err path is cold and the mispredict isn't so bad.

In the two branch case, you can perhaps assume that Ok leads to OK, and Err leads to Err, which can speed up the error case to the extent that matters.

rpjohnst commented 4 years ago

This paper does some measurements on a bunch of different call stack implementations, including a small section on the impact of (not) using native call/ret instructions: https://kavon.farvard.in/papers/pldi20-stacks.pdf

bytecodealliance / wasmtime