faster-cpython / ideas

1.67k stars 49 forks source link

Remove all references to the instruction pointer, `next_instr` and the stack pointer, `stack_pointer` from bytecodes.c #618

Open markshannon opened 10 months ago

markshannon commented 10 months ago

The instruction pointer (IP) and the stack pointer (SP) are critical to the operation of the VM.

Because they are so performance critical, removing the need to constantly read and write them to memory is critical for performance. In the tier 1 interpreter, we write the IP every instruction, but avoid reads by caching it in a (C) local variable. We go further with the stack pointer and only store it when pushing or popping a stack frame.

In the tier 2 interpreter, we fully virtualize the IP, so that we don't even need to have it cached in a (C) local variable. We expect that in the copy-and-patch compiler we will fully virtualize the SP as well, allowing to keep some values on the stack in registers.

One optimization that is important for PyPy and was also valuable for HotPy is eliding frames for inlined funtions. To do that we also need to virtualize the frame pointer.

If we explicitly modify, load or store either the IP or the SP, we end up with ugly #ifdefs and hard to analyze code.

Obviously, we need to way to express modifications, loads and stores of the IP and SP. Micro-ops seems the obvious way to do this, although we should try to group operations for convenience and efficiency.

The operations we need are:

So, how do we define these micro-ops? In short, we define them in English, not C.

This means that all tools will need recognize and handle these special micro-ops, but as long as there aren't too many, that shouldn't be a problem.

markshannon commented 10 months ago

As a first step, we can gather all the code that uses the IP or SP into micro-ops defined in C. We can then turn those into special micro-ops later.

markshannon commented 10 months ago

The frame pointer should also be handled specially. We already special case _PUSH_FRAME and _POP_FRAME, but they don't just pop and push frames, which makes reusing them for SEND and YIELD_VALUE impossible.

gvanrossum commented 10 months ago

My general approach has been rather pragmatic -- figure out which tier 1 opcodes don't translate into tier 2 uops, then try to figure out how to make that happen, by just splitting them into multiple uops and/or by improving the tools. I've been procrastinating on SEND and friends, but when I get to them I will refactor _PUSH_FRAME and _POP_FRAME to do whatever is needed.

IMO things that need to happen at the start or end of every instruction make poor uops, since they just add dispatch overhead in the Tier 2 interpreter (I know, nobody cares about its performance). Unless, of course, there are ways to optimize these away, like we're planning for SAVE_IP. (For that to happen, we must first implement flags telling the translator whether a particular uop can ever deoptimize or error out -- this is relatively easily detected by checking for DEOPT_IF and ERROR_IF, except that we also still have a lot of instruction definitions that use (for whatever pragmatic or historic reason) goto error (~50 occurrences) or even (once) goto pop_1_error.

markshannon commented 10 months ago

I've had another look at this.

There are a few things that make micro-ops impractical for this: The code to save and load IP/SP is interwoven with other details. For example, RETURN_VALUE and YIELD_VALUE do different things with frame->return_offset. It is challenging to make micro-ops that will continue to be usable. For example, we should probably split RETURN_VALUE into a generator and non-generator version, as the frame clean up code. If the frame clean up code is part of a special micro-op we can't do that so easily.

Consequently, I think we will need special macros, not micro-ops. Where a micro-op is convenient to have, we can just wrap the special macro with a micro-op. E.g:

op(_SAVE_IP, ( -- )) {
    SAVE_IP();
}