faster-cpython / ideas


JIT: Register passing proposal & reducing stack traffic significantly #663

Closed Fidget-Spinner closed 3 months ago

Fidget-Spinner commented 3 months ago

This proposal is less of "register allocation" and more of "register passing". However, it should achieve the same effect. Inspired partly by Brandt's proposal.

The optimizations

There are two optimizations proposed here:

  1. Pass the top 5 operands of the stack via the C calling convention, rather than through the CPython operand stack.
  2. Elide all CPython operand stack pushes and pops when operating on these arguments, so there is no stack traffic. This takes inspiration from Pyston's JIT compiler (called the deferred value stack there). As a reminder, Pyston-full is 65% faster than 3.8 on pyperformance, which makes it faster than 3.12 and likely 3.13, so it's probably doing something right there :).
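To make the second point concrete, here is a minimal sketch of the difference (all names are invented for illustration, and plain ints stand in for PyObject*): the conventional form round-trips every operand through a memory-based stack, while the register-passed form never touches memory at all.

```c
#include <assert.h>

/* Conventional interpreter: every operand round-trips through a
 * memory-based operand stack. */
static int add_via_stack(int *stack, int depth) {
    int rhs = stack[--depth];          /* pop */
    int lhs = stack[--depth];          /* pop */
    stack[depth++] = lhs + rhs;        /* push result */
    return stack[depth - 1];
}

/* Register-passed form: the top-of-stack values arrive as C
 * arguments, which the C calling convention keeps in machine
 * registers, so no memory stack traffic occurs. */
static int add_via_regs(int reg0, int reg1) {
    return reg1 + reg0;                /* reg0 is TOS; result stays in a register */
}
```

Both compute the same value; only the second avoids the loads and stores.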

What this will look like

The template will be changed to this:

// 5-argument form
// reg 0 - TOS
_Py_CODEUNIT *
_JIT_ENTRY(_PyInterpreterFrame *frame, PyObject **stack_pointer, PyThreadState *tstate, PyObject *reg0, PyObject *reg1, PyObject *reg2, PyObject *reg3, PyObject *reg4)

The tail call to the continuation will thus look like this:

// PATCH_JUMP macro expanded

// 5-argument form with variadic args
((jit_func)&_JIT_CONTINUE)(frame, stack_pointer, tstate, reg0, reg1, reg2, reg3, reg4);
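A runnable mock of how two stencils could thread the five registers through such a continuation call (hypothetical `stub_*` names, plain ints instead of PyObject*, and frame/tstate dropped for brevity):

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the 5-register template signature. */
typedef int (*jit_func)(int *stack_pointer,
                        int reg0, int reg1, int reg2, int reg3, int reg4);

/* Final continuation: returns TOS so the chain can be observed. */
static int stub_exit(int *stack_pointer,
                     int reg0, int reg1, int reg2, int reg3, int reg4) {
    (void)stack_pointer; (void)reg1; (void)reg2; (void)reg3; (void)reg4;
    return reg0;
}

/* One stencil: adds the two topmost registers, then jumps to its
 * continuation like the expanded PATCH_JUMP macro, passing the
 * remaining registers along (shifted, since the stack shrank by one). */
static int stub_add(int *stack_pointer,
                    int reg0, int reg1, int reg2, int reg3, int reg4) {
    int res = reg1 + reg0;   /* reg0 is TOS */
    return ((jit_func)stub_exit)(stack_pointer, res, reg2, reg3, reg4, 0);
}
```

The operands never touch `stack_pointer`; they simply flow from one stencil's arguments into the next call.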

How this will be generated

At build time, any uop that uses the top 5 stack operands will not push to or pop from the CPython operand stack. Instead, the case generator rewrites the uop's stack input/output effects to read and write those register arguments directly.
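One way the case generator's rewrite could work is a simple mapping from stack depths to register names, falling back to memory accesses for operands deeper than the five register-passed slots. This is a hypothetical helper (the name and fallback layout are assumptions, not the actual generator code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_REGS 5

/* Given a stack input effect at depth k below TOS (k = 1 is TOS),
 * emit either a register name or a memory access for operands deeper
 * than the five register-passed slots.  With the top NUM_REGS values
 * held in registers, stack_pointer[-1] is the (NUM_REGS+1)-th value. */
static const char *operand_expr(int k, char *buf, size_t n) {
    if (k >= 1 && k <= NUM_REGS)
        snprintf(buf, n, "reg%d", k - 1);             /* reg0 == TOS */
    else
        snprintf(buf, n, "stack_pointer[-%d]", k - NUM_REGS);
    return buf;
}
```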

Overall, these should yield a significant speedup. Register allocation is, IIRC, the second most worthwhile optimization in the paper after zero-length jumps. On top of that, we eliminate a lot of stack traffic in all but a few cases.

How to handle deopt, side exit, and error

Thankfully this is not complex. For a uop, push all of its register operands (reg0 - reg4) to the stack before exiting to the interpreter.
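The spill path could look roughly like this (a sketch under the assumption that reg0 is TOS and reg4 is deepest; plain ints stand in for PyObject*, and the function name is invented):

```c
#include <assert.h>

/* Before handing control back to the interpreter on deopt, side exit,
 * or error, flush every register-passed operand to the CPython operand
 * stack so the interpreter sees a consistent frame. */
static int *spill_regs(int *stack_pointer,
                       int reg0, int reg1, int reg2, int reg3, int reg4) {
    /* reg4 is deepest, reg0 is TOS, so push deepest-first */
    *stack_pointer++ = reg4;
    *stack_pointer++ = reg3;
    *stack_pointer++ = reg2;
    *stack_pointer++ = reg1;
    *stack_pointer++ = reg0;
    return stack_pointer;   /* the interpreter resumes with this SP */
}
```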

Concerns

  1. Stack overflow -- we can just make the abstract interpreter in the optimizer bail when it sees too large a stack.
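The bail-out check could be as simple as tracking the running stack depth over a trace's net stack effects and refusing to JIT once it grows too large. A sketch (the threshold and function name are assumptions for illustration):

```c
#include <assert.h>

#define MAX_ABSTRACT_STACK 5   /* hypothetical bail threshold */

/* The optimizer's abstract interpreter accumulates each uop's net
 * stack effect and bails on the trace once the live stack depth
 * exceeds the limit.  Returns 1 if the trace is acceptable. */
static int check_trace(const int *stack_effects, int n) {
    int depth = 0;
    for (int i = 0; i < n; i++) {
        depth += stack_effects[i];
        if (depth > MAX_ABSTRACT_STACK)
            return 0;   /* bail: too much live stack */
    }
    return 1;
}
```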
markshannon commented 3 months ago

This seems rather vague. Do you have an algorithm? How would the work be split between the code generator and the optimizer?

Why 5 arguments? The optimum number will vary from platform to platform.

How does this differ from top of stack caching as proposed in https://github.com/python/cpython/issues/115802?

Fidget-Spinner commented 3 months ago

It's the same as stack caching for now. Closing this.