faster-cpython / ideas

1.69k stars 49 forks source link

hook for JIT to support `_PyFrame_GetFrameObject` #575

Open carljm opened 1 year ago

carljm commented 1 year ago

A JIT will likely not want to maintain the full state of a _PyInterpreterFrame (most importantly, stack/locals; maybe also prev_instr) during execution of JITted code. In general, it should be the JIT's responsibility to ensure the interpreter frame is in the correct state before it returns control to the interpreter.

However, the functions PyEval_GetFrame and PyThreadState_GetFrame, which return a full PyFrameObject for the current interpreter frame, are public C API. PyThreadState_GetFrame is used internally by the coroutine_origin_tracking_depth feature and by PyErr_WriteUnraisable.

sys._getframe, sys._getframemodulename,sys._current_frames, and construction of tracebacks also reify one or more PyFrameObject from interpreter frames.

All of the above ultimately use _PyFrame_GetFrameObject, which takes a _PyInterpreterFrame and returns a PyFrameObject.

A JIT that does not constantly maintain the full state of the executing _PyInterpreterFrame will need some way to be notified when a full PyFrameObject is needed for one of its frames, so that it can populate the frame object appropriately based on its own internal execution state.

I'd ideally like to find a solution here that will work both for Cinder JIT and for future faster-cpython JIT, even given their differing designs.

So, questions:

  1. Is the feature request sensible? Is there an entirely different way that a JIT should handle this?
  2. What shape should the API take? Perhaps the simplest possibility is a single interpreter-wide function pointer that can be set, whose default value is _PyFrame_GetFrameObject.
carljm commented 1 year ago

For context, for a long time there was a similar hook (added for the sake of Psyco way back in 2002!), but it was removed in Python 3.9 (https://github.com/python/cpython/issues/84127). But the implementation of frames has changed a lot since then.

That hook allowed overriding PyThreadState_GetFrame entirely. That's more capable (because it allows the hook setter to track their frame stack entirely separately from the interpreter; I don't think Cinder needs that capability). But with today's implementation of frames, that form of hook wouldn't be sufficient, because e.g. sys._getframe doesn't use PyThreadState_GetFrame. If it did, it would have to walk the chain of PyFrameObject to the one requested, materializing every frame along the way, which is a lot less efficient than what it does today, which is walk the stack of _PyInterpreterFrame and only create a single PyFrameObject for the requested frame.

markshannon commented 1 year ago

The whole point of the _PyInterpreterFrame is that it is cheap to create. So, in theory, Cinder should be able to use it and not worry about frame objects at all, as the VM would take care of them.

How could we make _PyInterpreterFrame even cheaper, so that you can do that?

You'll need a link, so previous has to remain. The f_globals and f_builtins fields could be dropped, but that might hurt the performance of LOAD_GLOBAL. Alternatively we could restrict their use to the interpreter, so that Cinder could ignore them. We do need a way to access the globals and builtins, though, so f_funcobj seems useful. f_locals is the "slow" locals, so just needs to be NULLed. prev_instr is needed to find the line number. I guess we could allow it to be NULL, but setting it to NULL isn't much cheaper than setting to the start of the function.

There's a word of offsets and flags. They are introspected, so can be ignored.

That leaves the larger part of the frame: the locals and stack. The stack isn't visible outside of the interpreter, so there is no need to set the stack. That leaves the local variables. You'll need to store the values somewhere across calls (and there aren't many callee save registers left after the VM state), so why not put them in the frame? They can be kept in register between calls.

carljm commented 1 year ago

Yep, I'd already come to the same conclusion that a) we should just use _PyInterpreterFrame in 3.12, and b) populating most of it shouldn't be a big issue for us. The locals are indeed the sticking point.

That leaves the local variables. You'll need to store the values somewhere across calls (and there aren't many callee save registers left after the VM state), so why not put them in the frame? They can be kept in register between calls.

This is an interesting suggestion. We may need to do some experimentation to see how much worse it would be than our current approach (we spill to C stack only values stored in caller-saved registers.) Off the top of my head there's a few components that could potentially make it worse: a) locality and code size effects of spilling to Python stack instead of C stack, b) needing to always spill all locals, not just those stored in caller-saved registers, c) needing to consult runtime metadata that maps registers to local indices, which we currently only have to consult when we deopt.

carljm commented 1 year ago

After looking at this a bit, I think it's clear that spilling locals to the _PyInterpreterFrame around all calls is going to be too expensive to be feasible. It would be much better to find a solution that doesn't constrain an optimizer to always (on calls) maintaining locals (or ideally, anything else it doesn't need) in the _PyInterpreterFrame.

I also think a global hook maybe isn't a great idea; it makes it hard for multiple optimizers to coexist.

A better idea might be to reuse f_funcobj, since it's currently only needed internally to the interpreter, for fetching the function closure. So we could set that instead to a vector-callable object that knows how to fully update the _PyInterpreterFrame, and then we'd just need _PyFrame_GetFrameObject to do something like if (!PyFunction_Check(frame->f_funcobj)) { PyObject_Vectorcall(frame->f_funcobj, frame, 1, NULL); } before going ahead and making the full frame object. Or if the PyFunction_Check is too implicit, there's room for a few new bitflags on _PyInterpreterFrame without actually making it bigger.

carljm commented 1 year ago

After looking more into this, it seems like we could be OK with having f_locals on a JITted frame just be an empty dict. But there is one other problem: prev_instr. That ends up responsible for f_lineno on a frame, and we do need these to be correct. Updating prev_instr on the frame before every call (and every instruction that can trigger arbitrary code execution) would not be ideal for performance. So I think we still don't see a good way around having some kind of "frame reification" hook :/