faster-cpython / ideas

1.67k stars 49 forks source link

Proposal for modified frame layout in 3.13 #657

Open Fidget-Spinner opened 4 months ago

Fidget-Spinner commented 4 months ago

I've decided to drop the plan of true function inlining for 3.13. The main problem is that debuggers use frame.f_locals modifications to work, which would break my current inlining plan.

Instead, I have another proposal that gets most of the frame inlining benefits, without the reconstruction requirements. This involves changing the frame layout slightly

image

The After layout has the following benefits:

  1. Reduced overhead via interleaved stack and locals. No copying of arguments at all for most tier 1 specialisations and simple function calls. The old frame's stack becomes the new frame's locals.
  2. Makes frame inlining easier in the future, as no copying during reconstruction is needed now.
  3. Makes the "hot" (commonly accessed) parts of the frame smaller, likely fitting better in cache.
  4. No complex reconstruction needed for sys._getframe.

There are some cons:

  1. Worse locality of data for cold fields in frame. Things like globals, builtins, code object, code attribute names, and constants are now a separate memory cache line than the frame's localsplus. I don't think this is a con with though (in fact, this is a plus). Those fields are rarely used once globals get promoted to constants, and all constants become inlined loads. We already rarely use attribute names already in tier 1 as specialisation doesn't use it. Making localsplus fit in a cache line is better.
  2. One extra local variable needed in eval frame to hold reference to new localsplus to avoid one more pointer indirection in load and store fast. Not sure if this will affect register pressure.
markshannon commented 4 months ago

Exactly how would a frame be laid out? Could you give an example frame listing all of the fields?

gvanrossum commented 4 months ago

The main problem is that debuggers use frame.f_locals modifications to work, which would break my current inlining plan.

There's got to be another way to address debuggers' needs. Surely when a debugger sets a breakpoint we can invalidate the affected executors? (That's what the valid flag is for, and the _CHECK_VALIDITY_AND_SET_IP uop.)

I also don't understand how this gets us any of the benefits claimed for function inlining. Is it just that you are planning to overlap the locals of Frame2 with the stack of Frame1 (the overlap being limited to the positional arguments)? That's an idea we've had before (#471), but it was ultimately rejected, in part in the expectation that true call inlining would get us most of the benefits. :-)

Fidget-Spinner commented 4 months ago

The main problem is that debuggers use frame.f_locals modifications to work, which would break my current inlining plan.

There's got to be another way to address debuggers' needs. Surely when a debugger sets a breakpoint we can invalidate the affected executors? (That's what the valid flag is for, and the _CHECK_VALIDITY_AND_SET_IP uop.)

Yes we can invalidate. But I don't think we can reconstruct. The correct stack_pointer is local to ceval.c. So there's no way to access it, unless we have a SET_SP instruction or similar on every impure instruction (then again, the overhead of this might be high enough that would this form of inlining be worth it?).

I also don't understand how this gets us any of the benefits claimed for function inlining.

Yes I just plan to do the overlap being limited to positional locals. That gets us half of the way there. If we use this scheme, then the inlined frame's reconstruction is just re-materialising the frame specials, with no movement or copying of the localsplus needed. This means we don't need to care about where stack_pointer is, because post-materialization, it will still point to the correct location. This is also how _PyInterpreterFrame's reconstruction to PyFrameObject works.

gvanrossum commented 4 months ago

It looks like "specials" is the entire struct _PyInterpreterFrame except for the fast-locals array (which is appended at the end using the common "variable-length-array-at-the-end" pattern). You can tell from FRAME_SPECIALS_SIZE:

#define FRAME_SPECIALS_SIZE ((int)((sizeof(_PyInterpreterFrame)-1)/sizeof(PyObject *)))

It's about 9 pointers worth, or 72 bytes per frame. (I hadn't appreciated that before -- I vaguely recall that we had something called "specials" that was just three pointers -- globals, builtins and one other --, but I can't find it in any releases.)

The specials, locals and stack (i.e., the entire struct, including the variable-length array at the end), are allocated contiguously by the allocator for frame structs -- it uses a chunked allocator whose internal data structure is referred to as the "data stack". Allocation here is highly optimized (the main allocation function is a static inline _PyFrame_PushUnchecked). Contiguous allocation is the name of the game here, both for locality and to make the allocator super fast.

If we were to move to the proposed alternate layout, it seems we would need two separate allocation data structures, one for the specials, another for the localsplus data (and they'd grow at different rates, so the chunking code would become quite a bit hairier). This feels like it would slow down the allocator and weaken locality of memory references, so it would have to have quite significant other benefits besides solving some issue with the debugger. (Though I realize that you think that memory locality will actually be improved -- you may be right, we will have to measure.)

For many small functions (the ones where I'd assume that call inlining would be the most effective) I expect that the benefit from the overlap approach (not having to copy the arguments) is probably not so large -- these usually have 1-2 arguments, so the overlap would be only 1-2 pointers.

[...] the inlined frame's reconstruction is just re-materialising the frame specials [...]

"Just", but it's 8 pointers, an int, an uint16, and a char, that would have to come from various places. (I realize you have coded this once already, but I've lost my pointer to your branch, so I can't check it to see if maybe I'm overlooking some simplifying trick.)

All in all I am still stuck in a mindset where I don't understand how frame reconstruction gets all the needed pieces, which makes it difficult to assess your proposal.


Now, regarding executor invalidation and debuggers. It sounds like there are two cases: debuggers, which set breakpoints or step through code, and sys._getframe() (or its C API equivalent). I'd like to focus on single-threaded scenarios. (A debugger could be started from a different thread, but that's either synchronized by the GIL or by the free-threading stop-the-world feature, and in either case the target thread has to reach a CHECK_EVAL_BREAKER() macro, at which point it can side-exit (reconstructing frames) at leisure.)

So something has to invoke the debugger or sys._getframe(). Of course, there could just be a call to the latter in the executor. Or it could be hidden in some DECREF. So, okay, if that happens inside an inlined function, we are indeed stuck with the need to reconstruct a frame struct (and possibly a frame object) for the frame that's making the call, and this does feel awkward.

Then I agree that frame reconstruction is a tough job, and we can't count on magical executor invalidation to warn us. :-(

But maybe we can do a limited form of call inlining, where we only inline calls that contain no uops with the HAS_ESCAPES_FLAG flag set? There are plenty of those. Then at least we know we won't encounter sys._getframe() (nor a surprise in a DECREF, assuming our escape analysis is correct). There are still the side exits, deopts, and errors to deal with, but those at least are predictable.

To me, doing limited inlining while keeping the frame stack layout unchanged feels more likely to be beneficial than the infrastructure work needed to separate the "specials" from the locals-plus. But I'm just a bear of very little brain...

gvanrossum commented 4 months ago

(Sorry, I rewrote that several times -- hopefully it doesn't sound too discombobulated.)

Fidget-Spinner commented 4 months ago

I posted the code for what we need in a single frame's reconstruction data. If we go with this proposal, it will be even simpler than that https://github.com/faster-cpython/ideas/issues/653#issuecomment-1960860756

where we only inline calls that contain no uops with the

Or we can inline only fully pure functions.