**gvanrossum** opened 9 months ago
According to the stats:

Instruction | Count
---|---
`YIELD_VALUE` | 958M
`SEND_GEN` | 490M
`FOR_ITER_GEN` | 113M

So 63% of `YIELD_VALUE`s have a matching send/for.
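As a sanity check, the 63% figure follows directly from the table above:

```python
# Counts from the stats table above; the "matched" ratio is the fraction
# of YIELD_VALUE executions accounted for by SEND_GEN + FOR_ITER_GEN.
yield_value = 958_000_000
send_gen = 490_000_000
for_iter_gen = 113_000_000

matched = (send_gen + for_iter_gen) / yield_value
print(f"{matched:.0%}")  # → 63%
```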
I suspect the remaining 37% are the result of iterating over a generator from C code, which we want to fix anyway.
I expect that `gen.__next__()` is almost never called explicitly, `throw()`s are very rare, and explicit `send()` calls can be handled by reimplementing `send` in bytecode.

Given that, `YIELD_VALUE` can be treated in much the same way as `RETURN_VALUE`, with `SEND_GEN` and `FOR_ITER_GEN` being treated as calls.
It is `SEND_GEN` and `FOR_ITER_GEN` that are the challenge to handle, as generators may have several entry points.
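To make "several entry points" concrete, here is a plain-Python illustration (no interpreter internals): the three ways a suspended generator frame is re-entered, all of which leave again through `yield`:

```python
# A generator frame is resumed by __next__(), send(), or throw(),
# and each resumption continues at whichever yield it was suspended on.
def gen():
    try:
        received = yield 1      # resumed by __next__() or send()
        yield received
    except ValueError:
        yield "caught"          # resumed by throw()

g = gen()
print(g.__next__())          # enter from the top, run to the first yield -> 1
print(g.send(42))            # resume at the first yield, delivering 42 -> 42

g2 = gen()
next(g2)
print(g2.throw(ValueError))  # resume by raising at the yield -> 'caught'
```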
Okay, I'll educate myself about `SEND_GEN` and `FOR_ITER_GEN`. I suspect that the method `__next__()` is almost never called directly, but the builtin `next()` is called plenty of times, so we might translate that into bytecode too. `throw()` is unimportant.
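For illustration, a rough pure-Python approximation of what the builtin `next()` does in the common single-argument case (the real builtin is C and also handles a default argument; `py_next` is a hypothetical name used only for this sketch):

```python
# Hypothetical pure-Python stand-in for the one-argument builtin next():
# look up __next__ on the type, as CPython does via the tp_iternext slot.
def py_next(it):
    return type(it).__next__(it)

def gen():
    yield "a"
    yield "b"

g = gen()
print(py_next(g))  # a
print(py_next(g))  # b
```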
I am still stuck on this. E.g. how can we identify the code object of the generator while tracing through a for-loop starting with `_FOR_ITER`?
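For context, at the Python level the needed information is reachable from the generator object itself; the difficulty is that during trace projection we are looking at the bytecode stream, not holding the iterator object. A small illustration of where the information lives at runtime:

```python
# CPython-specific introspection: a generator object carries its code
# object (gi_code) and its frame (gi_frame) while it is not yet finished.
def gen():
    yield 1

g = gen()
print(g.gi_code.co_name)               # gen
print(g.gi_frame.f_code is g.gi_code)  # True
```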
Meanwhile, it looks like `_PUSH_FRAME` is very similar to what happens at `DISPATCH_INLINED()` and the `start_frame` label in ceval.c, with only a few differences:

- `_PUSH_FRAME` starts with `frame->return_offset = 0;` while this is set explicitly before calling `DISPATCH_INLINED()` elsewhere.
- `_PUSH_FRAME` omits setting `frame->prev_instr = next_instr - 1;` presumably because it is already set correctly? (And because it wouldn't work in Tier 2?)
- `DISPATCH_INLINED()` calls `_Py_EnterRecursivePy()` where `_PUSH_FRAME` merely decrements `tstate->py_recursion_remaining`. Presumably `_PUSH_FRAME` does this for speed, and maybe because those functions don't exist in the Tier 2 interpreter?

> E.g. how can we identify the code object of the generator while tracing through a for-loop starting with `_FOR_ITER`?
A few possible options exist:

- `_PyOptimizer_BackEdge()` has the frame and the stack pointer, although they are currently not passed through to `uop_optimize()`.
- Extend the `FOR_ITER` cache structure to include a function version field, which we can use similarly to how we find it for tracing through `CALL_PY_EXACT_ARGS`.

There are other complications though.
Because of `yield` statements, subsequent iterations of `FOR_ITER_GEN` may not continue at the same point in the generator's code object -- unlike `CALL`, where we always enter from the top. So we can't blindly trace through it even if we know the code object and where to resume it!
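This is easy to observe from Python: the suspended frame records a different instruction offset after each resumption, i.e. each loop iteration may re-enter the generator at a different `yield`:

```python
# CPython-specific illustration: after each next(), the generator's
# suspended frame's f_lasti points at a different bytecode offset,
# one per yield.
def gen():
    yield 1
    yield 2
    yield 3

g = gen()
offsets = []
for _ in range(3):
    next(g)
    offsets.append(g.gi_frame.f_lasti)

print(offsets)
assert len(set(offsets)) == 3  # three distinct resume points
```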
This applies mostly to generators containing more than one `yield` (intuitively these are common). There is still the initial resume, which enters from the top (after `RETURN_GENERATOR`), but that is handled by the Tier 1 interpreter anyway, because the Tier 2 executor is only triggered by the `ENTER_EXECUTOR` at the bottom of the loop.

Another random idea:
Follow the `yield` (i.e., `YIELD_VALUE`) trace into the caller. This is quite exciting, though possibly the required frame stack manipulations are too complicated.

Mark and I are hoping to dive more deeply into this at the sprint in Brno.
This is still very tricky; we decided that we have other priorities, e.g. trace stitching, for which we first need to merge the Tier 2 interpreter into the Tier 1 interpreter (https://github.com/faster-cpython/ideas/issues/631).
The basic problem is that we don't have the runtime information necessary to project the trace until we execute the code. This is a solved problem: Psyco handles it by stopping compilation, running the code, and then starting to optimize again with the new information. We can do the same.
The idea is that we exit the trace with a special micro-op that resumes optimization, passing in the actual value seen at runtime.
It is conceptually a bit complex, but not that hard to implement.
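A toy sketch (not CPython code; all names here are hypothetical) of that exit-observe-resume loop: projection stops at a value it cannot predict, emits a special exit uop, and when that exit fires at runtime the observed value is fed back and projection continues:

```python
# Toy model of "exit the trace with a micro-op that resumes optimization".
def project_trace(ops, known):
    """Project uops until an operand's value is unknown, then emit an exit."""
    trace = []
    for i, (op, arg) in enumerate(ops):
        if op == "LOAD" and arg not in known:
            # Side exit: run the code, observe arg, then resume projecting.
            trace.append(("RESUME_OPTIMIZE", i))
            return trace
        trace.append((op, arg))
    return trace

ops = [("LOAD", "a"), ("LOAD", "b"), ("ADD", None)]

t1 = project_trace(ops, known={"a"})
print(t1)  # [('LOAD', 'a'), ('RESUME_OPTIMIZE', 1)]

# At runtime the exit fires, we observe b, and projection resumes there:
t2 = project_trace(ops[1:], known={"b"})
print(t2)  # [('LOAD', 'b'), ('ADD', None)]
```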
This approach also works for calls and returns, so we could throw out the function cache, which doesn't work well for closures anyway.
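A quick illustration of why a per-code-object function cache is awkward for closures: two distinct closures share a single code object yet behave differently, so the code object alone doesn't identify the function:

```python
# Nested functions created from the same def share one code object,
# but close over different values.
def make_adder(n):
    def add(x):
        return x + n
    return add

add1 = make_adder(1)
add2 = make_adder(2)

print(add1.__code__ is add2.__code__)  # True: one code object
print(add1(10), add2(10))              # 11 12: different behavior
```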
`YIELD_VALUE` is currently the most-executed Tier 1 instruction that doesn't have a translation into Tier 2 uops.
But how should we handle it?
`yield` is highly asymmetric -- the generator's frame is entered (resumed) by calling `__next__()`, `send()` or `throw()`, but it is left through the `YIELD_VALUE` bytecode. We don't specialize any of those calls, so we can't really do something similar to `_PUSH_FRAME` on the call (yet -- if we can specialize `list.append`, presumably we can specialize `generator.__next__`).

It might be possible to trace through `YIELD_VALUE` in a way similar to `_POP_FRAME`, if we know we'll always yield to the same frame, and if we know the code running in that frame. This would only make sense if we also traced through `__next__()` or `send()` though, otherwise we'd quickly run out of frames.

Concluding, I don't see what to do for `YIELD_VALUE` in Tier 2. Am I missing something, @markshannon or @brandtbucher?