faster-cpython / ideas


How to handle YIELD_VALUE in Tier 2 #628

Open gvanrossum opened 9 months ago

gvanrossum commented 9 months ago

YIELD_VALUE is currently the most-executed Tier 1 instruction that doesn't have a translation into Tier 2 uops.

But how should we handle it?

yield is highly asymmetric -- the generator's frame is entered (resumed) by calling __next__(), send() or throw(), but it is left through the YIELD_VALUE bytecode. We don't specialize any of those calls, so we can't really do something similar to _PUSH_FRAME on the call (yet -- if we can specialize list.append presumably we can specialize generator.__next__).
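To make the asymmetry concrete, here is a plain Python illustration (nothing Tier 2 specific): the generator's frame is resumed from the outside, but control leaves it at each yield, and the disassembly shows a YIELD_VALUE instruction at every yield point.

```python
import dis

def gen():
    # The frame is (re-)entered via __next__()/send(); each `yield`
    # leaves it through the YIELD_VALUE bytecode.
    x = yield 1
    yield x + 1

g = gen()
print(next(g))      # resumes the frame; stops at the first yield -> 1
print(g.send(41))   # resumes at the yield expression; x = 41 -> 42

dis.dis(gen)        # shows YIELD_VALUE at each yield point
```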

It might be possible to trace through YIELD_VALUE in a way similar to _POP_FRAME, if we know we'll always yield to the same frame, and if we know the code running in that frame. This would only make sense if we also traced through __next__() or send() though, otherwise we'd quickly run out of frames.

In conclusion, I don't see what to do for YIELD_VALUE in Tier 2. Am I missing something, @markshannon or @brandtbucher?

markshannon commented 9 months ago

According to the stats:

| Instruction  | Count |
|--------------|------:|
| YIELD_VALUE  | 958M  |
| SEND_GEN     | 490M  |
| FOR_ITER_GEN | 113M  |

So roughly 63% of YIELD_VALUE executions have a matching SEND_GEN or FOR_ITER_GEN. I suspect the remaining 37% are the result of iterating over a generator from C code, which we want to fix anyway.

I expect that gen.__next__() is almost never called explicitly, throw()s are very rare, and explicit send() calls can be handled by reimplementing send in bytecode.
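As a quick reminder of the contract that any bytecode reimplementation of send() would have to preserve (plain Python, nothing implementation specific): the first send() must pass None and is equivalent to next(), and each subsequent send(v) resumes the frame with v as the value of the pending yield expression.

```python
def counter():
    total = 0
    while True:
        received = yield total
        total += received

c = counter()
assert c.send(None) == 0   # first send must be None; same as next(c)
assert c.send(10) == 10    # resumes at the yield, delivering 10
assert c.send(5) == 15
```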

Given that, YIELD_VALUE can be treated in much the same way as RETURN_VALUE, with SEND_GEN and FOR_ITER_GEN being treated as calls.

It is SEND_GEN and FOR_ITER_GEN that are the challenge to handle, since generators may have several entry points.
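To see why, here is a small illustration (plain Python, using dis): a generator body has a distinct resume point after every yield, so an instruction like SEND_GEN that re-enters the frame cannot know statically which point it will continue from.

```python
import dis

def two_yields():
    a = yield "first"
    b = yield "second"
    return a, b

# The disassembly shows a separate resume point after each YIELD_VALUE.
dis.dis(two_yields)
```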

gvanrossum commented 9 months ago

Okay, I'll educate myself about SEND_GEN and FOR_ITER_GEN. I suspect the __next__() method is almost never called directly, but the builtin next() is called plenty of times, so we might translate that into bytecode too. throw() is unimportant.
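For reference, the builtin next() is roughly equivalent to the following pure-Python model (simplified; the real builtin is C code and also type-checks its argument), which is the kind of thing a bytecode translation would have to express:

```python
_MISSING = object()

def next_model(it, default=_MISSING):
    # next(it) calls the type's __next__; an optional default suppresses
    # StopIteration.
    try:
        return type(it).__next__(it)
    except StopIteration:
        if default is _MISSING:
            raise
        return default
```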

gvanrossum commented 9 months ago

I am still stuck on this. E.g. how can we identify the code object of the generator while tracing through a for-loop starting with _FOR_ITER?
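For context on why this is hard: the information we need lives on the generator object itself and only exists at run time (plain Python illustration; gi_code is a real attribute of generator objects):

```python
def gen():
    yield 1

g = gen()
print(g.gi_code is gen.__code__)   # True: the code object is on the object...
# ...but while projecting a trace through _FOR_ITER we only see the bytecode,
# not the particular iterator that will be on the stack at run time.
```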

Meanwhile, it looks like _PUSH_FRAME is very similar to what happens at DISPATCH_INLINED() and the start_frame label in ceval.c, with only a few differences:

gvanrossum commented 9 months ago

> E.g. how can we identify the code object of the generator while tracing through a for-loop starting with _FOR_ITER?

A few possible options exist:

There are other complications though.

Another random idea:

Mark and I are hoping to dive more deeply into this at the sprint in Brno.

gvanrossum commented 8 months ago

This is still very tricky; we decided that we have other priorities, e.g. trace stitching, for which we first need to merge the Tier 2 interpreter into the Tier 1 interpreter (https://github.com/faster-cpython/ideas/issues/631).

markshannon commented 8 months ago

The basic problem is that we don't have the runtime information necessary to project the trace until we execute the code. This is a solved problem: Psyco handles it by stopping compilation, running the code, and then resuming optimization with the new information. We can do the same.

The idea is that we exit the trace with a special micro-op that resumes optimization, passing in the actual value seen at runtime.

It is conceptually a bit complex, but not that hard to implement.
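A toy, self-contained sketch of that control flow, with entirely made-up names (the real mechanism would be C code in the Tier 2 optimizer): project uops until runtime information is needed, emit a special exit, and when that exit is reached at run time, resume projection with the value actually seen.

```python
def project(code, start, known=None):
    """Project uops from `code` starting at `start`; stop where runtime
    information we don't yet have (`known`) would be needed."""
    trace = []
    for i, instr in enumerate(code[start:], start):
        if instr == "FOR_ITER_GEN" and known is None:
            trace.append(("_RESUME_PROJECTION", i))  # exit: need the generator
            return trace
        trace.append(("_UOP", instr))
    return trace

def run(code, runtime_values):
    trace = project(code, 0)
    executed = []
    i = 0
    while i < len(trace):
        op, arg = trace[i]
        if op == "_RESUME_PROJECTION":
            # At run time the concrete value (e.g. which generator is on the
            # stack) is now known, so re-enter the optimizer and keep
            # projecting from this instruction.
            trace[i:] = project(code, arg, known=runtime_values[arg])
            continue
        executed.append(arg)
        i += 1
    return executed

code = ["LOAD_FAST", "FOR_ITER_GEN", "YIELD_VALUE", "RETURN_VALUE"]
print(run(code, {1: "the generator seen at run time"}))
# ['LOAD_FAST', 'FOR_ITER_GEN', 'YIELD_VALUE', 'RETURN_VALUE']
```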

This approach also works for calls and returns, so we could throw out the function cache, which doesn't work well for closures anyway.