Fidget-Spinner closed this issue 1 year ago.
I'm not sure if I completely follow. It looks like you are proposing pretty fine-grained uops. Does the tier 2 interpreter have a case (and hence dispatch overhead) for each uop?
It sounds like you'll be doing a lot of work on generate_cases.py. That's fine, and I can help explain mysteries there, but I have to warn that I have plans to continue working on that file as well, so we might end up with merge conflicts. To minimize those you could try to merge certain changes back into the main branch, but that would be a distraction for you and your groupmate.
Does the tier 2 interpreter have a case (and hence dispatch overhead) for each uop?
Yes. The hope is that the optimizations the tier 2 interpreter does on the uops make the dispatch overhead worth it.
Aha, I believe that was also the plan of #454 (or at least the conclusion), and why we got serious about generating the interpreter. I think that it would still be good to make the uops as large as possible, else the interpretation overhead will kill you.
What @gvanrossum says.
You will want something like:
inst(LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK) {
    PyObject *owner = TOP();
    PyTypeObject *tp = Py_TYPE(owner);
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    uint32_t type_version = read_u32(cache->version);
    assert(type_version != 0);
    DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
    assert(tp->tp_dictoffset < 0);
    assert(tp->tp_flags & Py_TPFLAGS_MANAGED_DICT);
}
inst(LOAD_ATTR_INSTANCE_VALUE_REST) {
    PyObject *owner = TOP();
    PyObject *res;
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
    res = _PyDictOrValues_GetValues(dorv)->values[cache->index];
    DEOPT_IF(res == NULL, LOAD_ATTR);
    STAT_INC(LOAD_ATTR, hit);
    Py_INCREF(res);
    SET_TOP(NULL);
    STACK_GROW((oparg & 1));
    SET_TOP(res);
    Py_DECREF(owner);
    JUMPBY(INLINE_CACHE_ENTRIES_LOAD_ATTR);
}
LOAD_ATTR_INSTANCE_VALUE = LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK + LOAD_ATTR_INSTANCE_VALUE_REST;
The reason for this split is that type inference on the trace can potentially eliminate the type check, whereas the checks on the instance always have to be done.
I wouldn't attempt to eliminate the type checks in your project, but you could usefully count them.
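Counting the type checks could be sketched like this; a hypothetical helper (not part of CPython) that assumes uop names follow the `*_TYPE_CHECK` suffix convention used in the example above:

```python
from collections import Counter

def count_type_checks(uop_trace):
    """Tally type-check uops in a trace of uop names.

    A cheap way to measure how often the guard runs, and hence an
    upper bound on what a future type-inference pass could remove.
    The *_TYPE_CHECK naming convention is assumed from the example.
    """
    return Counter(u for u in uop_trace if u.endswith("_TYPE_CHECK"))
```

Running this over recorded traces gives per-uop counts without touching the interpreter itself.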
OK. Should I upstream this to CPython? It seems not very intrusive, and we would need it eventually anyways. What I'm thinking of is:
inst(LOAD_ATTR_METHOD_WITH_VALUES) {
    LOAD_ATTR_METHOD_WITH_VALUES_PRELUDE:
        PyObject *owner = TOP();
        _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    LOAD_ATTR_METHOD_WITH_VALUES_TYPE_CHECK:
        PyTypeObject *tp = Py_TYPE(owner);
        ...
    LOAD_ATTR_METHOD_WITH_VALUES_REST:
        PyObject *res;
        PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
        ...
        PUSH(res);
}
The interpreter generator will then generate a normal instruction for the tier 1 interpreter, while combining PRELUDE + TYPE_CHECK and PRELUDE + REST to form two micro ops.
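The splitting could be prototyped roughly as follows. This is a minimal sketch of the proposed label convention; the section names (PRELUDE/TYPE_CHECK/REST) and the helper itself are assumptions from this proposal, not the actual generate_cases.py logic:

```python
import re

def split_labeled_inst(body: str) -> dict:
    """Split an instruction body into labeled sections.

    Sections are introduced by lines like 'NAME_PRELUDE:' (the
    hypothetical label convention from this proposal).
    """
    sections = {}
    current = None
    for line in body.splitlines():
        stripped = line.strip()
        m = re.fullmatch(r"[A-Z0-9_]+_(PRELUDE|TYPE_CHECK|REST):", stripped)
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections

def make_uops(sections: dict) -> dict:
    """Combine PRELUDE with each other section to form standalone uops,
    while the tier 1 instruction would simply concatenate everything."""
    prelude = sections["PRELUDE"]
    return {
        "TYPE_CHECK": prelude + sections["TYPE_CHECK"],
        "REST": prelude + sections["REST"],
    }
```

Each uop then carries its own copy of the prelude, so both can run standalone in a tier 2 interpreter.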
OK. Should I upstream this to CPython? It seems not very intrusive, and we would need it eventually anyways. What I'm thinking of is: [...]
I'm not sure how you plan to do that. Transferring data from one uop to the next is possible, but not yet very ergonomic. Have you looked through the code generator yet? There used to be an example of how to do this, but Mark got rid of it when he introduced COMPARE_AND_BRANCH. I could update test_generator.py to include an example though.
What I'm thinking of is: [...]
So after reading your comments in gh-101299 now I finally understand that you mean this as a proposal for new syntax in bytecodes.c. It looks like your convention is to have some labels in the code block of an instruction, perhaps using a naming convention, that identify individual micro-ops (uops). Syntactically that is a fine convention.
But it doesn't allow us to reuse uops easily, does it? The current DSL syntax allows you to do things like
op(UOP1, (input -- tmpA)) {
    ... // Common start
}
op(UOP2A, (tmpA -- tmpB)) {
    ... // One way to do this part
}
op(UOP2B, (tmpA -- tmpB)) {
    ... // Another way to do it
}
op(UOP3, (tmpA -- res)) {
    ... // Common end
}
macro(OPA) = UOP1 + UOP2A + UOP3;
macro(OPB) = UOP1 + UOP2B + UOP3;
Here, the uops can only pass information between them using the stack, but the generator avoids actual push/pop calls -- it just generates temporary C variables (see gh-101299) and I assume the C compiler will optimize that a bit more. (You can also specify a type for such temporaries, and the generator understands this.)
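The temporary-variable wiring could be modeled like this; a simplified sketch that allocates a fresh `_tmp_N` slot for every value (the real generator reuses slots and also handles conditional stack effects):

```python
import itertools

def wire_uops(parts):
    """Simulate the evaluation stack across a macro's uops.

    parts: list of (inputs, outputs) tuples of variable names per uop,
    deepest stack value first. Each value is bound to a C temporary
    (_tmp_N) instead of being pushed and popped; only the net stack
    effect would touch the real stack.
    """
    fresh = itertools.count(1)
    stack = []      # temps currently "on the stack", top of stack last
    bindings = []
    for inputs, outputs in parts:
        # inputs not produced by an earlier uop model PEEKs at entry
        while len(stack) < len(inputs):
            stack.insert(0, f"_tmp_{next(fresh)}")
        taken = stack[len(stack) - len(inputs):]
        del stack[len(stack) - len(inputs):]
        out_temps = [f"_tmp_{next(fresh)}" for _ in outputs]
        stack.extend(out_temps)
        bindings.append({**dict(zip(inputs, taken)),
                         **dict(zip(outputs, out_temps))})
    return bindings, stack
```

For the `(left, right -- a, b)` plus `(a, b -- sum)` pair, this binds `right` to `_tmp_1` and `left` to `_tmp_2`, just like the `PEEK(1)`/`PEEK(2)` lines in the generated code shown further down.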
I agree that my approach is more awkward when you have multiple pieces of data that are shared between uops, but your version makes sharing invisible to the generator (which doesn't know, e.g., that owner is shared).
Happy to see you elaborate your proposal. (If we need to parse C more precisely, I have code written for that in reserve. :-)
I see your point about reusing uops. I think the best of both worlds would be some combination of the two ideas. So something like
inst(LOAD_ATTR_METHOD_WITH_VALUES) {
    // Common prelude
    PyObject *owner = TOP();
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    UOP1();
    UOP2();
    PUSH(res);
}
The main problem with having a shared prelude though is that we need to standardise the names of the common variables in all the uops. I don't know how feasible that would be. The other problem is that this would mean UOPs are no longer standalone, and would create a hard dependency between code that is in the prelude, and code that is in the uop. This would hurt the composability of our uops. Unless of course, we specify some sort of prelude dependency for every uop and expose that to the case generator. Then we could feasibly auto-generate these.
That inst(LOAD_ATTR_METHOD_WITH_VALUES) syntax you're using/seeing here is the "legacy" format, without the stack effects of the instruction. We're trying to convert everything to the new format, so we can remove support for the legacy format in the DSL grammar.
LOAD_ATTR in particular requires a bit of new DSL syntax that I'm currently working on, so we can define it as
inst(LOAD_ATTR, (counter/1, unused/8, owner -- thing1 if (oparg & 1), thing2)) {
    ...
    if (oparg & 1) {
        ...
        if (...) {
            thing1 = meth;
            thing2 = owner;
        }
        else {
            thing1 = NULL;
            thing2 = meth;
        }
    }
    else {
        ...
        thing2 = res;
    }
}
That looks awkward because the meth object isn't always in the same place on the stack, although it echoes the complexity of the instruction's semantics (I had to read that code three times before I figured out what it was doing). We need the I/O specification so we can auto-generate the prologue and epilogue, and it also lets the generator do the right thing for macro instructions.
When the instruction definition specifies I/O effects, a prologue (prelude) is auto-generated, which takes care of the pushes at the beginning and (in the epilogue) the pops at the end, including the necessary variable definitions. For examples, just look through generated_cases.c.h for some opcodes that use the new format, e.g. BINARY_SUBSCR.
It would seem attractive to make this less complicated, but I worry about where UOP1() and UOP2() from your example are defined (are they just op(UOP1, (... -- ...)) { ... }?) and I fear that the naming conventions would be difficult to maintain. Maybe we can pass variable names explicitly into the calls, e.g. UOP(a, b, c)?
Yeah UOP1 and UOP2 are just macros. They expand to the following:
(are they just op(UOP1, (... -- ...)) { ... } ?)
Yes. They should indeed be defined like that.
The code generator can almost handle micro-ops now. Hopefully it won't be much work to support them. E.g. I can write BINARY_OP_ADD_FLOAT as this superinstruction:
op(_BINARY_OP_FLOAT_CHECK, (left, right -- a, b)) {
    DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    STAT_INC(BINARY_OP, hit);
}
op(_BINARY_OP_ADD_FLOAT_ACTION, (unused/1, a, b -- sum)) {
    double dsum = ((PyFloatObject *)a)->ob_fval +
                  ((PyFloatObject *)b)->ob_fval;
    sum = PyFloat_FromDouble(dsum);
    _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
    _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
    ERROR_IF(sum == NULL, error);
}
super(BINARY_OP_ADD_FLOAT) = _BINARY_OP_FLOAT_CHECK + _BINARY_OP_ADD_FLOAT_ACTION;
But this generates almost correct code (it has an extra NEXTOPARG(); JUMPBY(1); in the middle that we don't want):
TARGET(BINARY_OP_ADD_FLOAT) {
    PyObject *_tmp_1 = PEEK(1);
    PyObject *_tmp_2 = PEEK(2);
    {
        PyObject *right = _tmp_1;
        PyObject *left = _tmp_2;
        PyObject *a;
        PyObject *b;
        assert(cframe.use_tracing == 0);
        DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
        DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
        STAT_INC(BINARY_OP, hit);
        _tmp_2 = a;
        _tmp_1 = b;
    }
    NEXTOPARG();
    JUMPBY(1);
    {
        PyObject *b = _tmp_1;
        PyObject *a = _tmp_2;
        PyObject *sum;
        assert(cframe.use_tracing == 0);
        double dsum = ((PyFloatObject *)a)->ob_fval +
                      ((PyFloatObject *)b)->ob_fval;
        sum = PyFloat_FromDouble(dsum);
        _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
        _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
        if (sum == NULL) goto pop_2_error;
        _tmp_2 = sum;
    }
    JUMPBY(1);
    STACK_SHRINK(1);
    POKE(1, _tmp_2);
    DISPATCH();
}
To get this working we would need:
inst(BINARY_OP_ADD_FLOAT) = _BINARY_OP_FLOAT_CHECK + _BINARY_OP_ADD_FLOAT_ACTION;
(which would be the same as super but without the NEXTOPARG(); JUMPBY(1); between the parts).
@markshannon
I can write BINARY_OP_ADD_FLOAT as this superinstruction: [...]
Use macro instead of super, then it should work.
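The difference between the two can be sketched as follows: a super instruction glues two complete instructions together, so the generator re-fetches the opcode between the parts, while a macro fuses uops inside one instruction with no fetch in between. This is a hypothetical helper illustrating only that distinction, not the real generate_cases.py code:

```python
def combine(kind, part_bodies):
    """Join uop bodies for a combined instruction.

    'super' inserts the opcode fetch (NEXTOPARG(); JUMPBY(1);) between
    parts, because each part is a full instruction; 'macro' simply
    concatenates the parts (simplified sketch).
    """
    glue = "\n    NEXTOPARG();\n    JUMPBY(1);\n" if kind == "super" else "\n"
    return glue.join(part_bodies)
```

Which is exactly why switching from super to macro removes the unwanted NEXTOPARG(); JUMPBY(1); pair shown in the generated code above.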
@Fidget-Spinner
Yeah UOP1 and UOP2 are just macros. They expand to the following:
1. For the tier 1 interpreter, just normal C code. 2. For tier 2 interpreter, 2 separate instructions. Then the overall macro LOAD_ATTR disappears.
(are they just op(UOP1, (... -- ...)) { ... } ?)
Yes. They should indeed be defined like that.
Hm, that would require the tier 1 generator (the only one we currently have :-) to generate a macro in its output from an op definition, right? So
inst(OP, (a, b -- c, d)) {
spam(a, b);
c = ...;
d = ...;
}
would be translated into
#define OP() { \
    spam(a, b); \
    c = ...; \
    d = ...; \
}
Honestly that feels pretty fragile, though for a quick prototype it should do.
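That fragile prototype approach could be generated mechanically. A hypothetical helper (not part of generate_cases.py) that renders an op body as a C preprocessor macro, relying on the surrounding instruction to have declared the right variable names:

```python
def op_to_define(name, body):
    """Render an op body as a C #define so tier 1 code can splice it
    inline via the preprocessor. Fragile by design: the expanded code
    silently depends on variables (a, b, c, ...) declared at the
    expansion site.
    """
    lines = [ln.strip() for ln in body.strip().splitlines()]
    inner = " \\\n    ".join(lines)
    return f"#define {name}() {{ \\\n    {inner} \\\n}}"
```

The output for the OP example above would be the #define shown, continuation backslashes included.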
Use macro instead of super, then it should work.
That doesn't work either. I created an issue for it https://github.com/python/cpython/issues/101369
OK. I've hacked around and this is what I have: https://github.com/Fidget-Spinner/cpython/commit/5b38288fcd08a8fc5c1a1503577e20d37cfb10f7
Look for the two micro instructions: "BINARY_OP_ADD_INT_TYPE_CHECK" and "BINARY_OP_ADD_INST_REST" (typo).
In the tier 1 interpreter (in Python/generated_cases.c.h) they are just macros. I've already verified that it compiles and can run CPython's test suite. They share a common prelude.
In the tier 2 interpreter (in Python/generated_cases_tier2.c.h) they are actual instructions, and the macro instruction BINARY_OP_ADD_INT disappears. All other unconverted instructions are preserved as a fallback.
At runtime, when we want to switch between the interpreters, we just swap out the _PyEval_EvalFrameDefault function with another function containing the tier 2 interpreter definitions. So everything stays nicely separated.
Hi Ken Jin, I'm curious why you decided to go with new syntax, e.g. macro_inst, instead of using the existing macro syntax? You could just write
macro(BINARY_OP_ADD_INT) = unused/1 + BINARY_OP_ADD_INT_TYPE_CHECK + BINARY_OP_ADD_INST_REST;
and use op instead of u_inst for the micro instruction definitions.
No particular reason. It's not upstream-ready code. Just a quick experiment. I'll update you on what we've done during our meeting later :).
Closing this issue. Will open a superseding one with more concrete plans once my experiment concludes.
Continuing from https://github.com/faster-cpython/ideas/discussions/375.
The granularity of micro ops has not been decided yet, and is being deliberated over at https://github.com/faster-cpython/ideas/issues/454.
Over the next 3 months, we plan to experiment with different trace types (short, long) for the tracing interpreter for micro-ops, and compare them. This came out of some discussion with Mark; we plan to experiment to gain some insights.
My current plan:
Instead of generating uops for the level 1 interpreter to interpret, we could generate uops on the fly and only when we generate the traces. This would save us from the dispatch overhead of uops in the standard interpreter, and would allow the optimisations in the tracing/level 2 interpreter to make up for the dispatch overhead.
With the interpreter being generated, we could automate the uop generation and definition.
E.g. LOAD_ATTR_METHOD_WITH_VALUES. Then for normal ceval.c (the tier 1 interpreter), the uop_inst instructions generate normal C code. A separate uop interpreter for the tier 2 interpreter will be generated from the very same instruction definition. This allows us to not duplicate definitions. We also auto-generate a mapping from normal instructions to uop instructions. At runtime, the tracing interpreter will look up the translation in said mapping and "lower" the normal instructions to uop instructions.
How does this experiment sound?
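The lowering step described in the plan can be sketched as follows; the expansion table and names are assumptions based on this thread, not the actual auto-generated mapping:

```python
# Hypothetical expansion table: one tier-1 instruction name maps to the
# sequence of uop names it lowers to (names follow the LOAD_ATTR
# example earlier in this thread).
UOP_EXPANSION = {
    "LOAD_ATTR_INSTANCE_VALUE": (
        "LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK",
        "LOAD_ATTR_INSTANCE_VALUE_REST",
    ),
}

def lower_trace(trace):
    """Lower a recorded trace of tier-1 instruction names to uops.

    Instructions without an entry in the table are kept unchanged,
    mirroring the 'unconverted instructions as fallback' idea above.
    """
    lowered = []
    for name in trace:
        lowered.extend(UOP_EXPANSION.get(name, (name,)))
    return lowered
```

Because lowering happens only when a trace is recorded, the tier 1 interpreter never pays the per-uop dispatch overhead.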