faster-cpython / ideas


Interpreter for micro-ops #541

Closed. Fidget-Spinner closed this issue 1 year ago.

Fidget-Spinner commented 1 year ago

Continuing from https://github.com/faster-cpython/ideas/discussions/375.

The granularity of micro ops has not been decided yet, and is being deliberated over at https://github.com/faster-cpython/ideas/issues/454.

Over the next 3 months, we plan to experiment with different trace types (short, long) for the tracing interpreter for micro-ops and compare them. This plan came out of some discussion with Mark; the aim is to experiment and gain some insights.

My current plan:

Instead of generating uops for the level 1 interpreter to interpret, we could generate uops on the fly and only when we generate the traces. This would save us from the dispatch overhead of uops in the standard interpreter, and would allow the optimisations in the tracing/level 2 interpreter to make up for the dispatch overhead.

With the interpreter being generated, we could automate the uop generation and definition.

E.g. LOAD_ATTR_METHOD_WITH_VALUES:

inst(LOAD_ATTR_METHOD_WITH_VALUES) {
    uop_inst(GUARD1);
    uop_inst(GUARD2);
    ...
    uop_inst(METHOD_DESCR_FROM_CACHE);
    PUSH(obj);
}

Then for normal ceval.c (the tier 1 interpreter), the uop_inst instructions generate normal C code. A separate uop interpreter for the tier 2 interpreter will be generated from the very same instruction definition. This allows us to avoid duplicating definitions. We also auto-generate a mapping from normal instructions to uop instructions. At runtime, the tracing interpreter will look up the translation in said mapping and "lower" the normal instructions to uop instructions.
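
A rough sketch of what that auto-generated mapping might look like (the table layout and names are hypothetical, and GUARD1/GUARD2/METHOD_DESCR_FROM_CACHE are just the placeholder uops from the example above):

#include <stdint.h>

/* Hypothetical auto-generated lowering table: for each tier 1 opcode,
   the sequence of tier 2 micro-ops it expands to.  Entries with
   n_uops == 0 have no expansion and are copied into the trace as-is. */
typedef struct {
    uint8_t n_uops;
    const uint8_t *uops;
} _PyUOpExpansion;

static const uint8_t _LOAD_ATTR_METHOD_WITH_VALUES_uops[] = {
    GUARD1, GUARD2, METHOD_DESCR_FROM_CACHE,
};

static const _PyUOpExpansion _PyUOp_Expansions[256] = {
    [LOAD_ATTR_METHOD_WITH_VALUES] = {3, _LOAD_ATTR_METHOD_WITH_VALUES_uops},
    /* ... other specialized instructions ... */
};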

How does this experiment sound?

gvanrossum commented 1 year ago

I'm not sure if I completely follow. It looks like you are proposing pretty fine-grained uops. Does the tier 2 interpreter have a case (and hence dispatch overhead) for each uop?

It sounds like you'll be doing a lot of work on generate_cases.py. That's fine, and I can help explain mysteries there, but I have to warn that I have plans to continue working on that file as well, so we might end up with merge conflicts. To minimize those you could try to merge certain changes back into the main branch, but that would be a distraction for you and your groupmate.

Fidget-Spinner commented 1 year ago

Does the tier 2 interpreter have a case (and hence dispatch overhead) for each uop?

Yes. The hope is that the optimizations the tier 2 interpreter does on the uops make the dispatch overhead worth it.

gvanrossum commented 1 year ago

Aha, I believe that was also the plan of #454 (or at least the conclusion), and why we got serious about generating the interpreter. I think that it would still be good to make the uops as large as possible, else the interpretation overhead will kill you.

markshannon commented 1 year ago

What @gvanrossum says.

You will want something like:


inst(LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK) {
    PyObject *owner = TOP();
    PyTypeObject *tp = Py_TYPE(owner);
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    uint32_t type_version = read_u32(cache->version);
    assert(type_version != 0);
    DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
    assert(tp->tp_dictoffset < 0);
    assert(tp->tp_flags & Py_TPFLAGS_MANAGED_DICT);
}

inst(LOAD_ATTR_INSTANCE_VALUE_REST) {
    PyObject *owner = TOP();
    PyObject *res;
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
    res = _PyDictOrValues_GetValues(dorv)->values[cache->index];
    DEOPT_IF(res == NULL, LOAD_ATTR);
    STAT_INC(LOAD_ATTR, hit);
    Py_INCREF(res);
    SET_TOP(NULL);
    STACK_GROW((oparg & 1));
    SET_TOP(res);
    Py_DECREF(owner);
    JUMPBY(INLINE_CACHE_ENTRIES_LOAD_ATTR);
}

LOAD_ATTR_INSTANCE_VALUE = LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK + LOAD_ATTR_INSTANCE_VALUE_REST;

The reason for this split is that type inference on the trace can potentially eliminate the type check, but the checks on the instance still have to be done.

I wouldn't attempt to eliminate the type checks in your project, but you could usefully count them.
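
For the counting, a minimal sketch (the type_checks counter is hypothetical and would need a new pystats field; STAT_INC() is the existing mechanism):

inst(LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK) {
    ...
    STAT_INC(LOAD_ATTR, type_checks);  // hypothetical counter, bumped per executed check
    DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
    ...
}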

Fidget-Spinner commented 1 year ago

OK. Should I upstream this to CPython? It seems not very intrusive, and we would need it eventually anyway. What I'm thinking of is:

inst(LOAD_ATTR_METHOD_WITH_VALUES) {

LOAD_ATTR_METHOD_WITH_VALUES_PRELUDE:
    PyObject *owner = TOP();
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;

LOAD_ATTR_METHOD_WITH_VALUES_TYPE_CHECK:
    PyTypeObject *tp = Py_TYPE(owner);
    ...

LOAD_ATTR_METHOD_WITH_VALUES_REST:
    PyObject *res;
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
    ...
    PUSH(res);
}

The interpreter generator will then generate a normal instruction for the tier 1 interpreter, while combining PRELUDE + TYPE_CHECK and PRELUDE + REST to form two micro ops.
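
For concreteness, the tier 2 output might look roughly like the following (the TARGET spelling and a separate generated_cases_tier2.c.h file are assumptions; bodies elided as above):

// PRELUDE + TYPE_CHECK becomes one micro-op...
TARGET(LOAD_ATTR_METHOD_WITH_VALUES_TYPE_CHECK) {
    PyObject *owner = TOP();
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    PyTypeObject *tp = Py_TYPE(owner);
    ...
    DISPATCH();
}

// ...and PRELUDE + REST becomes another, with the prelude duplicated
// so that each micro-op stands alone.
TARGET(LOAD_ATTR_METHOD_WITH_VALUES_REST) {
    PyObject *owner = TOP();
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    PyObject *res;
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
    ...
    PUSH(res);
    DISPATCH();
}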

gvanrossum commented 1 year ago

OK. Should I upstream this to CPython? It seems not very intrusive, and we would need it eventually anyway. What I'm thinking of is: [...]

I'm not sure how you plan to do that. Transferring data from one uop to the next is possible, but not yet very ergonomic. Have you looked through the code generator yet? There used to be an example of how to do this, but Mark got rid of it when he introduced COMPARE_AND_BRANCH. I could update test_generator.py to include an example though.

gvanrossum commented 1 year ago

Have a look at this: https://github.com/gvanrossum/cpython/commit/b60761d0a9ebce0a795eacfae862ab6daff4a88b

gvanrossum commented 1 year ago

What I'm thinking of is: [...]

So after reading your comments in gh-101299, I now finally understand that you mean this as a proposal for new syntax in bytecodes.c. It looks like your convention is to have some labels in the code block of an instruction, perhaps using a naming convention, that identify individual micro-ops (uops). Syntactically that is a fine convention.

But it doesn't allow us to reuse uops easily, does it? The current DSL syntax allows you to do things like

op(UOP1, (input -- tmpA)) {
    ...  // Common start
}
op(UOP2A, (tmpA -- tmpB)) {
    ... // One way to do this part
}
op(UOP2B, (tmpA -- tmpB)) {
    ...  // Another way to do it
}
op(UOP3, (tmpA -- res)) {
    ...  // Common end
}
macro(OPA) = UOP1 + UOP2A + UOP3;
macro(OPB) = UOP1 + UOP2B + UOP3;

Here, the uops can only pass information between them using the stack, but the generator avoids actual push/pop calls -- it just generates temporary C variables (see gh-101299) and I assume the C compiler will optimize that a bit more. (You can also specify a type for such temporaries, and the generator understands this.)

I agree that my approach is more awkward when you have multiple pieces of data that are shared between uops, but your version makes the sharing invisible to the generator (which doesn't know, e.g., that owner is shared).

Happy to see you elaborate your proposal. (If we need to parse C more precisely, I have code written for that in reserve. :-)

Fidget-Spinner commented 1 year ago

I see your point about reusing uops. I think the best of both worlds would be some combination of the two ideas. So something like

inst(LOAD_ATTR_METHOD_WITH_VALUES) {
    // Common prelude
    PyObject *owner = TOP();
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    UOP1();
    UOP2();
    PUSH(res)
}

The main problem with having a shared prelude, though, is that we need to standardise the names of the common variables across all the uops. I don't know how feasible that would be. The other problem is that uops would no longer be standalone: there would be a hard dependency between code in the prelude and code in the uop, which would hurt the composability of our uops. Unless, of course, we specify some sort of prelude dependency for every uop and expose it to the case generator. Then we could feasibly auto-generate these.
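
For illustration only, such a declared dependency might look like this (invented syntax, not part of the current DSL):

// Hypothetical DSL: the shared prelude is defined once, and each uop
// declares which prelude it needs so the generator can inline it.
prelude(LOAD_ATTR_PRELUDE) {
    PyObject *owner = TOP();
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
}

op(LOAD_ATTR_METHOD_WITH_VALUES_TYPE_CHECK, needs=LOAD_ATTR_PRELUDE) {
    ...
}

op(LOAD_ATTR_METHOD_WITH_VALUES_REST, needs=LOAD_ATTR_PRELUDE) {
    ...
    PUSH(res);
}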

gvanrossum commented 1 year ago

That inst(LOAD_ATTR_METHOD_WITH_VALUES) syntax you're using/seeing here is the "legacy" format, without the stack effects of the instruction. We're trying to convert everything to the new format, so we can remove support for the legacy format in the DSL grammar.

LOAD_ATTR in particular requires a bit of new DSL syntax that I'm currently working on, so we can define it as

inst(LOAD_ATTR, (counter/1, unused/8, owner -- thing1 if (oparg & 1), thing2)) {
    ...
    if (oparg & 1) {
        ...
        if (...) {
            thing1 = meth;
            thing2 = owner;
        }
        else {
            thing1 = NULL;
            thing2 = meth;
        }
    }
    else {
        ...
        thing2 = res;
    }
}

That looks awkward because the meth object isn't always in the same place on the stack, although it echoes the complexity of the instruction's semantics (I had to read that code three times before I figured out what it was doing). We need the I/O specification so we can auto-generate the prologue and epilogue, and it also lets the generator do the right thing for macro instructions.

When the instruction definition specifies I/O effects, a prologue (prelude) is auto-generated that reads the inputs off the stack at the beginning, and an epilogue adjusts the stack and writes the outputs at the end, with the necessary variable definitions generated along the way. For examples, just look through generated_cases.c.h for some opcodes that use the new format, e.g. BINARY_SUBSCR.
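
Roughly, the generated case for such an instruction looks like this (simplified sketch; the specialization-counter code is omitted):

TARGET(BINARY_SUBSCR) {
    // Prologue (auto-generated): name the stack inputs, declare the output.
    PyObject *sub = PEEK(1);
    PyObject *container = PEEK(2);
    PyObject *res;
    // Body (from bytecodes.c).
    res = PyObject_GetItem(container, sub);
    Py_DECREF(container);
    Py_DECREF(sub);
    if (res == NULL) goto pop_2_error;
    // Epilogue (auto-generated): skip the 1-entry inline cache and
    // replace the two inputs with the single output.
    JUMPBY(1);
    STACK_SHRINK(1);
    POKE(1, res);
    DISPATCH();
}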

It would seem attractive to make this less complicated, but I worry about where UOP1() and UOP2() from your example are defined (are they just op(UOP1, (... -- ...)) { ... } ?) and I fear that the naming conventions would be difficult to maintain. Maybe we can pass variable names explicitly into the calls, e.g. UOP(a, b, c)?
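
For instance, with explicit parameters the tier 1 expansion of a type-check uop might look like this (a sketch; the macro name and parameters are chosen for illustration, and the body just reuses the guard code from Mark's example above):

#define LOAD_ATTR_TYPE_CHECK_UOP(owner, cache)                        \
    do {                                                              \
        PyTypeObject *tp = Py_TYPE(owner);                            \
        uint32_t type_version = read_u32((cache)->version);           \
        assert(type_version != 0);                                    \
        DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);      \
    } while (0)

The instruction body would then call LOAD_ATTR_TYPE_CHECK_UOP(owner, cache) after its prelude, so the shared names are visible at the call site rather than implied by convention.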

Fidget-Spinner commented 1 year ago

Yeah UOP1 and UOP2 are just macros. They expand to the following:

  1. For the tier 1 interpreter, just normal C code.
  2. For the tier 2 interpreter, 2 separate instructions. Then the overall macro LOAD_ATTR disappears.

(are they just op(UOP1, (... -- ...)) { ... } ?)

Yes. They should indeed be defined like that.

markshannon commented 1 year ago

The code generator can almost handle micro-ops now. Hopefully it won't be much work to support them. E.g. I can write BINARY_OP_ADD_FLOAT as this superinstruction:

        op(_BINARY_OP_FLOAT_CHECK, (left, right -- a, b)) {
            DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            STAT_INC(BINARY_OP, hit);
        }

        op(_BINARY_OP_ADD_FLOAT_ACTION, (unused/1, a, b -- sum)) {
            double dsum = ((PyFloatObject *)a)->ob_fval +
                ((PyFloatObject *)b)->ob_fval;
            sum = PyFloat_FromDouble(dsum);
            _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
            _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
            ERROR_IF(sum == NULL, error);
        }

        super(BINARY_OP_ADD_FLOAT) = _BINARY_OP_FLOAT_CHECK + _BINARY_OP_ADD_FLOAT_ACTION;

But this generates almost correct code (it has an extra NEXTOPARG(); JUMPBY(1); in the middle that we don't want):

      TARGET(BINARY_OP_ADD_FLOAT) {
            PyObject *_tmp_1 = PEEK(1);
            PyObject *_tmp_2 = PEEK(2);
            {
                PyObject *right = _tmp_1;
                PyObject *left = _tmp_2;
                PyObject *a;
                PyObject *b;
                assert(cframe.use_tracing == 0);
                DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
                DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
                STAT_INC(BINARY_OP, hit);
                _tmp_2 = a;
                _tmp_1 = b;
            }
            NEXTOPARG();
            JUMPBY(1);
            {
                PyObject *b = _tmp_1;
                PyObject *a = _tmp_2;
                PyObject *sum;
                assert(cframe.use_tracing == 0);
                double dsum = ((PyFloatObject *)a)->ob_fval +
                    ((PyFloatObject *)b)->ob_fval;
                sum = PyFloat_FromDouble(dsum);
                _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
                _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
                if (sum == NULL) goto pop_2_error;
                _tmp_2 = sum;
            }
            JUMPBY(1);
            STACK_SHRINK(1);
            POKE(1, _tmp_2);
            DISPATCH();
        }

To get this working we would need:

gvanrossum commented 1 year ago

@markshannon

I can write BINARY_OP_ADD_FLOAT as this superinstruction: [...]

Use macro instead of super, then it should work.
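
That is, keeping the same op definitions:

macro(BINARY_OP_ADD_FLOAT) = _BINARY_OP_FLOAT_CHECK + _BINARY_OP_ADD_FLOAT_ACTION;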

gvanrossum commented 1 year ago

@Fidget-Spinner

Yeah UOP1 and UOP2 are just macros. They expand to the following:

1. For the tier 1 interpreter, just normal C code.

2. For the tier 2 interpreter, 2 separate instructions. Then the overall macro LOAD_ATTR disappears.

(are they just op(UOP1, (... -- ...)) { ... } ?)

Yes. They should indeed be defined like that.

Hm, that would require the tier 1 generator (the only one we currently have :-) to generate a macro in its output from an op definition, right? So

inst(OP, (a, b -- c, d)) {
    spam(a, b);
    c = ...;
    d = ...;
}

would be translated into

#define OP() { \
    spam(a, b); \
    c = ...; \
    d = ...; \
}

Honestly that feels pretty fragile, though for a quick prototype it should do.

markshannon commented 1 year ago

Use macro instead of super, then it should work.

That doesn't work either. I created an issue for it https://github.com/python/cpython/issues/101369

Fidget-Spinner commented 1 year ago

OK. I've hacked around and this is what I have: https://github.com/Fidget-Spinner/cpython/commit/5b38288fcd08a8fc5c1a1503577e20d37cfb10f7

Look for the two micro instructions: "BINARY_OP_ADD_INT_TYPE_CHECK" and "BINARY_OP_ADD_INST_REST" (typo). In the tier 1 interpreter (in Python/generated_cases.c.h) they are just macros. I've already verified that it compiles and can run CPython's test suite. They share a common prelude.

In the tier 2 interpreter (in Python/generated_cases_tier2.c.h) they are actual instructions, and the macro instruction BINARY_OP_ADD_INT disappears. All other unconverted instructions are preserved as a fallback.

At runtime, when we want to switch between the interpreters, we just swap out the _PyEval_EvalFrameDefault function with another function with the tier 2 interpreter definitions. So everything stays nicely separated.
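
A minimal sketch of how that swap could be wired up through the PEP 523 frame-evaluation hook (the tier 2 entry-point name is hypothetical; the hook itself is existing CPython API):

/* Hypothetical evaluator built from Python/generated_cases_tier2.c.h,
   with the same signature as _PyEval_EvalFrameDefault. */
extern PyObject *_PyEval_EvalFrameDefault_Tier2(
    PyThreadState *tstate, struct _PyInterpreterFrame *frame, int throwflag);

static void
enable_tier2(PyInterpreterState *interp)
{
    /* PEP 523: replace the frame evaluation function for this interpreter. */
    _PyInterpreterState_SetEvalFrameFunc(interp, _PyEval_EvalFrameDefault_Tier2);
}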

gvanrossum commented 1 year ago

Hi Ken Jin, I'm curious why you decided to go with new syntax (e.g. macro_inst) instead of using the existing macro syntax? You could just write

macro(BINARY_OP_ADD_INT) = unused/1 + BINARY_OP_ADD_INT_TYPE_CHECK + BINARY_OP_ADD_INST_REST;

and use op instead of u_inst for the micro instruction definitions.

Fidget-Spinner commented 1 year ago

No particular reason. It's not upstream-ready code. Just a quick experiment. I'll update you on what we've done during our meeting later :).

Fidget-Spinner commented 1 year ago

Closing this issue. Will open a superseding one with more concrete plans once my experiment concludes.