faster-cpython / ideas


WIP: Extend the interpreter DSL to assist in optimization passes and even codegen #611

Open · Fidget-Spinner opened 11 months ago

Fidget-Spinner commented 11 months ago

The current rough proposal for optimization passes is to use abstract interpretation on tier 2 uops. This makes the interpreter DSL a natural way of describing the abstract interpretation. We could extend the interpreter DSL to do two additional things:

  1. Feed information to our optimization passes. This reduces maintenance burden for the analysis phase.
  2. Help the optimization passes make sound decisions. This reduces maintenance burden for the codegen phase.

1. Here's the _BINARY_OP_MULTIPLY_INT uop, annotated with type information for its output:

        op(_BINARY_OP_MULTIPLY_INT, (unused/1, left, right -- res: &PyLong_Type)) {
            STAT_INC(BINARY_OP, hit);
            res = _PyLong_Multiply((PyLongObject *)left, (PyLongObject *)right);
            _Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
            _Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
            ERROR_IF(res == NULL, error);
        }

This could generate the following target for type propagation:

        TARGET(_BINARY_OP_MULTIPLY_INT) {
            _Py_TYPENODE_t *left = TYPESTACK_PEEK(2); // int
            STACK_SHRINK(1);
            TYPE_SET(left, TYPESTACK_PEEK(1), false); // int
            break;
        }

We plan to experiment further with this to see whether we can also feed in the information required for partial evaluation. For example:

        // @pure
        op(_BINARY_OP_MULTIPLY_INT, (unused/1, left, right -- res: &PyLong_Type)) {
            ...
        }

This would mark the operation as "pure": it does not affect the static/dynamic split during partial evaluation.
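
As a rough illustration of what the flag buys us (all helper names below are hypothetical), a partial evaluator could treat a pure uop whose inputs are all static as evaluable at optimization time:

        /* Hypothetical partial-evaluation step for one uop. */
        if (uop_is_pure(opcode) && all_inputs_static(ctx, opcode)) {
            /* Safe to evaluate now: the result stays virtual and
               nothing is emitted into the residual trace. */
            evaluate_statically(ctx, opcode, oparg);
        }
        else {
            /* Emit code to construct any still-virtual inputs,
               then emit the uop itself. */
            materialize_inputs(ctx, opcode);
            emit_uop(trace, opcode, oparg);
        }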

2. The conversion from BINARY_OP to _BINARY_OP_MULTIPLY_INT is fairly simple. The predicate is that oparg=NB_MULTIPLY and type(left)=type(right)=int. Here is how we could annotate that in the DSL:

        // @autocodegen
        // @codegen_predicate(oparg=NB_MULTIPLY)
        // @pure
        op(_BINARY_OP_MULTIPLY_INT, (unused/1, left: &PyLong_Type, right: &PyLong_Type -- res: &PyLong_Type)) {
        }

Note that we have also annotated the input types. This tells the type propagator what types we expect the stack variables to be. Note that the code generator already knows that BINARY_OP maps to _BINARY_OP_MULTIPLY_INT thanks to the family tag in the DSL!
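
For illustration, the generated pass could check the predicate against the propagated types along these lines (a sketch with made-up helpers):

        /* Hypothetical generated rewrite: when the annotated predicate
           holds on the type stack, narrow the generic instruction to
           the specialized uop. */
        case BINARY_OP:
            if (oparg == NB_MULTIPLY &&
                TYPESTACK_IS(2, &PyLong_Type) &&    /* left */
                TYPESTACK_IS(1, &PyLong_Type)) {    /* right */
                REPLACE_UOP(_BINARY_OP_MULTIPLY_INT);
            }
            break;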

Fidget-Spinner commented 11 months ago

Relevant paper, since we are partially evaluating CPython bytecode, which is closer to machine code than to an AST/CFG:

https://research.cs.wisc.edu/wpis/papers/oopsla15.pdf

(this is for further optimizations later)

iritkatriel commented 11 months ago

Could this extend to the inputs? So if you write

op(_BINARY_OP_MULTIPLY_INT, (unused/1, left: &PyLong_Type, right: &PyLong_Type -- res: &PyLong_Type))

then the generator emits type guards?

Fidget-Spinner commented 11 months ago

> then the generator emits type guards?

Yep, that's the idea. You can see the finite state machine we used in our previous project here: https://github.com/pylbbv/pylbbv/blob/pylbbv/Python/tier2.c#L1702. That's roughly what we want to generate, without handwriting it, of course.
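
For the integer example above, the generated guard could look something like this in the DSL (a sketch; the uop name and deopt target are assumptions):

        op(_GUARD_BOTH_INT, (left, right -- left, right)) {
            DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
            DEOPT_IF(!PyLong_CheckExact(right), BINARY_OP);
        }

With the type propagator running first, the guard only needs to be emitted when the operand types are not already known to be int.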

gvanrossum commented 11 months ago

FWIW, I'd rather not start using a syntax based on // @... comments. If we want these things in the DSL, let's modify the DSL parser to recognize them as proper syntax. Maybe this?

        op(_BINARY_OP_MULTIPLY_INT,
           (unused/1, left: &PyLong_Type, right: &PyLong_Type -- res: &PyLong_Type),
           flags=[autocodegen, codegen_predicate(oparg=NB_MULTIPLY), pure])
        {
        }

Fidget-Spinner commented 11 months ago

That LGTM too. In the parser implementation, I would probably add support for trailing commas, so we can have nice PEP8-style lists.

        op(_BINARY_OP_MULTIPLY_INT,
           (unused/1, left: &PyLong_Type, right: &PyLong_Type -- res: &PyLong_Type),
               flags=[
                   autocodegen,
                   codegen_predicate(oparg=NB_MULTIPLY),
                   pure,
               ]
           )
        {
        }

It also makes the attributes easier to read.

Fidget-Spinner commented 11 months ago

Also up for discussion: how to annotate the partition effects. Right now we have two effects: OVERWRITE (set a single node) and SET (set the whole partition's value). In our tier 2 interpreter report, this is the EBNF we proposed:

        local_effect := "locals" "[" expr "]" "=" local_source
        local_source := localsrc_literal | localsrc_stackinput
        localsrc_literal := identifier
        localsrc_stackinput := "*" identifier

        type_annotation := ("{" type_operation ("," type_operation)... "}")
                         | type_operation
        type_operation := type_set | type_overwrite
        type_set := "<<=" type_source
        type_overwrite := type_source

        type_source := typesrc_literal
                     | typesrc_locals
                     | typesrc_const
                     | typesrc_stackinput
        typesrc_literal := identifier
        typesrc_locals := "locals" "[" expr "]"
        typesrc_const := "consts" "[" expr "]"
        typesrc_stackinput := "*" identifier

Here are some sample instructions from our project:

        // Node source. Takes inspiration from dereference in C.
        inst(STORE_FAST, (value --), locals[oparg] = *value) {
        }

        // Single partition operation.
        inst(CHECK_INT, (maybe_int, unused[oparg] -- maybe_int: <<= PyLong_Type, unused[oparg])) {
        }

        // An example of multiple partition operations in a single instruction.
        // Checks float and also unboxes it onto the stack.
        inst(CHECK_FLOAT, (maybe_float, unused[oparg] -- unboxed_float: {<<= PyFloat_Type, PyRawFloat_Type}, unused[oparg])) {
        }

I was wondering how you feel about the partition operations being described with *, <<= and {}?

gvanrossum commented 11 months ago

I'm afraid I didn't read your report in enough detail, I have no idea what partitions are or what they are used for! Sorry. :-(

In the first example, why is the * needed? If we think of locals[oparg] = *value as C code, the * feels wrong, since locals looks to me like an array of PyObject* and value is a single PyObject*. So in C I'd just write locals[oparg] = value. I guess this notation is meant to describe the semantics, at least for simple opcodes, instead of writing actual C code?

What is the <<= operator supposed to do? I understand it distinguishes between two different ways a type can be applied to an I/O effect, but I don't understand the difference yet. What mathematical operation is it trying to resemble?

I guess {} are just used for grouping multiple types -- are these unions or intersections? It seems a reasonable grouping character. I presume {foo} and foo mean the same thing -- why not () instead of {}?

Fidget-Spinner commented 11 months ago

> I'm afraid I didn't read your report in enough detail, I have no idea what partitions are or what they are used for! Sorry. :-(

No worries! It was admittedly quite long. Just a quick recap: our analysis algorithm partitions locals and stack entries into subsets that represent their relationships. Each partition is represented as a tree, where each stack/local entry is a node and the root node holds the overall type/constant value of the entire partition.
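
To make that concrete, one possible representation (a sketch, not necessarily our exact layout) tags each stack/locals slot as either a root carrying the partition's type or a reference pointing toward another node in the same tree:

        #include "Python.h"
        #include <stdint.h>

        /* Hypothetical tagged-word node. */
        typedef uintptr_t _Py_TYPENODE_t;
        #define TYPE_ROOT(type_ptr) ((_Py_TYPENODE_t)(type_ptr))      /* payload: the type */
        #define TYPE_REF(node_ptr)  ((_Py_TYPENODE_t)(node_ptr) | 1)  /* low bit: parent link */

        /* A partition's type lives at its root: follow reference
           links until we reach a root node. */
        static PyTypeObject *
        partition_type(_Py_TYPENODE_t *node)
        {
            while (*node & 1) {
                node = (_Py_TYPENODE_t *)(*node & ~(uintptr_t)1);
            }
            return (PyTypeObject *)*node;
        }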

> In the first example, why is the * needed? If we think of locals[oparg] = *value as C code, the * feels wrong, since locals looks to me like an array of PyObject* and value is a single PyObject*. So in C I'd just write locals[oparg] = value. I guess this notation is meant to describe the semantics, at least for simple opcodes, instead of writing actual C code?

Yes, it's meant to describe that "the node at locals[oparg] is overwritten with/assigned the value of the node value".

> What is the <<= operator supposed to do? I understand it distinguishes between two different ways a type can be applied to an I/O effect, but I don't understand the difference yet. What mathematical operation is it trying to resemble?

It is the SET operation, which sets the entire partition's type by updating the root node. In the CHECK_INT case above, the partition that the stack variable belongs to has its type set to PyLong_Type. The default operation for a normal type annotation is OVERWRITE, so we need distinct syntax for SET, since the current DSL already uses plain type annotations for casting.
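
In the tree picture, the two operations could differ roughly like this (same hypothetical representation as the sketch above):

        /* OVERWRITE: reassign just this node, detaching it from its
           old partition. */
        static void
        type_overwrite(_Py_TYPENODE_t *node, _Py_TYPENODE_t value)
        {
            *node = value;
        }

        /* SET: walk up to the partition's root and change the type
           there, so every node in the partition sees the new type. */
        static void
        type_set(_Py_TYPENODE_t *node, PyTypeObject *type)
        {
            while (*node & 1) {
                node = (_Py_TYPENODE_t *)(*node & ~(uintptr_t)1);
            }
            *node = TYPE_ROOT(type);
        }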

> I guess {} are just used for grouping multiple types -- are these unions or intersections? It seems a reasonable grouping character. I presume {foo} and foo mean the same thing -- why not () instead of {}?

They are used to indicate multiple type operations. Indeed, I also considered plain parentheses, and I don't mind either.

gvanrossum commented 10 months ago

I still owe you a reading of your report and the Oopsla paper you linked to. If this ever becomes urgent please ping me and I'll drop other stuff.

Fidget-Spinner commented 10 months ago

Thanks Guido. You don't have to read the report. I will create some slides to explain the algorithm and also the DSL changes to save everyone's time.

gvanrossum commented 10 months ago

Having worked through your preso I think I understand a bit better, but it would be useful to work through some examples, starting with the Python code. For my own education, let me try one for each operation.

Let's start with

        x = 1
        y = x

In Tier 1 bytecode, that's

        LOAD_CONST (1)
        STORE_FAST (x)
        LOAD_FAST (x)
        STORE_FAST (y)

In Tier 2 it's the same, except for added SAVE_IP uops, which we'll ignore, and there's an EXIT_TRACE at the end.

Walking through these one at a time:

LOAD_CONST (1) Creates a "root node", which I understand to be a place to hold a value (here, the integer 1). Then it makes stack[0] point to that node. (Are there non-root nodes? Somehow I only recall reading about root node objects.)

STORE_FAST (x) Makes the variable x point to that same root node. (As an intermediate step it seems it is made to point to stack[0] but since that is immediately popped I'm not sure what the point is.)

LOAD_FAST (x) Makes stack[0] point to x (which points to the root node 1).

STORE_FAST (y) Makes the variable y point to stack[0], and then (as the latter is popped) to x instead. So now we have a chain y -> x -> root(1).
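
Summarizing the chain after each instruction:

        LOAD_CONST (1)    stack[0] -> root(1)
        STORE_FAST (x)    x -> root(1)               (stack slot popped)
        LOAD_FAST (x)     stack[0] -> x -> root(1)
        STORE_FAST (y)    y -> x -> root(1)          (stack slot popped)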

I'm not sure at what point the concrete uops are emitted from this. Possibly for each STORE_FAST, since that has an observable non-stack effect? Or when we encounter the EXIT_TRACE uop?

Anyway, in this example it seems everything is considered static, since everything ends up pointing to that root node. In the next example I try to do something more interesting:

        global a
        x = a
        y = 1
        z = x + y

Now x is dynamic and y is static, so when we get to x + y we have x on the "real" stack already but y is still virtual, and now we have to emit instructions to construct y (in this case LOAD_CONST (1); STORE_FAST (y)).

(Hm, this is getting long, and I haven't even gotten to the point where I understand why you are adding markup for this to the DSL. I will continue later.)

Fidget-Spinner commented 10 months ago

> Having worked through your preso I think I understand a bit better, but it would be useful to work through some examples, starting with the Python code. For my own education, let me try one for each operation.

Oh no! That might undergo major revisions soon (mainly, adding symbolic information). Please hold off from spending too much time on it. Sorry if I wasted your time!

> LOAD_CONST (1) Creates a "root node", which I understand to be a place to hold a value (here, the integer 1). Then it makes stack[0] point to that node. (Are there non-root nodes? Somehow I only recall reading about root node objects.)

The rest are reference nodes, which are just the pointers on the stack/locals.

> STORE_FAST (x) Makes the variable x point to that same root node. (As an intermediate step it seems it is made to point to stack[0] but since that is immediately popped I'm not sure what the point is.)

That just shows how we wrote the OVERWRITE operation internally.

> I'm not sure at what point the concrete uops are emitted from this. Possibly for each STORE_FAST, since that has an observable non-stack effect? Or when we encounter the EXIT_TRACE uop?

STORE_FAST is always emitted. The reason is that we are tracing, not compiling method-at-a-time, so there might be external uses of the variable outside the trace.

> Now x is dynamic and y is static, so when we get to x + y we have x on the "real" stack already but y is still virtual, and now we have to emit instructions to construct y (in this case LOAD_CONST (1); STORE_FAST (y)).

Indeed, and also CHECK_INT and BINARY_OP_ADD_INT.
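
Putting that together, the residual code around the addition might look roughly like this (my sketch of the ordering; exact uop names aside):

        LOAD_CONST (1)       // y = 1: STORE_FAST is always emitted
        STORE_FAST (y)
        LOAD_FAST (x)        // x is dynamic, comes from the global load
        LOAD_CONST (1)       // materialize the still-virtual y
        CHECK_INT            // guard the dynamic operand
        BINARY_OP_ADD_INT
        STORE_FAST (z)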

gvanrossum commented 10 months ago

Thanks, I won't spend more time on it, but your answers are helpful, and I enjoy learning about this stuff.