faster-cpython / ideas


Avoid marshal for creating code objects from serialized data. #566

Open markshannon opened 1 year ago

markshannon commented 1 year ago

Once the work to allow any object as the "code" of a frame is done, we can take advantage of that to speed up creation of code objects from serialized data.

The idea is that the serialized data will consist of two parts:

  1. A sequence of immutable bytecode
  2. Supporting binary data.

Creation of the top-level (module) code object would be done as follows:

  1. Create a "module initializer" object, consisting of a pointer to the binary data and debug info like the name and filename.
  2. Create a frame, setting the "code" field to the module initializer and setting the instruction pointer to the start of the bytecode.
  3. Start executing in the interpreter.
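The three steps above can be modelled in a few lines of Python. This is only a rough sketch: `ModuleInitializer`, `Frame`, and `start` are hypothetical names chosen for illustration, and the real objects would be C structs inside the interpreter rather than dataclasses.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleInitializer:
    """Hypothetical stand-in for the proposed "module initializer" object."""
    name: str             # debug info
    filename: str         # debug info
    data: bytes           # the supporting binary data
    data_offset: int = 0  # models the pointer into the binary data


@dataclass
class Frame:
    """Hypothetical, heavily simplified frame."""
    code: object          # any object may act as the frame's "code"
    instructions: tuple   # the immutable bytecode, modelled as (opname, arg) pairs
    pc: int = 0           # index of the next instruction to execute
    stack: list = field(default_factory=list)


def start(instructions, binary_data, name, filename):
    # Step 1: create the module initializer.
    init = ModuleInitializer(name, filename, binary_data)
    # Step 2: create a frame whose "code" is the initializer and whose
    # instruction pointer is at the first instruction.
    # Step 3 (running the interpreter loop on the frame) is not modelled here.
    return Frame(code=init, instructions=tuple(instructions))


frame = start([("LOAD_INT", 1), ("RETURN_VALUE", None)], b"", "mod", "mod.py")
```

Note that nothing is unmarshalled before execution begins in this model; the frame only holds references to the serialized parts.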

What are the advantages of this?

Creating the instruction sequence

We can create the instruction sequence in much the same way as marshal serializes: recursively emitting code for sub-objects until the entire object is complete.
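As a rough illustration of the recursive emission, the sketch below walks an object in post-order and appends one instruction per node. The function name `emit` and the `(opname, arg)` representation are assumptions made for this example, and it uses `MAKE_STRING` for every string rather than choosing between `MAKE_STRING` and `LOAD_COMMON_NAME`.

```python
def emit(obj, out):
    """Append instructions that would rebuild `obj` to the list `out`."""
    if type(obj) is tuple:
        # Emit code for each sub-object first, then the instruction
        # that collects them from the stack.
        for item in obj:
            emit(item, out)
        out.append(("BUILD_TUPLE", len(obj)))
    elif type(obj) is int:
        out.append(("LOAD_INT", obj))
    elif type(obj) is float:
        out.append(("MAKE_FLOAT", obj))
    elif type(obj) is str:
        out.append(("MAKE_STRING", obj))
    else:
        raise TypeError(f"unsupported type in this sketch: {type(obj).__name__}")
    return out


program = emit((1, "a", 37.0, (2, "foo")), [])
```

Because sub-objects are emitted before their container, the resulting sequence can be executed by a plain stack machine with no lookahead.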

To do this we will need some new instructions and a few new intrinsics.

New general purpose instructions:

Instructions to create objects from binary data.

These instructions will create an object from the binary data, advancing the pointer.
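A minimal sketch of "create an object from the binary data, advancing the pointer", using Python's `struct` module. The `DataReader` class and the byte layout (little-endian 8-byte floats, strings as a 4-byte length followed by UTF-8 bytes) are assumptions made for this example; the issue does not specify an encoding.

```python
import struct


class DataReader:
    """Hypothetical cursor over the supporting binary data."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # models the pointer that each instruction advances

    def read_float(self):
        # Assumed layout: little-endian IEEE 754 double.
        (value,) = struct.unpack_from("<d", self.data, self.pos)
        self.pos += 8
        return value

    def read_string(self):
        # Assumed layout: 4-byte little-endian length, then UTF-8 bytes.
        (length,) = struct.unpack_from("<I", self.data, self.pos)
        self.pos += 4
        raw = self.data[self.pos:self.pos + length]
        self.pos += length
        return raw.decode("utf-8")


blob = struct.pack("<d", 37.0) + struct.pack("<I", 3) + b"foo"
reader = DataReader(blob)
first = reader.read_float()
second = reader.read_string()
```

Since every read consumes exactly the bytes it needs, the instructions themselves would not need to carry offsets into the data.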

New intrinsic functions

We already have an instruction for making tuples.

The instruction sequence would finish with MAKE_CODE; RETURN_VALUE, returning the completed code object on the stack. Or, we could add another instruction, START_CODE, at the end to execute the code object and return the completed module.

Examples

Creation of the tuple (1, "a", 37.0, (2, "foo"))

LOAD_INT 1
LOAD_COMMON_NAME "a"
MAKE_FLOAT 37.0
LOAD_INT 2
MAKE_STRING "foo"
BUILD_TUPLE 2
BUILD_TUPLE 4
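To check the stack effect of a sequence like this, it can be run on a toy stack machine. The `run` function below is a hypothetical sketch, not the interpreter: operands are carried inline instead of being read from the binary data. The outer tuple has four elements, so the sequence here ends with `BUILD_TUPLE 4`.

```python
def run(instructions):
    """Execute (opname, arg) pairs and return the final value stack."""
    stack = []
    for opname, arg in instructions:
        if opname in ("LOAD_INT", "LOAD_COMMON_NAME", "MAKE_FLOAT", "MAKE_STRING"):
            # In this sketch the operand is the value itself.
            stack.append(arg)
        elif opname == "BUILD_TUPLE":
            # Pop the top `arg` values and push them back as one tuple.
            items = tuple(stack[-arg:]) if arg else ()
            if arg:
                del stack[-arg:]
            stack.append(items)
        else:
            raise ValueError(f"unknown instruction: {opname}")
    return stack


program = [
    ("LOAD_INT", 1),
    ("LOAD_COMMON_NAME", "a"),
    ("MAKE_FLOAT", 37.0),
    ("LOAD_INT", 2),
    ("MAKE_STRING", "foo"),
    ("BUILD_TUPLE", 2),   # builds (2, "foo")
    ("BUILD_TUPLE", 4),   # builds the outer four-element tuple
]
result = run(program)
```

A correct sequence leaves exactly one value on the stack, which makes an off-by-one operand to `BUILD_TUPLE` easy to spot.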

Creation of a code object would look something like this:

(Code to create names tuple)
(Code to create consts tuple)
MAKE_STRING name 
MAKE_STRING qualname
COPY n (filename will be shared for all code objects in module)
MAKE_CODE
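The role of `COPY n` can be sketched with a plain list as the stack. Everything here is an assumption for illustration: the filename is taken to already sit on the stack, `COPY n` is modelled on the existing `COPY` instruction (push the n-th item from the top), and `MAKE_CODE` is assumed to pop five values in the order shown. A `SimpleNamespace` stands in for the code object.

```python
from types import SimpleNamespace


def copy(stack, n):
    # Push the n-th item from the top without removing the original.
    stack.append(stack[-n])


def make_code(stack):
    # Assumed operand order, matching the sequence above.
    filename = stack.pop()
    qualname = stack.pop()
    name = stack.pop()
    consts = stack.pop()
    names = stack.pop()
    stack.append(SimpleNamespace(
        names=names, consts=consts, name=name,
        qualname=qualname, filename=filename,
    ))


stack = ["mod.py"]          # filename created once for the whole module
stack.append(("print",))    # (code to create names tuple)
stack.append((None,))       # (code to create consts tuple)
stack.append("f")           # MAKE_STRING name
stack.append("C.f")         # MAKE_STRING qualname
copy(stack, 5)              # COPY n: the filename is fifth from the top
make_code(stack)            # MAKE_CODE
```

After this runs, the filename is still on the stack beneath the new code object, so the next code object in the module can share the same string.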