faster-cpython / ideas


How are bytecode strings used? #538

Open brandtbucher opened 1 year ago

brandtbucher commented 1 year ago

We've been talking recently about how we might want to change the format of our instructions in different ways (variable-length instructions, wider opargs, compression of serialized forms, etc.). I think it's useful to consider all of the different forms that bytecode takes throughout a typical Python process when discussing these ideas.

The lifecycle of a string of bytecode (opcodes, opargs, and caches) currently looks something like this:

graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
FROZEN[frozen bytes]
PYC[.pyc file]
DEEPFROZEN[deep-frozen code tail]
HEAP["executable code tail (heap)"]
STATIC["executable code tail (static)"]

style RAW fill:blue
style MARSHALLED fill:blue
style FROZEN fill:blue
style PYC fill:blue
style DEEPFROZEN fill:blue
style STATIC fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW -----> |disassemble| USER

MARSHALLED --> |cache| PYC
PYC --> |import| MARSHALLED

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |freeze| FROZEN
FROZEN -.-> |deepfreeze| DEEPFROZEN
DEEPFROZEN --> |quicken| STATIC
STATIC --> |unquicken| DEEPFROZEN
STATIC --> |copy + unquicken| RAW

FROZEN --> |import| MARSHALLED

RAW -----> |copy + quicken| HEAP
HEAP -----> |copy + unquicken| RAW

The boxes in red are quickened forms, while the boxes in blue are unquickened forms. Quickening (_PyCode_Quicken) currently initializes adaptive counters and inserts superinstructions. Unquickening (deopt_code) removes superinstructions, converts other instructions back to their adaptive form, and zeroes out all caches (including counters).
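
For intuition, here is a minimal sketch of the unquickening half in C. The metadata tables (num_caches, base_opcode) are hypothetical stand-ins for CPython's generated opcode metadata, not real names:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins only -- in CPython this metadata is generated
   per opcode; these names and values are not the real ones. */
#define NUM_OPCODES 256
static const uint8_t num_caches[NUM_OPCODES];   /* cache units per opcode */
static const uint8_t base_opcode[NUM_OPCODES];  /* specialized/super -> base */

typedef struct { uint8_t opcode; uint8_t oparg; } code_unit;

/* What unquickening does, conceptually: restore each instruction to its
   base opcode and zero out every cache unit, counters included. */
static void unquicken(code_unit *code, size_t n)
{
    for (size_t i = 0; i < n;) {
        code[i].opcode = base_opcode[code[i].opcode];
        size_t caches = num_caches[code[i].opcode];
        memset(&code[i + 1], 0, caches * sizeof(code_unit));
        i += 1 + caches;
    }
}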

Let's remove frozen and cached modules, for simplicity (they're basically just marshalled bytes):

graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
DEEPFROZEN[deep-frozen code tail]
HEAP["executable code tail (heap)"]
STATIC["executable code tail (static)"]

style RAW fill:blue
style MARSHALLED fill:blue
style DEEPFROZEN fill:blue
style STATIC fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ----> |disassemble| USER

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |deepfreeze| DEEPFROZEN
DEEPFROZEN --> |quicken| STATIC
STATIC --> |unquicken| DEEPFROZEN
STATIC --> |copy + unquicken| RAW

RAW ----> |copy + quicken| HEAP
HEAP ----> |copy + unquicken| RAW

Some observations:

  1. It would simplify things a lot (especially deepfreeze) if we didn't have a concept of "quickening" or "unquickening". Perhaps a more useful model would be the ability to "reset" code to its initial quickened form, for consumers of co_code and finalization of deepfrozen code objects. This means that superinstructions and non-zero counters would be present in co_code, but no specialized instructions or other populated caches. If we do this, we only have one idempotent transformation that can be applied to the bytecode, and what we currently call "quickening" can be entirely encapsulated in the compiler, where it belongs (not even marshal or code objects need to understand it). If so, the new graph would be roughly:
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP["executable code tail (heap)"]
STATIC["executable code tail (static)"]

style RAW fill:red
style MARSHALLED fill:red
style STATIC fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ---> |disassemble| USER

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |deepfreeze| STATIC
STATIC --> |reset| STATIC
STATIC --> |copy + reset| RAW

RAW ---> |copy + reset| HEAP
HEAP ---> |copy + reset| RAW

At this point, there's not really any difference between static and heap code (we just need to reset static code at finalization):

graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP[executable code tail]

style RAW fill:red
style MARSHALLED fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ---> |disassemble| USER

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |deepfreeze| HEAP
HEAP --> |reset| HEAP

RAW --> |copy + reset| HEAP
HEAP --> |copy + reset| RAW
  2. While it's an open question whether marshal should have intimate knowledge of the bytecode format for compression purposes, it's certainly desirable to at least marshal the bytecode directly in and out of the code object's tail (and not through an intermediate bytes object); see the sketch after the graphs below:
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP[executable code tail]

style RAW fill:red
style MARSHALLED fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ---> |disassemble| USER

MARSHALLED --> |unmarshal| HEAP
HEAP --> |marshal + reset| MARSHALLED

MARSHALLED -.-> |deepfreeze| HEAP
HEAP --> |reset| HEAP

RAW --> |copy + reset| HEAP
HEAP --> |copy + reset| RAW

If marshal has a way of building code without an intermediate bytes object, then the compiler does too:

graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP[executable code tail]

style RAW fill:red
style MARSHALLED fill:red
style HEAP fill:red

COMPILER --> |compile| HEAP

MARSHALLED --> |unmarshal| HEAP
HEAP --> |marshal + reset| MARSHALLED

MARSHALLED -.-> |deepfreeze| HEAP
HEAP --> |reset| HEAP

HEAP --> |copy + reset| RAW
RAW --> |disassemble| USER
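
A toy sketch of unmarshalling directly into the tail, using stdio and malloc as stand-ins for marshal's reader and the code object allocator (the real layout and APIs differ):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint8_t opcode; uint8_t oparg; } code_unit;

/* A toy code object whose instructions live in a tail allocation,
   like co_code_adaptive in CPython (layout is illustrative only). */
typedef struct {
    size_t n_units;
    code_unit units[];          /* the executable tail */
} code_object;

/* Unmarshal straight into the tail: size the allocation first, then
   read the serialized instructions into it -- no intermediate bytes
   object in between. */
static code_object *read_code(FILE *stream, size_t n_units)
{
    code_object *co = malloc(sizeof(*co) + n_units * sizeof(code_unit));
    if (co == NULL) {
        return NULL;
    }
    co->n_units = n_units;
    if (fread(co->units, sizeof(code_unit), n_units, stream) != n_units) {
        free(co);
        return NULL;
    }
    return co;
}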

So, by changing these two relatively minor things, it seems that we can simplify our handling of the bytecode quite a bit.

brandtbucher commented 1 year ago

Another thought:

If we start having actual opargs (not caches) wider than one byte, we're going to need to decide whether we want to use a fixed endianness or the platform's native endianness. It seems to me, based on the above graphs, that having a single format for both serialized and "live" forms makes the most sense. We can use a little-endian format since that's the common case, and even on big-endian platforms, both GCC and Clang seem to generate efficient code for the byte-level accesses. So I don't imagine that the cost will be high, even in that case.
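
For illustration (not the actual CPython macros), fixed-little-endian accessors for a 16-bit oparg can be written portably; compilers typically turn the read into a single 16-bit load, plus a byte swap on big-endian targets:

#include <stdint.h>

/* Read/write a 16-bit oparg stored little-endian, regardless of the
   host's native byte order. */
static inline uint16_t read_oparg16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static inline void write_oparg16(uint8_t *p, uint16_t arg)
{
    p[0] = (uint8_t)(arg & 0xFF);
    p[1] = (uint8_t)(arg >> 8);
}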

gvanrossum commented 1 year ago

So the proposal is that the compiler generates (presumably in a new pass after everything else) quickened code, i.e. with super-instructions but no specialized instructions, and with cache counters set to their initial value. And the marshalled form is the same. The "reset" operation gets rid of specialized instructions and resets caches to their initial value (i.e. counter in its initial non-zero state and all other cache fields zeroed out).

Maybe the "primitive" operation should just be copy-and-reset? Then that can be used to get the raw bytes for the disassembler, it can be used to copy the bytes in marshal, and with some care about aliasing it can be used to reset the code in place (but do we ever need that?).
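
A sketch of what that primitive might look like (illustrative names again; num_caches and base_opcode stand in for generated opcode metadata, and it assumes for simplicity that a cached instruction's first cache entry is its counter):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NUM_OPCODES 256
static const uint8_t num_caches[NUM_OPCODES];   /* cache units per opcode */
static const uint8_t base_opcode[NUM_OPCODES];  /* specialized/super -> base */
#define INITIAL_COUNTER ((1 << 4) | 1)          /* 17, per the discussion below */

typedef struct { uint8_t opcode; uint8_t oparg; } code_unit;

/* Copy-and-reset: dst gets src's instructions with base opcodes restored,
   counters re-initialized, and all other caches zeroed. dst may alias src,
   in which case this is an in-place reset. */
static void copy_and_reset(code_unit *dst, const code_unit *src, size_t n)
{
    for (size_t i = 0; i < n;) {
        uint8_t op = base_opcode[src[i].opcode];
        size_t caches = num_caches[op];
        dst[i].opcode = op;
        dst[i].oparg = src[i].oparg;
        memset(&dst[i + 1], 0, caches * sizeof(code_unit));
        if (caches > 0) {
            uint16_t counter = INITIAL_COUNTER;   /* first cache unit */
            memcpy(&dst[i + 1], &counter, sizeof counter);
        }
        i += 1 + caches;
    }
}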

brandtbucher commented 1 year ago

(presumably in a new pass after everything else)

Sure, but it could be part of the final instruction emission that we already have.

Maybe the "primitive" operation should just be copy-and-reset?

Interesting idea! I like it.

and with some care about aliasing it can be used to reset the code in place (but do we ever need that?)

Yeah, when finalizing static (deep-frozen) code objects.

brandtbucher commented 1 year ago

One other point I glossed over that still needs ironing out: the ergonomics of pre-initialized counters on different architectures. I think the ideal solution is probably the same as the closely-related "16-bit oparg" problem discussed above: just use a fixed endianness.

Another awkward thing is that we will now have these counters present in co_code, etc. Currently, people walking over unquickened bytecode can just ignore CACHE opcodes, but counters would necessarily be non-zero.

However, using a fixed-endianness could help this: whatever the first byte of an initialized counter ends up being, just add a new COUNTER opcode with that same value.

Since we initialize the counters to 17 ((1 << 4) | 1), using the common little-endian format for bytecode counters would mean that COUNTER == 17, just as CACHE == 0 now.
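
To make that concrete, a walker over co_code could then treat both values as inline-cache bytes. CACHE's value of 0 is real; COUNTER == 17 is the proposed value, and is_inline_cache is just an illustrative helper:

#include <stdint.h>

/* CACHE is 0 today; COUNTER == 17 would match the first (little-endian)
   byte of an initialized adaptive counter. */
enum { CACHE = 0, COUNTER = (1 << 4) | 1 };

static int is_inline_cache(uint8_t opcode)
{
    return opcode == CACHE || opcode == COUNTER;
}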

gvanrossum commented 1 year ago

So we should look at what the ADAPTIVE_COUNTER macros would look like.

I think I can live with this.

markshannon commented 1 year ago

Please leave the in-memory form as is. There is no need to break 16-bit integers into two bytes. Endianness is not an issue. Marshal can save and load 16-bit values just fine.

Replace the term "bytecode string" with the term "instruction sequence", and the issue of endianness goes away. The conversion from a stream of bytes to a sequence of instructions and back can be wrapped in a pair of functions and no other code need care about it.

A quick example, saving and restoring the format INSTR_FMT_IBC0 (this can all be generated automatically):


#include <stdint.h>

/* Illustrative layout for one INSTR_FMT_IBC0 instruction:
   opcode, oparg, adaptive counter, one cache entry. */
typedef struct {
    uint8_t opcode;
    uint8_t oparg;
    uint16_t counter;
    uint16_t cache0;
} instruction_IBC0;

/* Only the opcode and oparg are serialized; the counter and cache
   are re-initialized on load. */
int read(Stream *stream, instruction_IBC0 *inst)
{
    RETURN_IF_ERROR(read_byte(stream, &inst->opcode));
    RETURN_IF_ERROR(read_byte(stream, &inst->oparg));
    inst->counter = adaptive_counter_init();
    inst->cache0 = 0;
    return SUCCESS;
}

int write(instruction_IBC0 *inst, Stream *stream)
{
    RETURN_IF_ERROR(write_byte(stream, inst->opcode));
    RETURN_IF_ERROR(write_byte(stream, inst->oparg));
    return SUCCESS;
}

gvanrossum commented 1 year ago

Endianness is not an issue. Marshal can save and load 16 bit values just fine.

Okay, but it will mean that marshal must understand the instruction encoding in detail, so it can know when to swap two bytes if the native format is big-endian (marshal always uses little-endian). Currently marshal doesn't know the instruction format, it just copies bytes.

EDIT: Disregard the following

~Or are you proposing that (if the native endianness differs from the standard marshal endianness) marshal should just swap all bytes pairs while it is copying? I think that might work, if we change the in-memory encoding to always put the opcode in the low byte and the oparg in the high byte (and similar for a subsequent oparg2/oparg3 pair if present, etc.).~

~This byte swap would only be needed if the endianness differs, so on common platforms (ix64, ARM) it's a no-op. That's nice, except we need at least one big-endian buildbot to keep us from accidentally breaking this aspect.~

brandtbucher commented 1 year ago

Well, it could be ignorant of the instruction format if we just treat the instruction stream as an array of 16-bit values. It doesn't really matter if something is a counter, opcode, 8-bit oparg, or 16-bit oparg, as long as the 16-bit values are 16-bit aligned in the stream. Marshal just walks over them, swapping bytes if needed in r_short and w_short like it already does.

I like that idea (it avoids needing to hack around a fixed-endian counter). The only oddity is that the actual bytes of co_code (in memory) will now be platform-dependent for counters and wide opargs. The "counter" case will be a bit awkward for anything walking over the bytecode as opcode/oparg pairs, since you can skip CACHE opcodes now, but the first byte of an initialized counter will be platform-specific.

But I think I can live with that... it's a problem we'll already have to face with wide opargs and variable-length instructions. I think the solution is just to encourage people to use dis when walking over bytecode if this is an issue for them.

gvanrossum commented 1 year ago

Yeah, interpreting cache entries as opcodes (e.g. CACHE) just isn't tenable; it's at best a debugging hack.

We'd end up with the opcode/oparg pair being the low/high byte instead of the first/second byte, but that can all be hidden in macros or platform-dependent struct definitions.
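
For instance (a sketch, not the real CPython definitions), the accessors only ever see the 16-bit value, so the per-platform byte layout never leaks out:

#include <stdint.h>

typedef uint16_t code_unit;   /* one instruction word, native byte order */

/* Opcode in the low byte, oparg in the high byte of the 16-bit value;
   which of the two bytes comes first in memory depends on the host. */
#define UNIT_OPCODE(u)     ((uint8_t)((u) & 0xFF))
#define UNIT_OPARG(u)      ((uint8_t)((u) >> 8))
#define MAKE_UNIT(op, arg) ((code_unit)((uint8_t)(op) | ((uint16_t)(arg) << 8)))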

However, note that Mark is after something else, which is actually much closer to the PR we decided to abandon over the holidays. Maybe you should just dust that one off and switch to using the new opcode_metadata.h (in particular the INSTR_FMT enums).

brandtbucher commented 1 year ago

Sounds good. I'm logging off for the day, but I'll read any more comments and pick this work up tomorrow.

arhadthedev commented 1 year ago

I think the ideal solution is probably the same as the closely-related "16-bit oparg" problem discussed above: just use a fixed endianness.

Can we use a native endianness and communicate it through MAGIC_NUMBER instead? For example, each instruction addition/removal may get not one but two magic increments, one per endianness:

 #     Python 3.12a1 3516 (Add COMPARE_AND_BRANCH instruction)
+#     Python 3.12a? 3517 (Add FOO_BAR for little-endian platforms)
+#     Python 3.12a? 3518 (Add FOO_BAR for big-endian platforms)

It's worth noting that we don't need to allocate specific bit flags in MAGIC_NUMBER; any diversity of values is enough for the interpreter to reject incorrect endianness.

gvanrossum commented 1 year ago

@arhadthedev I don't like having two different .pyc formats, especially since this would mean that the marshal format itself would become platform-specific, which it never has been before. The marshal format (which does not have a magic number itself) is occasionally used in different contexts, and it would be a regression if it was no longer platform-independent.

arhadthedev commented 1 year ago

Especially since this would mean that the marshal format itself would become platform-specific, which it never has been before.

I didn't think about this; thank you for the clarification.

markshannon commented 1 year ago

We'd end with the opcode/oparg pair being the low/high byte instead of the first/second byte, but that can all be hidden in macros or platform-dependent struct definitions.

We don't need any of this. There is no endianness in the definition of a code unit, nor does there need to be in marshal.

https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html

gvanrossum commented 1 year ago

Sure, if we go the route you indicated in your earlier comment we don't need explicit endianness (it's implicit in all that code). I have some concerns about building knowledge about the instruction set into marshal too though, see https://github.com/python/cpython/pull/99555, esp. the reasons it was closed unmerged. We should revisit that decision.