faster-cpython / ideas


Millisecond startup #505

Open markshannon opened 1 year ago

markshannon commented 1 year ago

Currently, on my machine, python3.11 -S -c "" takes about 20ms, and python3.11 -c "" takes about 40ms. This is a long time to do nothing.

cat /usr/bin/python3.11 > /dev/null takes about 5ms, which shows that the entire executable can be loaded and every byte of it read and piped to stdout in less than 5ms.

Executing python3.11 -c "" does no actual work, so it is reasonable to expect that it needs only a small fraction of the executable, and thus could complete in one millisecond.

We should aim for python3.11 -S -c "" to take one millisecond and python3.11 -c "" to take 2-3 milliseconds.

How can this be done?

Good profiling information

First of all, we need to know where we are spending time, in case there are any major inefficiencies or redundancies that can be removed.
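As a starting point, the end-to-end numbers quoted at the top of the issue can be reproduced with a small timing harness like the sketch below (illustrative only; a real investigation would use perf or callgrind for a per-function breakdown, as done later in this thread):

```python
import subprocess
import sys
import time

def startup_time_ms(args, repeat=5):
    """Best-of-N wall-clock time in milliseconds for one interpreter invocation.

    Best-of-N is used rather than an average to reduce noise from a cold
    filesystem cache or scheduler jitter.
    """
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        subprocess.run(args, check=True)
        best = min(best, (time.perf_counter() - t0) * 1000)
    return best

# Compare the two invocations from the issue, using this interpreter:
print(f"-S -c '': {startup_time_ms([sys.executable, '-S', '-c', '']):.1f} ms")
print(f"-c '':    {startup_time_ms([sys.executable, '-c', '']):.1f} ms")
```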

Starting the interpreter in one millisecond

By taking a snapshot of the VM immediately before parsing the command line code, we can recreate the memory state very quickly.

Importing site.py, and other modules in one or two milliseconds

The code object is key here. We need to create code objects quickly. We can no longer freeze code objects, as they are effectively mutable, but we can freeze data that would allow us to recreate the code object with little more than a memcpy.

This probably means that code objects should not be part of the co_consts array, allowing the constant array to be cleanly frozen.

Potential problems

In order to parse anything, we need the parser, and the PEG parser is very large, so it might cause us to load more code than we'd like; it will also evict everything else from the icache.

Snapshots need fixing up. For example, file descriptors need to be opened; they cannot just be copied from disk. If there is too much volatile state, it could slow down startup too much.

This might just be too tricky to implement. Retrofitting this to the existing VM is much harder than implementing it in a new VM.

gvanrossum commented 1 year ago

What about hash randomization? This affects the layout of every dict.

methane commented 1 year ago

This probably means that code objects should not be part of the co_consts array, allowing the constant array to be cleanly frozen.

The constant array includes bytes and Unicode objects. They have a hash cache, so they are mutable.

markshannon commented 1 year ago

What about hash randomization? This affects the layout of every dict.

Not since 3.6

markshannon commented 1 year ago

This probably means that code objects should not be part of the co_consts array, allowing the constant array to be cleanly frozen.

The constant array includes bytes and Unicode objects. They have a hash cache, so they are mutable.

These are what we might call "idempotent immutable", if immortal. Once the hash has been computed, they are immutable. Since the hash is deterministic (within any given process) there is no race condition in computing the hash, so the objects can be shared between threads.

We might want to copy the data ourselves to avoid page faults, or leave it up to the O/S to perform copy-on-write. We can experiment later.
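The "idempotent immutable" property can be sketched in a few lines of Python. All names here are illustrative, not CPython's; the point is that the only mutable slot goes from an unset sentinel to a deterministic value, so a data race between threads is benign:

```python
UNSET = -1  # sentinel meaning "hash not computed yet" (CPython also reserves -1)

class FrozenBytes:
    """Toy model of a frozen constant with a lazily filled hash cache."""
    __slots__ = ("data", "_hash")

    def __init__(self, data):
        self.data = bytes(data)
        self._hash = UNSET  # the only mutable slot

    def __hash__(self):
        if self._hash == UNSET:
            h = hash(self.data)
            # The hash is deterministic within this process, so if two
            # threads race here they both write the same value: the object
            # is "idempotent immutable" and safe to share.
            self._hash = -2 if h == UNSET else h  # keep the sentinel reserved
        return self._hash
```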

itamaro commented 1 year ago

We can no longer freeze code objects, as they are effectively mutable

Is this solely due to specialization, with inline specialized bytecode? My concern is that taking immutability away from code objects will have many downsides, like immortalization no longer being applicable to code objects, inability to freeze them, and other cool tricks (thinking about the Alibaba memory mapping project). I can see why inline bytecode caching is ideal for quickening (memory locality and whatnot), but have you explored alternatives that allow maintaining code object immutability?

gvanrossum commented 1 year ago

What about hash randomization? This affects the layout of every dict.

Not since 3.6

Then where is the hash table? I know the public API now iterates by insertion order, but I assume there's still a hash table somewhere in memory. I assume this can be dealt with in the fixup phase though.

Snapshots need fixing up. For example, file descriptors need to opened, they cannot just be copied from disk.

The environment dict would need to be updated, for another example.

We can no longer freeze code objects, as they are effectively mutable

Like Itamar, I'm not sure why code objects would need special treatment. The mutable parts would seem to be specialized opcodes and inline caches, and I don't see why we couldn't just dump their current state to disk and read it back and be in business. Don't we re-specialize every once in a while anyway?

markshannon commented 1 year ago

@itamaro

Is this solely due to specialization, with inline specialized bytecode?

PEP 659 adaptive specialization is the main reason, yes, but PEP 669 instrumentation will also change the bytecode (at the C level). This means that code objects cannot be shared across processes, but much of the code object can be shared. If we change code objects to look something like:

struct PyCodeObject {
    HEADER
    ImmutableData *immutable;
    /* Mutable data here */
    _Py_CODEUNIT code[1];
};

Then all the immutable data (exception table, line table, etc.) can be shared.

The current design of code objects includes code objects in the constant table. If we want to put the constant table into the immutable data we will need to remove code objects from the constant table.

We could create the code object in the bytecode, but that does create a bit of a chicken and egg problem with creating the module code object.
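The split proposed by the struct above can be modeled in Python. This is a hedged sketch with hypothetical names (ImmutableCodeData, materialize), not CPython's actual layout; it shows why recreating a code object becomes "little more than a memcpy" once the immutable data is shared:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImmutableCodeData:
    """The shareable, freezable part: constants, names, tables."""
    consts: tuple
    names: tuple
    exception_table: bytes
    line_table: bytes

@dataclass
class CodeObject:
    """The per-process part: a pointer to shared data plus mutable bytecode."""
    immutable: ImmutableCodeData
    code: bytearray  # adaptive bytecode, rewritten by specialization

def materialize(shared, raw_bytecode):
    # "Little more than a memcpy": only the mutable bytecode is copied;
    # the immutable data is referenced, not duplicated.
    return CodeObject(immutable=shared, code=bytearray(raw_bytecode))
```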

markshannon commented 1 year ago

@gvanrossum

Then where is the hash table?

This will need to be created at runtime, but that could be done lazily, like hash codes for strings.

One way to do this could be through the lookup pointer, which would initialize the table and then replace itself with the normal function pointer. Alternatively, by initializing all hashes to -1, the initialization code could be inserted into the error handling path at little cost.
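The self-replacing lookup pointer can be sketched in Python (the class and method names here are illustrative, not CPython's): the first lookup rebuilds the index table and then swaps in the fast path, so fully initialized dicts pay nothing:

```python
class SnapshotDict:
    """Toy model of a dict restored from a snapshot without its hash table."""

    def __init__(self, entries):
        self.entries = list(entries)  # (key, value) pairs kept in the snapshot
        self.indices = None           # hash table dropped from the snapshot
        self._lookup = self._lookup_and_build  # the "lookup pointer"

    def _lookup_and_build(self, key):
        # First lookup: rebuild the index table, then replace ourselves
        # with the normal fast path.
        self.indices = {k: i for i, (k, _) in enumerate(self.entries)}
        self._lookup = self._lookup_fast
        return self._lookup_fast(key)

    def _lookup_fast(self, key):
        return self.entries[self.indices[key]][1]

    def __getitem__(self, key):
        return self._lookup(key)
```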

I don't see why we couldn't just dump their current state to disk and read it back and be in business.

The caches include pointers to objects and version numbers. Those cannot be saved to disk.

methane commented 1 year ago

This will need creating at runtime, but could be done lazily at runtime like hashcodes for strings.

https://github.com/python/cpython/blob/main/Objects/dictobject.c#L1370

build_indices_unicode is the function that rebuilds only the hash table (indices) without modifying the dict's body (entries). Rebuilding the indices is much faster than rebuilding the entire dict.
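A rough Python analogue of that rebuild step: because the entries (with their cached hashes) survive intact, only the small index table has to be refilled, and no keys are re-hashed or copied. This is a simplified sketch using linear probing; CPython's real probing uses a perturb sequence:

```python
def build_indices(entries, size):
    """Rebuild only the index table from intact entries.

    entries: list of (cached_hash, key, value) triples
    size:    power-of-two table size, strictly larger than len(entries)
             so probing always finds a free slot
    """
    mask = size - 1
    indices = [-1] * size  # -1 marks an empty slot, like DKIX_EMPTY
    for i, (h, _key, _value) in enumerate(entries):
        j = h & mask
        while indices[j] != -1:   # linear probing for this sketch only
            j = (j + 1) & mask
        indices[j] = i
    return indices
```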

BrandonTheBuilder commented 1 year ago

Looking at a callgrind profile from Python 3.12.0a5+ (heads/main:0f89acf6cc, Feb 27 2023, 13:29:42), the majority of time is spent in dictobject.c. Could we use a tool like gperf to statically compile hash maps for runtime internals (attr dicts and MROs of builtin types)? Wrapping these prebuilt maps in a dictionary-like object that falls back to creating a dictobject on insertion would allow them to be used in user code as well.

[callgrind profile screenshot]
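The proposed wrapper could look something like the sketch below. This is a hypothetical illustration, not a design: the prebuilt mapping stands in for a gperf-generated perfect-hash table, and mutation lazily falls back to an ordinary dict overlay:

```python
class FrozenMapWithFallback:
    """Read-only prebuilt map that degrades to a real dict on first insertion.

    The sketch ignores deletion and iteration; it only shows the
    lookup-fast / mutate-slow split described above.
    """

    def __init__(self, prebuilt):
        self._frozen = prebuilt   # stands in for a statically compiled table
        self._overlay = None      # created only if user code mutates us

    def __getitem__(self, key):
        if self._overlay is not None and key in self._overlay:
            return self._overlay[key]
        return self._frozen[key]  # common case: hit the prebuilt table

    def __setitem__(self, key, value):
        if self._overlay is None:
            self._overlay = {}    # lazy fallback to an ordinary dict
        self._overlay[key] = value
```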