faster-cpython / ideas


Load pyc files for modules in parallel with execution of earlier modules #493

Open vali19th opened 1 year ago

vali19th commented 1 year ago

The interpreter starts N+1 threads. T0 tries to schedule N ops to run at the same time and maintains a queue of upcoming ops, each with a dependency list (the ids of the threads working on data that the op needs).

For example, r = (a+b) * (b+c) + a + c can be split into two threads: T1 (r1 = a + b) and T2 (r2 = b + c). The next op in the queue is r3 = r1 * r2, which can only be scheduled once both T1 and T2 are done, but we can schedule T3 (r4 = a + c) before we compute r3.
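A minimal sketch of what that dependency-aware scheduling could look like, using `concurrent.futures` threads in place of the proposed interpreter threads (the thread/variable names T1-T3, r1-r4 follow the example above; this is an illustration, not the proposed implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_example(a, b, c):
    """Evaluate r = (a+b) * (b+c) + a + c, running the independent
    sub-expressions on separate threads as in the example above."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        t1 = pool.submit(lambda: a + b)  # T1: r1 = a + b
        t2 = pool.submit(lambda: b + c)  # T2: r2 = b + c
        t3 = pool.submit(lambda: a + c)  # T3: r4 = a + c, scheduled before r3
        # r3 = r1 * r2 depends on T1 and T2, so we block on both results,
        # but T3 was already submitted and may run in the meantime.
        r3 = t1.result() * t2.result()
        return r3 + t3.result()

print(schedule_example(1, 2, 3))  # (1+2) * (2+3) + 1 + 3 -> 19
```

In real CPython the GIL means these threads would not actually run the additions in parallel; the sketch only shows the scheduling order the proposal describes.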

The scheduler will need a peek-ahead limit instead of trying to keep all threads running constantly, because keeping everything queued could consume too much memory.

These queued ops can be evaluated eagerly most of the time (or all the time).

The dependency lists can be optimized by allocating a single structure of size N * M once and using it for one-hot encoding, instead of appending and popping dependencies, where N is the number of threads and M the peek-ahead limit. This increases memory very little but reduces computation.

The memory overhead can be reduced by storing M items instead of M * N, converting each one-hot encoding to a number (00000101 -> 5). If N <= 8 then a byte is enough; otherwise use a small int.
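A hedged sketch of that bitmask encoding, assuming thread ids 0..N-1 (the helper names are illustrative):

```python
# One-hot dependency encoding packed into a single small int, as described
# above: bit i set means the op is waiting on thread i.

def encode_deps(thread_ids):
    """Pack a dependency list into one integer, e.g. threads {0, 2}
    become 0b00000101 == 5. For N <= 8 this fits in a byte."""
    mask = 0
    for tid in thread_ids:
        mask |= 1 << tid
    return mask

def mark_done(mask, tid):
    """Clear a thread's bit when it finishes; the op is ready to
    schedule once the mask reaches 0."""
    return mask & ~(1 << tid)

deps = encode_deps([0, 2])        # 0b00000101 -> 5
deps = mark_done(deps, 0)         # 0b00000100 -> 4
ready = mark_done(deps, 2) == 0   # True: all dependencies satisfied
```

Checking readiness is then a single integer comparison per op, rather than scanning a list.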

It might simplify things a bit if the code were rearranged in the "compile to bytecode" step in the same fashion the scheduler would do it, introducing micro-threads for the parts that can run in parallel. This could work around the GIL by disabling it between those threads starting and finishing.

jneb commented 1 year ago

That's a compelling reason to work on removing the GIL. Fortunately, they're working on it already!

markshannon commented 1 year ago

The GIL is a fundamental part of CPython for now (although there is the nogil project), so we can't execute code in parallel; that rules out defining functions and initializing variables. Evaluating expressions on primitives/immutable data is often done in the compiler anyway, and is so fine-grained as to not be worth the synchronization overhead.

That leaves loading the pyc files for imports. Given that many modules unconditionally import other modules, we could queue up the I/O on another thread.

I've taken the liberty of retitling the issue, accordingly.

Upon starting import of a module we can queue up the loading of the modules that it unconditionally imports. This will waste work if the import fails, but that is rare.
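A minimal sketch of that idea: a helper thread reads the cached `.pyc` bytes ahead of time while the main thread keeps executing, so a later import only has to unmarshal already-buffered data. The module list, cache dict, and sentinel protocol here are illustrative, not CPython's actual import machinery:

```python
import importlib.util
import queue
import threading

preloaded = {}       # module name -> raw .pyc bytes, consumed by import later
work = queue.Queue()

def loader():
    """Background thread: resolve each queued module to its cached .pyc
    (if any) and read the bytes into memory ahead of the actual import."""
    while True:
        name = work.get()
        if name is None:          # sentinel: no more modules to preload
            break
        spec = importlib.util.find_spec(name)
        cached = spec.cached if spec else None  # path to the .pyc, or None
        if cached:
            try:
                with open(cached, "rb") as f:
                    preloaded[name] = f.read()
            except OSError:
                pass              # import falls back to the normal path

t = threading.Thread(target=loader, daemon=True)
t.start()

# Upon starting import of a module, queue the modules it unconditionally
# imports; the work wasted when the import later fails is rare.
for name in ("json", "collections", "re"):
    work.put(name)
work.put(None)
t.join()
```

Since the work is file I/O, the helper thread releases the GIL while reading, which is exactly why this is the part of importing that can overlap with execution of earlier modules.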