
Prototype a "subinterpreter pool" for multiprocessing #606

mdboom opened 1 year ago

mdboom commented 1 year ago

As a first foray into learning more about subinterpreter behavior on existing code, build a prototype "SubinterpreterPool" on top of the multiprocessing API.

Cc: @ericsnowcurrently

ericsnowcurrently commented 1 year ago

You may want to take a look at https://github.com/jsbueno/extrainterpreters.

CC @jsbueno

jsbueno commented 1 year ago

Hi - I've paused extrainterpreters for a couple of weeks due to <things> - but yes, the roadmap is to have concurrent.futures compatibility in it.

For now, we have threading.Thread(target=...).start() compatibility (+ return values), and we need someone who could build a small extension for locking purposes under Windows.

mdboom commented 1 year ago

I have a SubinterpreterPool for multiprocessing that uses the memoryboard abstraction from extrainterpreters, and a pipe to handle blocking read/writes to it here: https://github.com/python/cpython/compare/main...mdboom:subinterpreter-pool-memoryboard?expand=1

I'll post some performance results etc. over the next little while.

ericsnowcurrently commented 1 year ago

CC @tonybaloney

mdboom commented 1 year ago

(figure: nbody benchmark results)

Results of running the nbody benchmark using all 16 virtual cores on an 8-core system. The choice of benchmark here probably has a considerable impact on the results, so don't take any of this as definitive; more benchmarks with different characteristics will need to be tested.

Methods

At the core of multiprocessing is a "work handler loop", which receives tasks (a function and some arguments) from an input queue, and then puts the result in an output queue.
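
As a simplified sketch (not the actual code in Lib/multiprocessing/pool.py), just to fix ideas, that loop looks something like this:

def worker(inqueue, outqueue):
    # Receive tasks until a sentinel arrives, run each one, and put the
    # result (or the exception) on the output queue.
    while True:
        task = inqueue.get()
        if task is None:  # sentinel: shut this worker down
            break
        job_id, func, args, kwargs = task
        try:
            result = (True, func(*args, **kwargs))
        except Exception as exc:
            result = (False, exc)
        outqueue.put((job_id, result))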

I've implemented two and a half different methods for running this work handler loop with subinterpreters, all of which have very similar performance in terms of runtime and memory.

In all cases, subinterpreters are managed using Lib/test/support/interpreters.py, which is an experimental implementation of PEP 554.
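
For reference, that experimental module is used along these lines (a minimal sketch; the API is still in flux):

from test.support import interpreters

interp = interpreters.create()  # spawn a new subinterpreter
interp.run("print('hello from a subinterpreter')")  # run source code in it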

Method one

As with regular multiprocessing using subprocesses, each worker in the pool has its own thread in the main interpreter. Inside each of those threads, a subinterpreter runs multiprocessing's existing "work handler loop", unmodified.

Since a queue.SimpleQueue cannot be used to send objects between subinterpreters, work is sent to the loop using a LockableBoard from the extrainterpreters project. Objects can be added to and removed from a LockableBoard by multiple subinterpreters, and it enforces that only one subinterpreter can access an object at a time. The objects must be of a shareable type, so tasks are pickled/unpickled to send them back and forth, but the pickle data itself does not need to be copied.

Since the worker loop needs to block waiting for more tasks and the result handler needs to block waiting for more results, an os.pipe is used to communicate between interpreters when new data is ready to be read from the LockableBoard. Experimentally, this was much faster than polling in a Python loop.
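
Conceptually, the pipe is just a wakeup signal, not the data channel itself. A minimal sketch of the idea (illustrative names, not the actual branch code):

import os

r, w = os.pipe()

def notify_new_work():
    # Producer side: after putting a task on the LockableBoard, write a
    # single byte to wake up a blocked consumer.
    os.write(w, b"\x00")

def wait_for_work():
    # Consumer side: block on the pipe until a producer writes, instead
    # of polling the board in a Python loop.
    os.read(r, 1)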

(extrainterpreters also contains a Queue class that is closer to what is needed here, but it doesn't appear to support blocking reads yet -- though that could be on me for not understanding it.)

Method two

As above, each worker in the pool starts a new thread in the main interpreter. Each of these threads owns exactly one subinterpreter, but rather than the subinterpreter running the work handler loop itself, the loop keeps running in the thread on the main interpreter, and tasks are sent to the subinterpreter one at a time.

Since the work handler is just a regular thread, it can use queue.SimpleQueue to receive tasks and return results.

When a task is received from the queue in the worker thread, the task data is pickled and sent to the subinterpreter using interpreter.run().

A modification to interpreter.run() was made (going beyond PEP 554) to run code in eval mode, so that a return value can be obtained. For safety, this enforces that the return value is a subinterpreter-shareable object, so to support arbitrary objects, the code that runs a task inside the subinterpreter returns a pickled copy of the return value.

Specifically, the following code is run when initializing each subinterpreter:

import pickle

def _f(p):
    # Unpickle the (func, args, kwargs) task, call it, and pickle the
    # result so it can be returned as a shareable bytes object.
    func, args, kwargs = pickle.loads(p)
    return pickle.dumps(func(*args, **kwargs))

And the following code runs each task, where {pickle!r} interpolates the pickled bytes of a (func, args, kwargs) triple:

_f({pickle!r})
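
Putting the pieces together, the worker-thread side looks roughly like this (hypothetical names; it assumes the eval-mode extension to interpreter.run() described above, which returns the shareable result of the expression):

import pickle
from test.support import interpreters

INIT_CODE = """
import pickle

def _f(p):
    func, args, kwargs = pickle.loads(p)
    return pickle.dumps(func(*args, **kwargs))
"""

interp = interpreters.create()
interp.run(INIT_CODE)

def run_task(func, args, kwargs):
    payload = pickle.dumps((func, args, kwargs))
    # Assumes the eval-mode extension: run() returns the value of the
    # expression, here the pickled return value of the task.
    result = interp.run(f"_f({payload!r})")
    return pickle.loads(result)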

Method three

This method is identical to method two, except the return value from the subinterpreter is sent over a pipe back to the worker thread in the main interpreter. This does not require the extension to PEP 554 to run code in eval mode. Unlike the pipe in method one, this uses one pipe per worker thread, so the individual workers never contend over a single pipe.
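
A sketch of what that per-worker result channel could look like (illustrative only; send_result would run inside the subinterpreter and recv_result in the worker thread):

import os
import pickle
import struct

def send_result(fd, value):
    # Subinterpreter side: length-prefix the pickled result so the reader
    # knows how many bytes to expect.
    data = pickle.dumps(value)
    os.write(fd, struct.pack("<I", len(data)) + data)

def recv_result(fd):
    # Worker-thread side: read the 4-byte length, then the payload.
    size, = struct.unpack("<I", os.read(fd, 4))
    chunks = []
    while size:
        chunk = os.read(fd, size)
        chunks.append(chunk)
        size -= len(chunk)
    return pickle.loads(b"".join(chunks))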

More info: My branch of CPython, benchmarks.

mdboom commented 1 year ago

Another benchmark, this time a "do nothing" task that sends large lists of floats back and forth across the work queue, to measure the message-passing overhead of each approach.
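
Roughly the shape of that benchmark (a hypothetical reconstruction, not the actual benchmark code):

from multiprocessing import Pool

def identity(chunk):
    # The "work" is nothing at all, so the measured time is dominated by
    # pickling and queue traffic.
    return chunk

if __name__ == "__main__":
    payload = [[float(i) for i in range(10_000)] for _ in range(100)]
    with Pool(16) as pool:
        results = pool.map(identity, payload)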

It's clear that "Method 2" has a slight speed advantage over "Method 1", because no locking is required between the subinterpreter and the worker thread (on the main interpreter) that owns it: it's really just sequential code passing a pointer to an object from one interpreter to another. However, it is not clear that passing an object that way (even one with a refcount of 1 that's a built-in non-container type) is safe in the general case.

Both subinterpreter methods have a speed advantage over subprocesses in this case. In broad strokes, as the message-passing overhead dominates, subinterpreters are a win over subprocesses. All of this comes with the proviso that the subinterpreter code is neither as well-optimized nor as robust as the subprocess code, so don't draw too many conclusions.

"Method 3" simply no longer works -- since the pipe is across two things in the same thread (the worker thread on the main interpreter, and the subinterpreter that it owns), we are limited by the pipe's buffer size. Making this work would require another thread to empty the pipe as it's written to, and I'm doubtful that would have any speed advantage.

(figure: data-passing benchmark results)

JunyiXie commented 6 months ago

Hello, is there any follow-up progress on this part of the work? The existing subinterpreter API is relatively low-level and not easy to use.