colesbury / nogil

Multithreaded Python without the GIL

Corner-case performance degradation #121

Open ppolewicz opened 1 year ago

ppolewicz commented 1 year ago

This might be a known limitation, and I'm pretty sure it is something one shouldn't be doing in nogil mode. There is an easy workaround for now (just set `PYTHONGIL=1` in such a case), but I thought it might be interesting to show the results of the test I ran:

1 million increments per thread, 3 threads:

| docker image | PYTHONGIL | time | slowdown | result |
|---|---|---|---|---|
| python:3.9-slim | not set | 0m1,014s | 1x | 1608925 |
| nogil/python | not set | 0m46,443s | 45.8x | 1008232 |
| nogil/python | 1 | 0m1,047s | 1.032x | 3000000 |

5 million increments per thread, 3 threads:

| docker image | PYTHONGIL | time | slowdown | result |
|---|---|---|---|---|
| python:3.9-slim | not set | 0m2,110s | 1x | 7823878 |
| nogil/python | not set | 3m44,655s | 106.47x | 5024261 |
| nogil/python | 1 | 0m2,127s | 1.0080x | 15000000 |

1 million increments per thread, 10 threads:

| docker image | PYTHONGIL | time | slowdown | result |
|---|---|---|---|---|
| python:3.9-slim | not set | 0m1,702s | 1x | 3051994 |
| nogil/python | not set | 2m14,190s | 78.84x | 1002417 |
| nogil/python | 1 | 0m1,761s | 1.0346x | 10000000 |

environment: Ubuntu VM running on 5 cores of an i5-9600KF CPU @ 3.70GHz

It looks like nogil/python with `PYTHONGIL=1` provides a more consistent `+=` than the official CPython release. Not sure if that's intended. There was no nogil 3.12 docker image, so I didn't test that.

Code

incrementor.py

```python
import sys
print('nogil', hasattr(sys.flags, 'nogil') and sys.flags.nogil)

import threading

THREAD_COUNT = 3
BY_HOW_MUCH = 1_000_000

class Incrementor:
    def __init__(self):
        self.c = 0

def incr(incrementor, by_how_much):
    for i in range(by_how_much):
        incrementor.c += 1

incrementor = Incrementor()

threads = [
    threading.Thread(target=incr, args=(incrementor, BY_HOW_MUCH))
    for i in range(THREAD_COUNT)
]

for t in threads:
    t.start()

for t in threads:
    t.join()

print(incrementor.c)
```

Dockerfile

```dockerfile
FROM nogil/python
#FROM python:3.9-slim
#ENV PYTHONGIL=1

WORKDIR /usr/src/app
COPY incrementor.py .
CMD ["python3", "incrementor.py"]
```
some raw test results:

```sh
$ time docker run -it --rm --name incrementor-running incrementor
Sending build context to Docker daemon  4.608kB
Step 1/4 : FROM python:3.9-slim
 ---> 313d2f483acd
Step 2/4 : WORKDIR /usr/src/app
 ---> Using cache
 ---> 21fe8413ac4b
Step 3/4 : COPY incrementor.py .
 ---> Using cache
 ---> e12f9c4f202c
Step 4/4 : CMD ["python3", "incrementor.py"]
 ---> Using cache
 ---> df42d07b5298
Successfully built df42d07b5298
Successfully tagged incrementor:latest
nogil False
1608925

real    0m1,014s
user    0m0,016s
sys     0m0,016s
```

```sh
$ time docker run -it --rm --name incrementor-running incrementor
Sending build context to Docker daemon  4.608kB
Step 1/4 : FROM nogil/python
 ---> fb4c356acc04
Step 2/4 : WORKDIR /usr/src/app
 ---> Using cache
 ---> 7c0cc98d04d0
Step 3/4 : COPY incrementor.py .
 ---> 91b301e2e554
Step 4/4 : CMD ["python3", "incrementor.py"]
 ---> Running in 8bf5cd50d44a
Removing intermediate container 8bf5cd50d44a
 ---> 0604276a25f2
Successfully built 0604276a25f2
Successfully tagged incrementor:latest
nogil True
1008232

real    0m46,443s
user    0m0,021s
sys     0m0,014s
```

```sh
$ time docker run -it --rm --name incrementor-running incrementor
Sending build context to Docker daemon  4.608kB
Step 1/5 : FROM nogil/python
 ---> fb4c356acc04
Step 2/5 : ENV PYTHONGIL=1
 ---> Running in c541a1442b74
Removing intermediate container c541a1442b74
 ---> 242047eabdba
Step 3/5 : WORKDIR /usr/src/app
 ---> Running in 68c1395b279f
Removing intermediate container 68c1395b279f
 ---> 4f2bc8211e90
Step 4/5 : COPY incrementor.py .
 ---> 7bc7aa0dc555
Step 5/5 : CMD ["python3", "incrementor.py"]
 ---> Running in 45fc225db0eb
Removing intermediate container 45fc225db0eb
 ---> 5d79a20dfb2b
Successfully built 5d79a20dfb2b
Successfully tagged incrementor:latest
nogil False
3000000

real    0m1,291s
user    0m0,015s
sys     0m0,019s
```
pjurgielewicz commented 1 year ago

Hi @ppolewicz, if I understand correctly, this is somewhat expected behavior due to biased reference counting: non-owning threads can only access a shared object's refcount field using atomics.

In my tests I was also able to degrade queue performance, which is relatively easy:

```python
import time
import threading
from queue import SimpleQueue, Queue

def producer(queue):
    buf = bytearray(1500)
    for i in range(10**6):
        queue.put((i, buf))
    queue.put(None)  # Add a sentinel value to indicate the end of data

def consumer(queue):
    while True:
        item = queue.get(timeout=10**0)
        if item is None:
            break
        # Process the item here (e.g., print or do some computation)

# Create a SimpleQueue object
# queue = Queue()
queue = SimpleQueue()

# Create the producer and consumer threads
producer_thread = threading.Thread(target=producer, args=(queue,))
consumer_thread = threading.Thread(target=consumer, args=(queue,))

# Start the timer
start_time = time.time()

# Start the threads
consumer_thread.start()
producer_thread.start()

# Wait for the threads to finish
producer_thread.join()
consumer_thread.join()

# Calculate the elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print the performance metrics
print(f"Data exchange performance: {elapsed_time} seconds.")
```

In nogil CPython there are 2 (or more) threads concurrently fighting for access to the resource (the queue), leading to enormous overhead, while the GIL prevents that (on my machine there is a 2-orders-of-magnitude difference in execution speed between nogil and regular CPython).

All in all, nogil requires a slightly different approach to data management. In the case of queues, a solution with greedy data buffering on the client side gives good results in my project.
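The buffering idea above can be sketched roughly like this (a hedged illustration of the general batching pattern, not code from any project; `BATCH` and the variable names are made up). Instead of one `queue.put()` per item, the producer accumulates a local list and pushes it in a single operation, so the two threads contend on the queue far less often:

```python
import threading
from queue import SimpleQueue

BATCH = 1000  # illustrative batch size

def producer(queue, n_items):
    batch = []
    for i in range(n_items):
        batch.append(i)
        if len(batch) >= BATCH:
            queue.put(batch)  # one contended operation per BATCH items
            batch = []
    if batch:
        queue.put(batch)      # flush the final partial batch
    queue.put(None)           # sentinel: end of data

def consumer(queue, out):
    while True:
        batch = queue.get()
        if batch is None:
            break
        out.extend(batch)     # process the whole batch locally

queue, received = SimpleQueue(), []
p = threading.Thread(target=producer, args=(queue, 10_000))
c = threading.Thread(target=consumer, args=(queue, received))
p.start(); c.start()
p.join(); c.join()
print(len(received))  # 10000
```

The trade-off is latency: items sit in the producer's local buffer until the batch fills, which is fine for throughput-oriented pipelines but not for request/response patterns.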

ppolewicz commented 1 year ago

The GIL required workarounds, and biased reference counting may sometimes require workarounds too. It'll be harder to run into those, I hope.

If my understanding is correct, `incr()` shouldn't be touching the reference counter of the `incrementor` object a million times; rather, it's the integers assigned to `c` that are being DECREFed with atomics. I switched it to `^= True` in the hope of getting rid of that effect, since `True` and `False` should be immortal, but it didn't help. Am I missing something? Does nogil/python immortalize `True` and `False`, or was that only a 3.12 thing?
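One rough way to check whether a given build treats `True` as immortal is to watch its reported refcount (a heuristic, not an official API: PEP 683 immortal objects in CPython 3.12+ keep a fixed refcount, so creating references doesn't change it; nogil-3.9 has its own immortalization scheme, so results there may differ):

```python
import sys

def looks_immortal(obj) -> bool:
    # Heuristic: if creating many new strong references leaves the
    # reported refcount unchanged, the object is treated as immortal.
    before = sys.getrefcount(obj)
    refs = [obj] * 100  # 100 extra strong references
    after = sys.getrefcount(obj)
    del refs
    return before == after

print(looks_immortal(True))      # True on CPython 3.12+; build-dependent elsewhere
print(looks_immortal(object()))  # False: an ordinary object's refcount moves
```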

pjurgielewicz commented 1 year ago

@ppolewicz

In your code, try switching:

```python
threads = [
    threading.Thread(target=incr, args=(incrementor, BY_HOW_MUCH))
    for i in range(THREAD_COUNT)
]
```

to:

```python
threads = [
    threading.Thread(target=incr, args=(Incrementor(), BY_HOW_MUCH))
    for i in range(THREAD_COUNT)
]
```

You will then see the maximum possible performance on nogil. Of course, the result will be wrong, but my point here is that in the original code, 3 child threads are simultaneously trying to access the same instance of the Incrementor object, thus modifying its refcount, and then trying to modify the field `self.c`, which also affects the counter -- and all of these are atomic (lock-acquiring and potentially CPU-cache-invalidating) operations.
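A variant that keeps the final result correct while still avoiding the shared hot object is to give each thread its own `Incrementor` and only combine the per-thread counts after `join()` (a sketch of the general per-thread-accumulation pattern, not code from this issue; note that with the GIL removed, a genuinely shared counter would also need a lock to stay correct):

```python
import threading

THREAD_COUNT = 3
BY_HOW_MUCH = 1_000_000

class Incrementor:
    def __init__(self):
        self.c = 0

def incr(incrementor, by_how_much):
    for _ in range(by_how_much):
        incrementor.c += 1

# One private Incrementor per thread: no cross-thread refcount or
# attribute contention during the hot loop.
counters = [Incrementor() for _ in range(THREAD_COUNT)]
threads = [
    threading.Thread(target=incr, args=(inc, BY_HOW_MUCH))
    for inc in counters
]

for t in threads:
    t.start()
for t in threads:
    t.join()

# Combine the per-thread results only once, after all threads finish.
print(sum(inc.c for inc in counters))  # 3000000
```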

About immortalization: yes, immortalization is part of nogil 3.9 (`_Py_REF_IMMORTAL_MASK`), but it does not kick in at the place you mentioned (in my counterexample, 3 threads are freely creating and assigning ints to a per-thread `self.c` field).

colesbury commented 1 year ago

I think @pjurgielewicz's response has covered most of this, but just in case:

1. Yes, this is expected. (The nogil with `PYTHONGIL=1` behavior is a bit surprising, but I think that's just because there are fewer places for the GIL to be released; don't read too much into it.)
2. Depending on the interpreter, `incr()` may touch the reference count of `incrementor`. It definitely does in CPython, because access to local variables requires a `LOAD_FAST`, which copies the local variable to the stack and increments the reference count. To be honest, I can't remember if the nogil-3.9 bytecode works this way, but it doesn't matter because that's not going to get upstreamed.
3. There's also contention on the attribute `c`. You'll have this even if you switch to booleans and immortalize `incrementor`.

The general problem of memory contention is not specific to Python or nogil -- you can see similar behavior when modifying the same variable from many threads in C or C++ -- but it's potentially more of an issue in nogil because there's possible contention on more things:

1. the reference count of `incrementor`
2. the reference count of the numbers themselves
3. the dict lock that protects `incrementor.c`

The provided example is probably not worth optimizing for, but I expect there will be more real-world examples where memory contention is an issue. We have a few possible strategies for mitigating it, including immortalization and deferred reference counting, but how they're applied will depend on the actual context.