colesbury / nogil-3.12

Multithreaded Python without the GIL (experimental rebase on 3.12)

Slowdown when modifying instance member #11

Open mdboom opened 1 year ago

mdboom commented 1 year ago

Bug report

Using the Fibonacci example from the old nogil README, I can see the time per call decrease as I add threads:

import sys
from concurrent.futures import ThreadPoolExecutor

print(f"nogil={getattr(sys.flags, 'nogil', False)}")

def fib(n):
    if n < 2: return 1
    return fib(n-1) + fib(n-2)

threads = 8
if len(sys.argv) > 1:
    threads = int(sys.argv[1])

with ThreadPoolExecutor(max_workers=threads) as executor:
    for _ in range(threads):
        executor.submit(lambda: print(fib(34)))

$ time ./python nogil_bench.py 1
0.89user 0.00system 0:00.97elapsed 91%CPU
# 0.97s per call
$ time ./python nogil_bench.py 8
11.75user 0.00system 0:01.97elapsed 595%CPU
# 0.24s per call
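
The "per call" comments above appear to be the elapsed wall time divided by the number of submitted calls. A minimal sketch of that arithmetic follows, assuming the `m:ss.ss` elapsed format printed by `time`; the helper names `parse_elapsed` and `per_call` are hypothetical and not part of the benchmark.

```python
def parse_elapsed(text):
    """Convert an elapsed field such as '0:01.97' (minutes:seconds) to seconds."""
    seconds = 0.0
    # Each colon-separated part is worth 60x the one to its right.
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def per_call(elapsed_text, calls):
    """Wall time per submitted call; the benchmark submits one call per thread."""
    return parse_elapsed(elapsed_text) / calls


if __name__ == "__main__":
    # Elapsed values copied from the two runs above.
    for elapsed_text, calls in (("0:00.97", 1), ("0:01.97", 8)):
        print(f"{calls} call(s): {per_call(elapsed_text, calls):.3f}s per call")
```

This measures throughput rather than latency: each individual call still takes longer with 8 threads, but 8 calls complete in about twice the time of one.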

However, when I modify the benchmark to update an instance member, the time per call rises sharply with more threads instead of falling. Note that the instance isn't shared between threads: each call to fib() creates its own instance, so each thread works on a separate object.

import sys
from concurrent.futures import ThreadPoolExecutor

print(f"nogil={getattr(sys.flags, 'nogil', False)}")

class Fibonacci:
    def __init__(self, x):
        self.x = x

    def calculate(self, n):
        # This line doesn't actually matter for the calculation, but this is what
        # causes the nogil threaded performance to drop precipitously.
        self.x += 1

        if n < 2:
            return 1
        return self.calculate(n - 1) + self.calculate(n - 2)

def fib(n):
    f = Fibonacci(1)
    return f.calculate(n)

threads = 8
if len(sys.argv) > 1:
    threads = int(sys.argv[1])

with ThreadPoolExecutor(max_workers=threads) as executor:
    for _ in range(threads):
        executor.submit(lambda: print(fib(34)))

$ time ./python nogil_bench_slow.py 1
2.24user 0.00system 0:02.44elapsed 92%CPU
# 2.44s per call
$ time ./python nogil_bench_slow.py 8
76.39user 150.70system 1:22.25elapsed 276%CPU
# 11.03s per call

Looking at Linux perf, I see that _PyObject_GetInstanceAttribute accounts for 10% of runtime in the slow version and 0.0% in the fast version, so this looks like lock contention when getting an instance attribute.
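
One way to narrow this down further might be to time the attribute update on its own, without the recursion. The sketch below is an assumption about how such a check could look, not something taken from the benchmark: the `Counter` class and the `timed_bump` and `run` helpers are hypothetical names. Each worker gets a private instance, as in the slow benchmark, and times its own loop with `time.perf_counter`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class Counter:
    def __init__(self):
        self.x = 0

    def bump(self, iterations):
        # Same kind of instance-attribute read-modify-write as `self.x += 1` above.
        for _ in range(iterations):
            self.x += 1
        return self.x


def timed_bump(iterations):
    counter = Counter()  # private to this call, never shared between threads
    start = time.perf_counter()
    result = counter.bump(iterations)
    return result, time.perf_counter() - start


def run(threads, iterations=200_000):
    # One task per thread, mirroring the structure of the benchmark.
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(timed_bump, iterations) for _ in range(threads)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        slowest = max(elapsed for _, elapsed in run(threads))
        print(f"threads={threads} slowest worker: {slowest:.3f}s")
```

If the per-worker time stays roughly flat as the thread count grows, the attribute update alone is probably not the bottleneck; if it grows with the thread count, that would be consistent with contention on the attribute path.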

I do not see this pathological behavior on nogil-3.9, so I'm hoping this is just an isolated bug that can be fixed independently. The same benchmark on nogil-3.9:

$ time ./python nogil_bench_slow.py 1
1.50user 0.00system 0:01.63elapsed 92%CPU
# 1.63s per call
$ time ./python nogil_bench_slow.py 8
18.40user 0.01system 0:02.91elapsed 632%CPU
# 0.36s per call
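
To put the three sets of timings on one scale, the sketch below computes a throughput speedup for each: the 8-thread run performs 8 calls, so the speedup is taken as `threads * elapsed_one / elapsed_n`. The elapsed values are copied from the runs above; the `throughput_speedup` helper and the dictionary labels are hypothetical.

```python
def throughput_speedup(elapsed_one, elapsed_n, threads):
    """How many times more calls per second the N-thread run completes.

    The N-thread run does N calls, so running them back to back on one
    thread would take roughly threads * elapsed_one seconds.
    """
    return threads * elapsed_one / elapsed_n


# (elapsed seconds with 1 thread, elapsed seconds with 8 threads)
runs = {
    "nogil-3.12, plain fib": (0.97, 1.97),
    "nogil-3.12, instance member": (2.44, 82.25),
    "nogil-3.9, instance member": (1.63, 2.91),
}

if __name__ == "__main__":
    for label, (elapsed_one, elapsed_eight) in runs.items():
        speedup = throughput_speedup(elapsed_one, elapsed_eight, threads=8)
        print(f"{label}: {speedup:.2f}x")
```

With these figures the speedups come out to roughly 3.9x, 0.24x and 4.5x, so on nogil-3.12 the instance-member version with 8 threads is about four times slower than running the same calls serially.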

Please ignore the fact that the self.x += 1 line is meaningless for calculating Fibonacci numbers. This is my attempt at reducing pyperformance's raytrace benchmark to a more minimal example. I'm sure you agree that modifying instance members is a pretty common thing to do. :)

Your environment

OS: Debian Buster
CPU: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz