faster-cpython / ideas


Investigate why nogil performs poorly on `bench_thread_pool` #586

Open mdboom opened 1 year ago

mdboom commented 1 year ago

Here's the benchmark, which we know is 53% slower on nogil-latest vs. upstream:

https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_concurrent_imap/run_benchmark.py#L19

It looks to be sending lists of ints of length 10 to each thread, and then calling a very simple function on each of the values. It's possible the coordination overhead of passing those lists of ints is dominating over the actual work. As a start, it might be interesting to see what happens as the length of the lists increases, or as the amount of work being done in the function increases.
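To make the suspected overhead concrete, here is a minimal sketch (not the actual benchmark) of the pattern under test: `ThreadPool.imap` hands out `chunksize`-sized slices of the input to worker threads, so with tiny chunks and a trivial per-item function, the pool's queue and lock coordination can dominate the measured time. The `run` helper and its parameters are my own illustration, not part of the benchmark.

```python
from multiprocessing.pool import ThreadPool


def f(x: int) -> int:
    return x  # trivial per-item "work", as in the benchmark


def run(n: int = 1000, chunksize: int = 10) -> int:
    # imap dispatches range(n) to the pool in chunks of `chunksize`;
    # each chunk handoff pays synchronization overhead, so small
    # chunks mean many handoffs relative to the work done.
    with ThreadPool(4) as pool:
        return sum(pool.imap(f, range(n), chunksize))


if __name__ == "__main__":
    print(run())  # sums 0..n-1 across the pool
```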

Cc: @brandtbucher (as one who pointed this out).

gvanrossum commented 1 year ago

(In particular the 53% number comes from Linux.)

mdboom commented 1 year ago

> (In particular the 53% number comes from Linux.)

Yes -- good callout. On Windows, it's only 23% slower (and is more in the middle of other benchmarks, not the slowest outlier like on Linux).

mdboom commented 1 year ago

I modified the benchmark so (a) we test different chunk sizes, and (b) we run with either the "do nothing function" in the current benchmark, or a function that calculates the factorial of 10.

Modified `bm_concurrent_imap/run_benchmark.py`:

```python
"""
Benchmark for concurrent model communication.
"""

import pyperf

from multiprocessing.pool import Pool, ThreadPool


def f(x: int) -> int:
    return x


def fact(x: int) -> int:
    n = 10
    fact = 1
    while n > 0:
        fact = fact * n
        n = n - 1
    return fact


def bench_thread_pool(c: int, n: int, chunk: int) -> None:
    with ThreadPool(c) as pool:
        x = 0
        for y in pool.imap(f, range(n), chunk):
            x += y


def bench_thread_pool_fact(c: int, n: int, chunk: int) -> None:
    with ThreadPool(c) as pool:
        x = 0
        for y in pool.imap(fact, range(n), chunk):
            x += y


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.metadata["description"] = "concurrent model communication benchmark"

    count = 100000
    num_core = 8

    for chunk in (10, 100, 1000, 10000, 100000):
        runner.bench_func(
            f"bench_thread_pool{chunk}",
            bench_thread_pool,
            num_core, count, chunk
        )
    for chunk in (10, 100, 1000, 10000, 100000):
        runner.bench_func(
            f"bench_thread_pool_fact{chunk}",
            bench_thread_pool_fact,
            num_core, count, chunk
        )
```

As chunk size increases, nogil does better, but even at the best chunk size it is still about 40% slower.

When the threads each do actual work, you can see the benefit of nogil: with the right chunk size it's 60% faster, but that advantage seems to tap out as chunk size increases (I don't yet know how to explain that).

[Figure_1: plot of the benchmark results across chunk sizes]

It's fair to say this is not a great benchmark for measuring the effect of the GIL: its intention (I would assume) was to measure the overhead of `ThreadPool.imap`. It's still interesting nonetheless, perhaps, as a significant unintentional regression.