RelationalAI-oss / MultithreadingBenchmarks.jl

MIT License
5 stars 0 forks source link

[WIP] One thread allocating parallelism scaling experiment #4

Open NHDaly opened 5 years ago

NHDaly commented 5 years ago

In this experiment, we are trying to measure the impact that one thread repeatedly allocating garbage (and running GC) would have on the parallelism performance scaling if all the rest of the threads are independent, non-allocating, non-contentious tasks.

To achieve this, my first several (failed) approaches were to try to have some percentage of the work perform allocations if it is running on a particular thread. This turned out to be problematic for several reasons, but the main reason is:

To solve this, I rewrote this benchmark to have one thread fully occupied running a task in the background that infinite loops constantly allocating garbage. Then I run all benchmark experiments, making sure to record that there was one fewer threads available for the benchmark than we passed in the parameter to Julia.

This allows us to measure the effect that the garbage collector has on "simple" CPU-bound work.

This experiment lives here: src/bench/onethread_alloc_garbage.jl

Tracking issue: #1

NHDaly commented 5 years ago

Results from benchmark:

Full output from 96 vCPU machine (r5d.24xlarge) at 7ff486, running "simple_independent" and "onethread_alloc_garbage": https://gist.github.com/NHDaly/88440458ff828107ebe33b49ee646d5b

Speedup plot per thread: onethread_alloc_garbage-speedup_plot-96vCPU The speedup per thread is nearly linear, as in the "independent tasks" experiment.

You'll notice that the number of threads is 1 fewer than in the "independent tasks" experiment, and that's because one of the threads is always fully-occupied running the allocation infinite-loop in the background, so i subtracted it from the totals.

However, one important thing to notice is that the absolute time is significantly slower than the absolute time for the "independent tasks" experiment, which surprised me. I expected the absolute time to be approximately the same, because the work is cpu-only with no preemption points, so it shouldn't be slowed down by GC. As I'll discuss below, it turns out that this isn't true, because task switching is a GC safepoint, and so even though each individual query wasn't slowed down by the background task, the overall test was.

Here are the numbers from this experiment:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running benchmark for /tmp/nix-build-multithreading.drv-0/multithreading/.julia/packages/MultithreadingBenchmarks/Lseig/src/bench/onethread_alloc_garbage.jl
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nthreads_to_test = [2, 10, 19, 29, 38, 48, 58, 67, 77, 86, 96]
nqueries = 1000
num_ops = 100000000
...
Results (omitted printing latencies column):
11×6 DataFrames.DataFrame
│ Row │ nthreads │ time_secs │ allocs  │ memory    │ gctime_secs │ thread_counts                                                                                                                                                                                                                                                                                                                                                                                   │
│     │ Int64    │ Float64   │ Int64   │ Int64     │ Float64     │ Array{Int64,1}                                                                                                                                                                                                                                                                                                                                                                                  │
├─────┼──────────┼───────────┼─────────┼───────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ 1        │ 80.0484   │ 1546581 │ 115675728 │ 50.7427     │ [1000, 0]                                                                                                                                                                                                                                                                                                                                                                                       │
│ 2   │ 9        │ 8.42852   │ 180155  │ 13677008  │ 4.87222     │ [110, 111, 0, 111, 112, 112, 111, 111, 111, 111]                                                                                                                                                                                                                                                                                                                                                │
│ 3   │ 18       │ 4.80347   │ 94020   │ 7246384   │ 2.97226     │ [55, 55, 0, 55, 56, 56, 56, 55, 55, 55, 55, 56, 56, 56, 56, 56, 55, 56, 56]                                                                                                                                                                                                                                                                                                                     │
│ 4   │ 28       │ 3.16551   │ 63256   │ 4949728   │ 2.00855     │ [35, 36, 36, 0, 36, 36, 36, 36, 35, 36, 36, 36, 35, 36, 36, 35, 36, 36, 36, 35, 36, 36, 35, 36, 36, 36, 35, 35, 36]                                                                                                                                                                                                                                                                             │
│ 5   │ 37       │ 2.19969   │ 50938   │ 4030320   │ 1.30392     │ [26, 27, 27, 0, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 27, 27]                                                                                                                                                                                                                                         │
│ 6   │ 47       │ 1.81857   │ 41788   │ 3346576   │ 1.0913      │ [21, 21, 22, 0, 21, 21, 22, 21, 22, 21, 21, 21, 22, 21, 21, 21, 21, 21, 21, 21, 22, 22, 21, 21, 22, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 21, 21, 22, 21, 21, 22, 22, 22, 21, 21, 21]                                                                                                                                                                                                 │
│ 7   │ 57       │ 2.32832   │ 35611   │ 2886352   │ 1.14424     │ [17, 17, 17, 0, 18, 17, 18, 17, 18, 17, 17, 17, 18, 17, 17, 18, 18, 18, 18, 17, 18, 18, 18, 18, 17, 18, 18, 17, 17, 17, 17, 18, 17, 17, 18, 18, 18, 17, 18, 17, 18, 18, 17, 17, 18, 18, 18, 18, 18, 18, 17, 18, 17, 17, 18, 18, 17, 18]                                                                                                                                                         │
│ 8   │ 66       │ 2.09347   │ 32461   │ 2651120   │ 1.03816     │ [14, 15, 15, 0, 15, 15, 16, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 16, 16, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 16, 15, 15, 15, 15, 16, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]                                                                                                                     │
│ 9   │ 76       │ 1.88307   │ 29444   │ 2426224   │ 0.957702    │ [12, 13, 13, 14, 13, 14, 13, 13, 0, 13, 13, 13, 14, 13, 14, 13, 13, 13, 13, 14, 13, 13, 13, 14, 13, 13, 13, 13, 14, 13, 13, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 13, 14, 13, 14, 13, 13, 13, 14, 13, 13, 13, 13, 13, 13, 13, 13]                                                                             │
│ 10  │ 85       │ 1.69875   │ 26406   │ 2199472   │ 0.906481    │ [11, 0, 12, 12, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 12, 12, 12, 12, 12, 11, 12, 12, 11, 12, 12, 12, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 12, 11, 11, 12, 12, 12, 12, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 12, 12, 12, 12, 12, 11, 12, 11, 12, 12, 11, 12, 11, 11, 12, 12, 12, 12, 12, 12, 12, 11, 12, 11, 12]                                         │
│ 11  │ 95       │ 1.73209   │ 24843   │ 2082752   │ 0.989978    │ [10, 11, 10, 11, 0, 10, 11, 11, 10, 11, 10, 10, 10, 10, 11, 11, 11, 11, 11, 10, 10, 10, 11, 10, 10, 11, 11, 11, 11, 11, 11, 11, 10, 10, 11, 11, 11, 11, 11, 10, 11, 11, 10, 10, 10, 11, 11, 10, 10, 10, 11, 10, 10, 10, 10, 11, 11, 10, 11, 11, 10, 11, 11, 10, 10, 11, 10, 10, 10, 11, 11, 10, 11, 11, 11, 10, 11, 10, 10, 10, 10, 11, 10, 11, 11, 11, 10, 11, 11, 11, 11, 11, 10, 10, 10, 10] │
Processed:
11×6 DataFrames.DataFrame
│ Row │ nthreads │ time_secs │ speedup_factor │ marginal_speedup │ utilization │ latency_quantiles_ms                                                  │
│     │ Int64    │ Float64   │ Float64        │ Float64          │ Float64     │ Tuple…                                                                │
├─────┼──────────┼───────────┼────────────────┼──────────────────┼─────────────┼───────────────────────────────────────────────────────────────────────┤
│ 1   │ 1        │ 80.0484   │ 1.0            │ 1.0              │ 1.0         │ (0.1=>29.1989, 0.5=>29.2435, 0.9=>29.2631, 0.99=>30.3812, 1=>31.7797) │
│ 2   │ 9        │ 8.42852   │ 9.49733        │ 1.06217          │ 1.05526     │ (0.1=>31.6551, 0.5=>31.6647, 0.9=>31.7113, 0.99=>31.7337, 1=>31.8091) │
│ 3   │ 18       │ 4.80347   │ 16.6647        │ 0.796374         │ 0.925816    │ (0.1=>31.6531, 0.5=>31.6616, 0.9=>31.7233, 0.99=>32.0455, 1=>32.1471) │
│ 4   │ 28       │ 3.16551   │ 25.2877        │ 0.862301         │ 0.903132    │ (0.1=>31.6687, 0.5=>31.679, 0.9=>31.746, 0.99=>32.1956, 1=>32.5385)   │
│ 5   │ 37       │ 2.19969   │ 36.3908        │ 1.23368          │ 0.983535    │ (0.1=>31.6839, 0.5=>31.7069, 0.9=>31.7723, 0.99=>31.8479, 1=>31.8632) │
│ 6   │ 47       │ 1.81857   │ 44.0171        │ 0.762631         │ 0.936534    │ (0.1=>32.5251, 0.5=>32.5366, 0.9=>32.5482, 0.99=>32.8122, 1=>32.841)  │
│ 7   │ 57       │ 2.32832   │ 34.3804        │ -0.963674        │ 0.603164    │ (0.1=>32.5371, 0.5=>32.5615, 0.9=>65.1985, 0.99=>65.6788, 1=>65.7018) │
│ 8   │ 66       │ 2.09347   │ 38.2372        │ 0.428536         │ 0.579351    │ (0.1=>32.5382, 0.5=>65.179, 0.9=>65.2023, 0.99=>65.5688, 1=>65.6052)  │
│ 9   │ 76       │ 1.88307   │ 42.5094        │ 0.427225         │ 0.559335    │ (0.1=>32.5307, 0.5=>65.175, 0.9=>65.1907, 0.99=>65.6591, 1=>65.6833)  │
│ 10  │ 85       │ 1.69875   │ 47.122         │ 0.512509         │ 0.554377    │ (0.1=>32.5847, 0.5=>65.1401, 0.9=>65.1681, 0.99=>65.5336, 1=>65.5616) │
│ 11  │ 95       │ 1.73209   │ 46.2149        │ -0.0907092       │ 0.486473    │ (0.1=>65.1665, 0.5=>65.1854, 0.9=>65.2077, 0.99=>65.6888, 1=>79.9795) │

Compared with the results from this run of the "independent tasks" experiment, of the same size:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running benchmark for /tmp/nix-build-multithreading.drv-0/multithreading/.julia/packages/MultithreadingBenchmarks/Lseig/src/bench/simple_independent.jl
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------ TEST PARAMETERS ------
nthreads_to_test = [1, 10, 19, 29, 38, 48, 58, 67, 77, 86, 96]
nqueries = 1000
num_ops = 100000000
...
Results (omitted printing latencies column):
11×6 DataFrames.DataFrame
│ Row │ nthreads │ time_secs │ allocs │ memory │ gctime_secs │ thread_counts                                                                                                                                                                                                                                                                                                                                                                                    │
│     │ Int64    │ Float64   │ Int64  │ Int64  │ Float64     │ Array{Int64,1}                                                                                                                                                                                                                                                                                                                                                                                   │
├─────┼──────────┼───────────┼────────┼────────┼─────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ 1        │ 29.2143   │ 7507   │ 817296 │ 0.0         │ [1000]                                                                                                                                                                                                                                                                                                                                                                                           │
│ 2   │ 10       │ 3.16927   │ 8024   │ 825696 │ 0.0         │ [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]                                                                                                                                                                                                                                                                                                                                               │
│ 3   │ 19       │ 1.67866   │ 8025   │ 825872 │ 0.0         │ [52, 53, 53, 53, 52, 53, 52, 53, 53, 53, 53, 52, 53, 52, 53, 53, 52, 53, 52]                                                                                                                                                                                                                                                                                                                     │
│ 4   │ 29       │ 1.10927   │ 8025   │ 826064 │ 0.0         │ [34, 35, 35, 34, 35, 34, 35, 35, 35, 34, 35, 35, 35, 34, 34, 35, 34, 34, 35, 34, 34, 35, 34, 34, 35, 34, 35, 34, 34]                                                                                                                                                                                                                                                                             │
│ 5   │ 38       │ 0.856075  │ 8023   │ 826160 │ 0.0         │ [26, 26, 27, 27, 27, 26, 27, 26, 27, 26, 26, 26, 26, 26, 26, 26, 27, 27, 26, 26, 26, 26, 27, 26, 26, 27, 27, 26, 26, 26, 26, 26, 27, 26, 26, 27, 26, 26]                                                                                                                                                                                                                                         │
│ 6   │ 48       │ 0.716421  │ 8040   │ 826624 │ 0.0         │ [21, 22, 22, 22, 21, 22, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 21, 21, 21, 21, 21, 21, 21, 22, 22, 21, 21, 21, 21, 11, 21, 21, 21, 21, 22, 21, 21, 21, 22, 11, 21, 22, 22, 22, 21]                                                                                                                                                                                                 │
│ 7   │ 58       │ 0.686355  │ 8023   │ 826448 │ 0.0         │ [10, 21, 21, 10, 10, 21, 21, 21, 10, 11, 21, 21, 21, 21, 21, 21, 10, 21, 21, 21, 21, 21, 21, 21, 11, 21, 10, 21, 21, 10, 10, 10, 10, 20, 21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 21, 21, 21, 21, 10, 11, 10, 21, 21, 10, 21, 10, 21, 10]                                                                                                                                                         │
│ 8   │ 67       │ 0.711044  │ 8023   │ 826608 │ 0.0         │ [10, 10, 10, 21, 21, 10, 10, 11, 10, 10, 10, 21, 11, 20, 10, 11, 21, 11, 11, 21, 21, 20, 21, 20, 21, 11, 20, 21, 21, 20, 21, 11, 21, 21, 20, 21, 20, 21, 11, 10, 11, 11, 20, 11, 11, 20, 21, 21, 21, 21, 11, 11, 10, 10, 10, 11, 10, 21, 10, 11, 10, 11, 11, 10, 10, 11, 11]                                                                                                                     │
│ 9   │ 77       │ 0.715417  │ 8022   │ 826816 │ 0.0         │ [10, 10, 11, 11, 11, 21, 11, 10, 10, 21, 21, 21, 10, 21, 11, 21, 21, 21, 11, 21, 11, 10, 11, 10, 10, 10, 21, 21, 10, 11, 10, 10, 10, 10, 21, 10, 21, 10, 21, 10, 11, 10, 10, 10, 11, 21, 10, 10, 10, 10, 11, 21, 21, 11, 21, 10, 10, 20, 10, 10, 11, 10, 10, 10, 11, 10, 11, 10, 10, 11, 11, 11, 11, 10, 10, 11, 11]                                                                             │
│ 10  │ 86       │ 0.716089  │ 8023   │ 826992 │ 0.0         │ [10, 10, 10, 11, 11, 21, 11, 10, 11, 11, 10, 10, 10, 10, 21, 21, 21, 21, 21, 10, 11, 10, 10, 10, 11, 11, 10, 11, 21, 11, 10, 10, 11, 10, 11, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 10, 10, 21, 10, 11, 10, 11, 11, 11, 10, 10, 10, 10, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 11, 10, 11, 11, 11, 10, 21, 11, 11, 10, 10, 10, 10, 20, 11, 11, 11, 10]                                         │
│ 11  │ 96       │ 0.716366  │ 8029   │ 827248 │ 0.0         │ [10, 10, 10, 10, 11, 10, 11, 11, 10, 11, 11, 10, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 10, 10, 10, 10, 10, 11, 10, 10, 10, 10, 10, 10, 10, 11, 10, 11, 11, 11, 10, 10, 10, 11, 10, 10, 11, 10, 10, 11, 10, 10, 10, 11, 10, 10, 11, 11, 11, 10, 10, 10, 11, 11, 11, 10, 10, 11, 10, 11, 10, 11, 11, 11, 10, 10, 11, 11, 10, 10, 11, 10, 10, 10, 10, 10, 11, 11, 11, 11, 10, 10, 10, 10, 10] │
Processed:
11×6 DataFrames.DataFrame
│ Row │ nthreads │ time_secs │ speedup_factor │ marginal_speedup │ utilization │ latency_quantiles_ms                                                  │
│     │ Int64    │ Float64   │ Float64        │ Float64          │ Float64     │ Tuple…                                                                │
├─────┼──────────┼───────────┼────────────────┼──────────────────┼─────────────┼───────────────────────────────────────────────────────────────────────┤
│ 1   │ 1        │ 29.2143   │ 1.0            │ 1.0              │ 1.0         │ (0.1=>29.1484, 0.5=>29.2106, 0.9=>29.2312, 0.99=>29.4878, 1=>30.308)  │
│ 2   │ 10       │ 3.16927   │ 9.21799        │ 0.91311          │ 0.921799    │ (0.1=>31.6552, 0.5=>31.6656, 0.9=>31.71, 0.99=>31.7365, 1=>31.8281)   │
│ 3   │ 19       │ 1.67866   │ 17.4034        │ 0.909487         │ 0.915967    │ (0.1=>31.649, 0.5=>31.657, 0.9=>31.7173, 0.99=>31.7328, 1=>31.7499)   │
│ 4   │ 29       │ 1.10927   │ 26.3365        │ 0.893316         │ 0.908156    │ (0.1=>31.6773, 0.5=>31.6879, 0.9=>31.7502, 0.99=>32.0896, 1=>44.0836) │
│ 5   │ 38       │ 0.856075  │ 34.1258        │ 0.865479         │ 0.898048    │ (0.1=>31.6689, 0.5=>31.6894, 0.9=>31.7598, 0.99=>32.1105, 1=>32.1424) │
│ 6   │ 48       │ 0.716421  │ 40.7781        │ 0.665223         │ 0.849543    │ (0.1=>32.5396, 0.5=>32.5513, 0.9=>32.5761, 0.99=>65.1689, 1=>65.1898) │
│ 7   │ 58       │ 0.686355  │ 42.5644        │ 0.178629         │ 0.733868    │ (0.1=>32.5143, 0.5=>32.5291, 0.9=>65.1312, 0.99=>65.3504, 1=>65.3945) │
│ 8   │ 67       │ 0.711044  │ 41.0864        │ -0.164214        │ 0.61323     │ (0.1=>32.511, 0.5=>32.5417, 0.9=>65.1564, 0.99=>65.4163, 1=>65.4925)  │
│ 9   │ 77       │ 0.715417  │ 40.8353        │ -0.0251139       │ 0.530329    │ (0.1=>32.487, 0.5=>65.1362, 0.9=>65.1737, 0.99=>65.6643, 1=>65.6945)  │
│ 10  │ 86       │ 0.716089  │ 40.797         │ -0.00425995      │ 0.474383    │ (0.1=>32.4827, 0.5=>65.1294, 0.9=>65.1463, 0.99=>65.5079, 1=>65.5288) │
│ 11  │ 96       │ 0.716366  │ 40.7812        │ -0.00157556      │ 0.424804    │ (0.1=>65.135, 0.5=>65.1488, 0.9=>65.1706, 0.99=>65.6776, 1=>65.7385)  │

The important things to notice are:


All of the above information is consistent with what we'd expect given the current implementation of the GC.

Julia's GC is a "freeze the world" GC, which means that no one is allowed to be executing julia code in any thread while we run garbage collection. This means that if any one thread tries to run GC, it waits until all threads have yielded. In our case, this means that while each query is able to run to completion without being affected, we don't schedule new queries until all currently-executing queries have finished and then we run and finish GC.

Also, it's worth noting that currently julia doesn't count the time spent waiting for threads to finish in the reported gc_time, so the above timing is actual GC time that occurred in-between queries.


The other graphs we produced are here: onethread_alloc_garbage-abs_time_plot-96vCPUs onethread_alloc_garbage-thread_counts-96vCPU

The thread-counts plot is neat because you can clearly see that the work was still distributed evenly across all threads, and the background thread (always thread #3, it seems) never runs any work.

NHDaly commented 5 years ago

I've opened https://github.com/JuliaLang/julia/issues/33097 to discuss these results and brainstorm ways to mitigate the GC impact.

In this benchmark, almost all the slowdown was simply caused by the actual time it took to do GC, running on the critical path. Making GC faster (e.g. by multithreading the GC implementation) would help in our case.

If the benchmark had had longer-running tasks or especially variable-length tasks, we may have also started to see the costs from the forced synchronization-points contribute to the runtime as well. I have reproduced this in some local experiments on my machine. For that problem, adding GC safepoints (exposed in https://github.com/JuliaLang/julia/pull/33092) to the long-running cpu-bound tasks would help.

The biggest help would be to move GC off of our critical path, by somehow marking the cpu-bound region as GC-safe, and allowing GC to proceed in the background, but this was judged to be difficult to support, at least in the short-term.