RSalman opened this issue 4 years ago
Another thing to consider is the time it takes to notify threads. Currently this depends on the size of the pool, since notification is done with `notify_all`. That can be costly if we only want to use a small subset of the pool: waking up 2 threads vs. 64 threads costs the same, so the time is expected to be roughly constant. This can be seen by measuring the time to notify (in microseconds) for various thread counts:
```
------ Thread Count: 64 -> Time to Notify: 212 ------
------ Thread Count: 63 -> Time to Notify: 189 ------
------ Thread Count: 62 -> Time to Notify: 189 ------
------ Thread Count: 61 -> Time to Notify: 211 ------
------ Thread Count: 60 -> Time to Notify: 242 ------
------ Thread Count: 59 -> Time to Notify: 261 ------
------ Thread Count: 58 -> Time to Notify: 269 ------
------ Thread Count: 57 -> Time to Notify: 249 ------
------ Thread Count: 56 -> Time to Notify: 253 ------
...
------ Thread Count: 10 -> Time to Notify: 252 ------
------ Thread Count: 9 -> Time to Notify: 253 ------
------ Thread Count: 8 -> Time to Notify: 255 ------
------ Thread Count: 7 -> Time to Notify: 254 ------
------ Thread Count: 6 -> Time to Notify: 254 ------
------ Thread Count: 5 -> Time to Notify: 258 ------
------ Thread Count: 4 -> Time to Notify: 253 ------
------ Thread Count: 3 -> Time to Notify: 249 ------
------ Thread Count: 2 -> Time to Notify: 254 ------
------ Thread Count: 1 -> Time to Notify: 260 ------
```
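For context, the flat cost of a broadcast is easy to reproduce outside OMR. Below is a minimal standalone sketch (my own illustration, not the harness used for the numbers above; it uses `std::condition_variable` instead of omrthread monitors and only times the `notify_all` call itself), which is enough to show that the cost does not depend on how many of the waiters we actually intend to use:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main()
{
	const int poolSize = 64; /* stand-in for the worker thread pool */
	std::mutex m;
	std::condition_variable cv;
	bool go = false;

	/* Park a pool of threads on the condition variable. */
	std::vector<std::thread> pool;
	for (int i = 0; i < poolSize; i++) {
		pool.emplace_back([&] {
			std::unique_lock<std::mutex> lock(m);
			cv.wait(lock, [&] { return go; });
		});
	}

	/* Give the workers time to reach the wait. */
	std::this_thread::sleep_for(std::chrono::milliseconds(100));
	{
		std::lock_guard<std::mutex> lock(m);
		go = true;
	}

	/* The broadcast cost is paid here, regardless of how many waiters we need. */
	auto start = std::chrono::steady_clock::now();
	cv.notify_all();
	auto end = std::chrono::steady_clock::now();

	std::printf("notify_all took %lld ns\n",
	            (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());

	for (auto &t : pool) {
		t.join();
	}
	return 0;
}
```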
If we do a notify for each thread:
```cpp
for (uintptr_t threads = 0; threads < count; threads++) {
	/* wake exactly one waiting worker per selected thread */
	omrthread_monitor_notify(_slaveThreadMutex);
}
```
This results in the following:
```
------ Thread Count: 60 -> Time to Notify: 643 ------
------ Thread Count: 59 -> Time to Notify: 653 ------
------ Thread Count: 58 -> Time to Notify: 643 ------
------ Thread Count: 57 -> Time to Notify: 611 ------
------ Thread Count: 56 -> Time to Notify: 576 ------
------ Thread Count: 55 -> Time to Notify: 548 ------
------ Thread Count: 54 -> Time to Notify: 672 ------
------ Thread Count: 53 -> Time to Notify: 804 ------
------ Thread Count: 52 -> Time to Notify: 578 ------
...
------ Thread Count: 14 -> Time to Notify: 134 ------
------ Thread Count: 13 -> Time to Notify: 128 ------
------ Thread Count: 12 -> Time to Notify: 133 ------
------ Thread Count: 11 -> Time to Notify: 105 ------
------ Thread Count: 10 -> Time to Notify: 100 ------
------ Thread Count: 9 -> Time to Notify: 77 ------
------ Thread Count: 8 -> Time to Notify: 69 ------
------ Thread Count: 7 -> Time to Notify: 57 ------
------ Thread Count: 6 -> Time to Notify: 46 ------
------ Thread Count: 5 -> Time to Notify: 36 ------
------ Thread Count: 4 -> Time to Notify: 24 ------
------ Thread Count: 3 -> Time to Notify: 15 ------
------ Thread Count: 2 -> Time to Notify: 8 ------
------ Thread Count: 1 -> Time to Notify: 0 ------
```
This approach is better when waking up fewer than ~20 threads, but it is worse with more. At high thread counts we see notification times of up to 500-650us, which decrease as the thread count decreases. We only reach times similar to `notify_all` once the thread count is cut roughly in half (~32 of 64 threads). The clear advantage is below ~20 threads, where times drop under 100us.
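Given those numbers, one option (a sketch of my own, not part of any patch; it reuses `count` and `_slaveThreadMutex` from the snippet above, and the cutoff of 20 is just the crossover suggested by this one machine's data) would be a hybrid that picks the cheaper notification strategy per dispatch:

```cpp
/* Hypothetical hybrid: per-thread notify is cheaper for small subsets,
 * notify_all is cheaper once a large fraction of the pool is needed. */
#define NOTIFY_ALL_THRESHOLD 20 /* assumed crossover point from the data above */

if (count < NOTIFY_ALL_THRESHOLD) {
	for (uintptr_t threads = 0; threads < count; threads++) {
		omrthread_monitor_notify(_slaveThreadMutex);
	}
} else {
	omrthread_monitor_notify_all(_slaveThreadMutex);
}
```

The right cutoff would need to be measured per platform, since the crossover here comes from a single machine.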
Currently, the thread tables are updated to select the threads before the threads are woken up. With the current implementation, we obtain the following average times to start the task (start collection) and times to notify:
```
Thread Count: 48 -> Time to Notify: 302 -> Average Time to Start Collection: 539
Thread Count: 47 -> Time to Notify: 294 -> Average Time to Start Collection: 512
Thread Count: 46 -> Time to Notify: 298 -> Average Time to Start Collection: 546
Thread Count: 45 -> Time to Notify: 296 -> Average Time to Start Collection: 500
Thread Count: 44 -> Time to Notify: 293 -> Average Time to Start Collection: 472
Thread Count: 43 -> Time to Notify: 291 -> Average Time to Start Collection: 1011
Thread Count: 42 -> Time to Notify: 291 -> Average Time to Start Collection: 470
Thread Count: 41 -> Time to Notify: 294 -> Average Time to Start Collection: 467
Thread Count: 40 -> Time to Notify: 287 -> Average Time to Start Collection: 972
Thread Count: 39 -> Time to Notify: 293 -> Average Time to Start Collection: 806
Thread Count: 38 -> Time to Notify: 291 -> Average Time to Start Collection: 988
Thread Count: 37 -> Time to Notify: 309 -> Average Time to Start Collection: 657
Thread Count: 36 -> Time to Notify: 301 -> Average Time to Start Collection: 970
Thread Count: 35 -> Time to Notify: 293 -> Average Time to Start Collection: 936
Thread Count: 34 -> Time to Notify: 292 -> Average Time to Start Collection: 995
Thread Count: 33 -> Time to Notify: 293 -> Average Time to Start Collection: 1093
Thread Count: 32 -> Time to Notify: 323 -> Average Time to Start Collection: 978
Thread Count: 31 -> Time to Notify: 306 -> Average Time to Start Collection: 1024
Thread Count: 30 -> Time to Notify: 309 -> Average Time to Start Collection: 967
Thread Count: 29 -> Time to Notify: 311 -> Average Time to Start Collection: 1039
Thread Count: 28 -> Time to Notify: 313 -> Average Time to Start Collection: 1019
Thread Count: 27 -> Time to Notify: 313 -> Average Time to Start Collection: 988
Thread Count: 26 -> Time to Notify: 305 -> Average Time to Start Collection: 1015
Thread Count: 25 -> Time to Notify: 295 -> Average Time to Start Collection: 1054
Thread Count: 24 -> Time to Notify: 296 -> Average Time to Start Collection: 819
Thread Count: 23 -> Time to Notify: 312 -> Average Time to Start Collection: 926
Thread Count: 22 -> Time to Notify: 296 -> Average Time to Start Collection: 919
Thread Count: 21 -> Time to Notify: 289 -> Average Time to Start Collection: 1034
Thread Count: 20 -> Time to Notify: 289 -> Average Time to Start Collection: 1068
Thread Count: 19 -> Time to Notify: 293 -> Average Time to Start Collection: 1053
Thread Count: 18 -> Time to Notify: 300 -> Average Time to Start Collection: 1096
Thread Count: 17 -> Time to Notify: 302 -> Average Time to Start Collection: 910
Thread Count: 16 -> Time to Notify: 308 -> Average Time to Start Collection: 1010
Thread Count: 15 -> Time to Notify: 298 -> Average Time to Start Collection: 968
Thread Count: 14 -> Time to Notify: 301 -> Average Time to Start Collection: 1092
Thread Count: 13 -> Time to Notify: 301 -> Average Time to Start Collection: 908
Thread Count: 12 -> Time to Notify: 299 -> Average Time to Start Collection: 902
Thread Count: 11 -> Time to Notify: 299 -> Average Time to Start Collection: 888
Thread Count: 10 -> Time to Notify: 296 -> Average Time to Start Collection: 926
Thread Count: 9 -> Time to Notify: 291 -> Average Time to Start Collection: 723
Thread Count: 8 -> Time to Notify: 290 -> Average Time to Start Collection: 945
Thread Count: 7 -> Time to Notify: 296 -> Average Time to Start Collection: 1023
Thread Count: 6 -> Time to Notify: 324 -> Average Time to Start Collection: 1102
Thread Count: 5 -> Time to Notify: 301 -> Average Time to Start Collection: 993
Thread Count: 4 -> Time to Notify: 311 -> Average Time to Start Collection: 569
Thread Count: 3 -> Time to Notify: 324 -> Average Time to Start Collection: 1062
Thread Count: 2 -> Time to Notify: 310 -> Average Time to Start Collection: 601
Thread Count: 1 -> Time to Notify: 308 -> Average Time to Start Collection: 644
```
To get around delays in selected threads starting the task (and contention between selected and unselected threads), we can use threads in the order in which they become available (wake up / reacquire the mutex) and forgo selecting threads beforehand (a sketch of this scheme follows the results below). With this, we obtain the following results:
```
Thread Count: 48 -> Time to Notify: 288 -> Average Time to Start Collection: 470
Thread Count: 47 -> Time to Notify: 300 -> Average Time to Start Collection: 498
Thread Count: 46 -> Time to Notify: 281 -> Average Time to Start Collection: 497
Thread Count: 45 -> Time to Notify: 284 -> Average Time to Start Collection: 439
Thread Count: 44 -> Time to Notify: 293 -> Average Time to Start Collection: 446
Thread Count: 43 -> Time to Notify: 283 -> Average Time to Start Collection: 520
Thread Count: 42 -> Time to Notify: 280 -> Average Time to Start Collection: 425
Thread Count: 41 -> Time to Notify: 283 -> Average Time to Start Collection: 434
Thread Count: 40 -> Time to Notify: 282 -> Average Time to Start Collection: 432
Thread Count: 39 -> Time to Notify: 283 -> Average Time to Start Collection: 412
Thread Count: 38 -> Time to Notify: 291 -> Average Time to Start Collection: 470
Thread Count: 37 -> Time to Notify: 285 -> Average Time to Start Collection: 501
Thread Count: 36 -> Time to Notify: 282 -> Average Time to Start Collection: 427
Thread Count: 35 -> Time to Notify: 280 -> Average Time to Start Collection: 401
Thread Count: 34 -> Time to Notify: 293 -> Average Time to Start Collection: 392
Thread Count: 33 -> Time to Notify: 283 -> Average Time to Start Collection: 377
Thread Count: 32 -> Time to Notify: 289 -> Average Time to Start Collection: 443
Thread Count: 31 -> Time to Notify: 281 -> Average Time to Start Collection: 419
Thread Count: 30 -> Time to Notify: 295 -> Average Time to Start Collection: 462
Thread Count: 29 -> Time to Notify: 282 -> Average Time to Start Collection: 408
Thread Count: 28 -> Time to Notify: 284 -> Average Time to Start Collection: 431
Thread Count: 27 -> Time to Notify: 294 -> Average Time to Start Collection: 430
Thread Count: 26 -> Time to Notify: 280 -> Average Time to Start Collection: 389
Thread Count: 25 -> Time to Notify: 281 -> Average Time to Start Collection: 367
Thread Count: 24 -> Time to Notify: 281 -> Average Time to Start Collection: 362
Thread Count: 23 -> Time to Notify: 280 -> Average Time to Start Collection: 377
Thread Count: 22 -> Time to Notify: 277 -> Average Time to Start Collection: 360
Thread Count: 21 -> Time to Notify: 281 -> Average Time to Start Collection: 366
Thread Count: 20 -> Time to Notify: 283 -> Average Time to Start Collection: 377
Thread Count: 19 -> Time to Notify: 282 -> Average Time to Start Collection: 470
Thread Count: 18 -> Time to Notify: 282 -> Average Time to Start Collection: 376
Thread Count: 17 -> Time to Notify: 279 -> Average Time to Start Collection: 342
Thread Count: 16 -> Time to Notify: 277 -> Average Time to Start Collection: 391
Thread Count: 15 -> Time to Notify: 273 -> Average Time to Start Collection: 368
Thread Count: 14 -> Time to Notify: 293 -> Average Time to Start Collection: 341
Thread Count: 13 -> Time to Notify: 286 -> Average Time to Start Collection: 388
Thread Count: 12 -> Time to Notify: 278 -> Average Time to Start Collection: 376
Thread Count: 11 -> Time to Notify: 276 -> Average Time to Start Collection: 315
Thread Count: 10 -> Time to Notify: 277 -> Average Time to Start Collection: 369
Thread Count: 9 -> Time to Notify: 295 -> Average Time to Start Collection: 440
Thread Count: 8 -> Time to Notify: 279 -> Average Time to Start Collection: 313
Thread Count: 7 -> Time to Notify: 284 -> Average Time to Start Collection: 395
Thread Count: 6 -> Time to Notify: 281 -> Average Time to Start Collection: 408
Thread Count: 5 -> Time to Notify: 311 -> Average Time to Start Collection: 473
Thread Count: 4 -> Time to Notify: 287 -> Average Time to Start Collection: 309
Thread Count: 3 -> Time to Notify: 284 -> Average Time to Start Collection: 395
Thread Count: 2 -> Time to Notify: 284 -> Average Time to Start Collection: 456
Thread Count: 1 -> Time to Notify: 287 -> Average Time to Start Collection: 309
```
With this approach, we see significantly lower average times to start collection.
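For illustration, wake-up-order activation might look something like the following (a minimal sketch under assumed names, not the actual ParallelDispatcher code): workers atomically claim a slot as they wake, and whoever fails to claim one goes back to sleep.

```cpp
#include <atomic>
#include <cstdint>

/* Set to the requested thread count when the task is dispatched. */
std::atomic<uintptr_t> activeSlots;

/* Called by each worker after it wakes up and reacquires the monitor.
 * Returns true if this thread claimed one of the task's slots. */
bool tryActivate()
{
	uintptr_t remaining = activeSlots.load();
	while (remaining > 0) {
		/* On failure, compare_exchange_weak reloads `remaining` for the retry. */
		if (activeSlots.compare_exchange_weak(remaining, remaining - 1)) {
			return true; /* activated, in wake-up order */
		}
	}
	return false; /* task fully staffed; return to waiting */
}
```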
fyi @amicic
If we activate the first Worker threads that wake up at initial dispatch, does that mean their IDs will not be contiguous?
Not sure if that's important, other than for easier debugging, but if we want them to be contiguous that's probably doable by adding an extra table in the Dispatcher that maps their ID from the pool of Workers into the (contiguous set of) IDs of active Worker threads.
That mapping would be initialized as the threads wake up.
> If we activate the first Worker threads that wake up at initial dispatch, does that mean their IDs will not be contiguous?
This is true; it depends on the order in which they wake up.
> Not sure if that's important, other than for easier debugging
It's only noticeable for debugging purposes. For instance, Tgc:parallel logs can look something like this (using 8 of 48 threads):
```
Scavenger parallel and progress stats, timestamp="2020-06-23T16:16:27.900"
  gc  thrID   busy    stall   ....
            (micros) (micros) ....
SCV.T 14  0    308     196    ....
SCV.T 14  3    340     164    ....
SCV.T 14  4    299     205    ....
SCV.T 14  6    299     205    ....
SCV.T 14 43    299     205    ....
SCV.T 14 44    407      97    ....
SCV.T 14 45    311     193    ....
SCV.T 14 47    421      83    ....
```
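A sketch of the mapping-table idea suggested above (assumed names and sizes; the real Dispatcher structures may differ): as each worker claims a slot, it is handed the next contiguous active ID, so logs would show IDs 0..N-1 regardless of which pool workers happened to wake first.

```cpp
#include <atomic>
#include <cstdint>

enum { MAX_POOL_SIZE = 64 }; /* assumed pool capacity, for illustration */

std::atomic<uintptr_t> nextActiveId;            /* reset to 0 at each dispatch */
uintptr_t activeIdForPoolWorker[MAX_POOL_SIZE]; /* pool ID -> contiguous active ID */

/* Called by a worker once it has claimed a slot for this task; active IDs
 * come out contiguous (0..N-1) in wake-up order. */
uintptr_t assignActiveId(uintptr_t poolId)
{
	uintptr_t activeId = nextActiveId.fetch_add(1);
	activeIdForPoolWorker[poolId] = activeId;
	return activeId;
}
```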
Each platform is tested with two different machines: one with a large core count and another with a comparatively smaller core count. The benefit of these optimizations depends on the machine size. For instance, the overhead of `notify_all` is a function of the thread pool size, and the number of available h/w threads determines the default pool size.
These tables only present a single sample for each thread count (i.e., only 1 GC is measured with that thread count). They are only intended to show relative magnitude; there can be variations. The full set of logs is attached (dispatcher-opt.zip) with multiple samples for each thread count.
Without considering run-to-run variance, the scores look higher for the optimized dispatcher runs.
We can see clear performance gains when comparing collection times. Below are graphs showing the first ~300 collections from a baseline and an optimized run of the benchmark. We expect to see the most gains when the heap is not fully expanded or when the heap contracts: the number of utilized threads is a function of heap size, and during the initial stages/startup the heap is still expanding, so the optimization benefits this phase the most.
Avg. collection time, baseline: ~2.9ms. Avg. collection time, opt-dispatcher: ~1.7ms.
The current implementation of thread dispatching for parallel GC tasks is inefficient.
Threads are pre-selected for the task. This is not optimal, since an unselected thread may become available earlier than a selected one (how quickly a thread becomes available depends on the system and on the order in which threads reacquire the monitor).
From the pool of slave threads, all threads are woken up regardless of whether they're selected to perform the task, as noted in the code:
https://github.com/eclipse/omr/blob/9e5dfc852daa61a2347183b387ca75e3a637d998/gc/base/ParallelDispatcher.cpp#L381-L386
The current implementation simply does a `notify_all`, and the count is irrelevant in the actual wake-up notification: https://github.com/eclipse/omr/blob/9e5dfc852daa61a2347183b387ca75e3a637d998/gc/base/ParallelDispatcher.cpp#L387-L391

Given that threads are pre-selected and that all threads are notified, we can have contention between selected and unselected threads. All threads wait on the same mutex and need to reacquire it before continuing, which means threads proceed in a serial fashion after being notified. The mutex only becomes available when the owning thread either waits on the monitor again or exits it.
For instance, if a single thread is needed for the task, it can be the last thread to acquire the mutex (or it may not become available until some later time). In that case it must wait for all other threads in the pool to go back to sleep before it can continue.
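To make the serialization concrete, the worker-side wait loop has roughly this shape (a simplified sketch; `isSelectedForTask()` and `runTask()` are hypothetical stand-ins for the dispatcher's bookkeeping, not actual OMR code): every worker wakes on the broadcast, reacquires the monitor one at a time, and unselected workers simply go back to waiting.

```cpp
#include "omrthread.h"

/* Simplified worker loop. */
omrthread_monitor_enter(_slaveThreadMutex);
while (!isSelectedForTask(self)) {
	/* wait() releases the monitor while sleeping and must reacquire it
	 * (serially, against every other woken worker) before returning. */
	omrthread_monitor_wait(_slaveThreadMutex);
}
omrthread_monitor_exit(_slaveThreadMutex);
runTask(self);
```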
These effects can be seen here:
Here we can see that it takes around ~1.5ms for 1 selected thread (out of 64) to start up. Ideally, the selected thread should start shortly after notification.
An ideal scenario looks like this:
Overall, the time it takes to start a task can depend on: