eclipse / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

GC Inefficient Parallel Dispatching #5318

Open RSalman opened 4 years ago

RSalman commented 4 years ago

The current implementation of thread dispatching for parallel GC tasks is inefficient.

  1. Threads are pre-selected for the task. This is suboptimal because an unselected thread may become available earlier than a selected one (how soon a thread becomes available depends on the system, and on the order in which threads reacquire the monitor).

  2. From the pool of slave threads, all threads are woken up regardless of whether they are selected to perform the task, as noted in the code:

https://github.com/eclipse/omr/blob/9e5dfc852daa61a2347183b387ca75e3a637d998/gc/base/ParallelDispatcher.cpp#L381-L386

The current implementation simply does a notify_all; the count is irrelevant to the actual wake-up notification: https://github.com/eclipse/omr/blob/9e5dfc852daa61a2347183b387ca75e3a637d998/gc/base/ParallelDispatcher.cpp#L387-L391

Given that threads are pre-selected and that all threads are notified, we can have contention between selected and unselected threads. All threads wait on the same mutex, and each must reacquire the mutex before continuing. This means threads proceed in a serial fashion after being notified: the mutex only becomes available to the next thread when the owning thread either releases it to start the task or goes back to waiting on the monitor.

For instance, if a single thread needs to be used for the task, it can be the last thread to acquire the mutex (or it may not become available until some later time). In that case it must wait for all the other threads in the pool to go back to sleep before continuing.

These effects can be seen here:

------ Thread Count: 2 (total threads 64) -> Time to Notify: 199 us ------
Time from notification start to task start:
[Thread 0] 258 us  //Master thread that did notification, starts collection shortly after dispatch
[Thread 1] 1470 us  //Selected Thread starts tasks much after notification

Here we can see that it takes ~1.5 ms for the single selected thread (out of 64 threads) to start up. Ideally, the selected thread should start shortly after notification.

An ideal scenario looks like this:

------ Thread Count: 2 (total threads 64) -> Time to Notify: 254 us ------
Time from notification start to task start:
[Thread 0] 265 us  //Master thread that did notification
[Thread 1] 276 us  

Overall, the time it takes to start a task can be dependent on:

  1. Availability of selected thread
  2. Order in which selected thread is woken up / the order it wins the lock
  3. Size of the pool
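To make the contention concrete, here is a toy model of the pre-selection scheme using standard C++ threads (not OMR's omrthread API; ToyDispatcher, workerLoop, etc. are illustrative names only). Every waiter is woken by notify_all and must reacquire the mutex just to check a pre-set "selected" flag; unselected workers wake up for nothing.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Toy model of pre-selected dispatch (not OMR code): all workers are woken,
// but only workers whose "selected" flag was set beforehand run the task.
struct ToyDispatcher {
    std::mutex mtx;
    std::condition_variable cv;
    std::vector<bool> selected;
    int generation = 0;
    std::atomic<int> tasksRun{0};

    explicit ToyDispatcher(size_t poolSize) : selected(poolSize, false) {}

    void workerLoop(size_t id, int myGeneration) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [&] { return generation > myGeneration; });
        bool runTask = selected[id]; // unselected workers reacquired the mutex for nothing
        lock.unlock();
        if (runTask) {
            tasksRun.fetch_add(1);   // stand-in for starting the GC task
        }
    }

    void dispatch(size_t count) {
        std::lock_guard<std::mutex> lock(mtx);
        for (size_t i = 0; i < count; i++) selected[i] = true; // pre-select
        generation++;
        cv.notify_all();             // wakes the whole pool, selected or not
    }
};

int runToyDispatch(size_t poolSize, size_t count) {
    ToyDispatcher d(poolSize);
    std::vector<std::thread> pool;
    for (size_t i = 0; i < poolSize; i++) {
        pool.emplace_back([&d, i] { d.workerLoop(i, 0); });
    }
    d.dispatch(count);
    for (auto &t : pool) t.join();
    return d.tasksRun.load();
}
```

Even in this toy model, all poolSize workers serialize on the mutex although only count of them do any work.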
RSalman commented 4 years ago

Another thing to consider is the time to notify threads. Currently this depends on the size of the pool, not on the number of threads needed, since it's done with notify_all. This is costly when we only want a small subset of the pool: waking up 2 threads costs the same as waking up 64, so the time is expected to be roughly constant. This can be seen by measuring the time to notify (in microseconds) for various thread counts:

------ Thread Count: 64 -> Time to Notify: 212 ------
 ------ Thread Count: 63 -> Time to Notify: 189 ------
 ------ Thread Count: 62 -> Time to Notify: 189 ------
 ------ Thread Count: 61 -> Time to Notify: 211 ------
 ------ Thread Count: 60 -> Time to Notify: 242 ------
 ------ Thread Count: 59 -> Time to Notify: 261 ------
 ------ Thread Count: 58 -> Time to Notify: 269 ------
 ------ Thread Count: 57 -> Time to Notify: 249 ------
 ------ Thread Count: 56 -> Time to Notify: 253 ------
 ...
 ------ Thread Count: 10 -> Time to Notify: 252 ------
 ------ Thread Count: 9 -> Time to Notify: 253 ------
 ------ Thread Count: 8 -> Time to Notify: 255 ------
 ------ Thread Count: 7 -> Time to Notify: 254 ------
 ------ Thread Count: 6 -> Time to Notify: 254 ------
 ------ Thread Count: 5 -> Time to Notify: 258 ------
 ------ Thread Count: 4 -> Time to Notify: 253 ------
 ------ Thread Count: 3 -> Time to Notify: 249 ------
 ------ Thread Count: 2 -> Time to Notify: 254 ------
 ------ Thread Count: 1 -> Time to Notify: 260 ------

If we do a notify for each thread:

for (uintptr_t threads = 0; threads < count; threads++) {
    omrthread_monitor_notify(_slaveThreadMutex);
}

This results in the following:

------ Thread Count: 60 -> Time to Notify: 643 ------
 ------ Thread Count: 59 -> Time to Notify: 653 ------
 ------ Thread Count: 58 -> Time to Notify: 643 ------
 ------ Thread Count: 57 -> Time to Notify: 611 ------
 ------ Thread Count: 56 -> Time to Notify: 576 ------
 ------ Thread Count: 55 -> Time to Notify: 548 ------
 ------ Thread Count: 54 -> Time to Notify: 672 ------
 ------ Thread Count: 53 -> Time to Notify: 804 ------
 ------ Thread Count: 52 -> Time to Notify: 578 ------
...
 ------ Thread Count: 14 -> Time to Notify: 134 ------
 ------ Thread Count: 13 -> Time to Notify: 128 ------
 ------ Thread Count: 12 -> Time to Notify: 133 ------
 ------ Thread Count: 11 -> Time to Notify: 105 ------
 ------ Thread Count: 10 -> Time to Notify: 100 ------
 ------ Thread Count: 9 -> Time to Notify: 77 ------
 ------ Thread Count: 8 -> Time to Notify: 69 ------
 ------ Thread Count: 7 -> Time to Notify: 57 ------
 ------ Thread Count: 6 -> Time to Notify: 46 ------
 ------ Thread Count: 5 -> Time to Notify: 36 ------
 ------ Thread Count: 4 -> Time to Notify: 24 ------
 ------ Thread Count: 3 -> Time to Notify: 15 ------
 ------ Thread Count: 2 -> Time to Notify: 8 ------
 ------ Thread Count: 1 -> Time to Notify: 0 ------

This approach is better when waking up fewer than ~20 threads, but it's worse with more. We see notification times of up to 500-650 us, which decrease as the thread count decreases. We get times similar to notify_all once the count drops to about half the pool (~32 of 64 threads). The clear advantage is below ~20 threads, where times fall under 100 us.
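One possible takeaway is a hybrid policy: notify waiters individually below some machine-dependent crossover, and fall back to a single notify_all above it. A minimal sketch, where the threshold (~20) is taken from the 64-thread measurements above and the names (chooseNotifyPolicy, wakeWorkers) are purely illustrative; OMR would use omrthread_monitor_notify / _notify_all rather than std::condition_variable:

```cpp
#include <condition_variable>
#include <cstddef>

// Assumed crossover point, read off the 64-thread measurements above;
// on other machines this would need tuning (or deriving from pool size).
const std::size_t NOTIFY_INDIVIDUAL_THRESHOLD = 20;

enum class NotifyPolicy { Individual, All };

// Pick a notification strategy based on how many workers we need.
NotifyPolicy chooseNotifyPolicy(std::size_t threadsToWake) {
    return (threadsToWake < NOTIFY_INDIVIDUAL_THRESHOLD)
        ? NotifyPolicy::Individual
        : NotifyPolicy::All;
}

void wakeWorkers(std::condition_variable &cv, std::size_t threadsToWake) {
    if (chooseNotifyPolicy(threadsToWake) == NotifyPolicy::Individual) {
        for (std::size_t i = 0; i < threadsToWake; i++) {
            cv.notify_one();   // O(count) individual wake-ups, cheap for small counts
        }
    } else {
        cv.notify_all();       // one broadcast; cost scales with pool size
    }
}
```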

RSalman commented 4 years ago

Assign task to threads at hand (earliest available)

Currently, the thread tables are updated to select the threads prior to waking up the threads:

https://github.com/eclipse/omr/blob/9e5dfc852daa61a2347183b387ca75e3a637d998/gc/base/ParallelDispatcher.cpp#L478-L482

With the current implementation, we obtain the following average time to start the task (start collection) and time to notify (both in microseconds):

 Thread Count: 48 -> Time to Notify: 302 -> Average Time to Start Collection: 539 
 Thread Count: 47 -> Time to Notify: 294 -> Average Time to Start Collection: 512 
 Thread Count: 46 -> Time to Notify: 298 -> Average Time to Start Collection: 546 
 Thread Count: 45 -> Time to Notify: 296 -> Average Time to Start Collection: 500 
 Thread Count: 44 -> Time to Notify: 293 -> Average Time to Start Collection: 472 
 Thread Count: 43 -> Time to Notify: 291 -> Average Time to Start Collection: 1011 
 Thread Count: 42 -> Time to Notify: 291 -> Average Time to Start Collection: 470 
 Thread Count: 41 -> Time to Notify: 294 -> Average Time to Start Collection: 467 
 Thread Count: 40 -> Time to Notify: 287 -> Average Time to Start Collection: 972 
 Thread Count: 39 -> Time to Notify: 293 -> Average Time to Start Collection: 806 
 Thread Count: 38 -> Time to Notify: 291 -> Average Time to Start Collection: 988 
 Thread Count: 37 -> Time to Notify: 309 -> Average Time to Start Collection: 657 
 Thread Count: 36 -> Time to Notify: 301 -> Average Time to Start Collection: 970 
 Thread Count: 35 -> Time to Notify: 293 -> Average Time to Start Collection: 936 
 Thread Count: 34 -> Time to Notify: 292 -> Average Time to Start Collection: 995 
 Thread Count: 33 -> Time to Notify: 293 -> Average Time to Start Collection: 1093 
 Thread Count: 32 -> Time to Notify: 323 -> Average Time to Start Collection: 978 
 Thread Count: 31 -> Time to Notify: 306 -> Average Time to Start Collection: 1024 
 Thread Count: 30 -> Time to Notify: 309 -> Average Time to Start Collection: 967 
 Thread Count: 29 -> Time to Notify: 311 -> Average Time to Start Collection: 1039 
 Thread Count: 28 -> Time to Notify: 313 -> Average Time to Start Collection: 1019 
 Thread Count: 27 -> Time to Notify: 313 -> Average Time to Start Collection: 988 
 Thread Count: 26 -> Time to Notify: 305 -> Average Time to Start Collection: 1015 
 Thread Count: 25 -> Time to Notify: 295 -> Average Time to Start Collection: 1054 
 Thread Count: 24 -> Time to Notify: 296 -> Average Time to Start Collection: 819 
 Thread Count: 23 -> Time to Notify: 312 -> Average Time to Start Collection: 926 
 Thread Count: 22 -> Time to Notify: 296 -> Average Time to Start Collection: 919 
 Thread Count: 21 -> Time to Notify: 289 -> Average Time to Start Collection: 1034 
 Thread Count: 20 -> Time to Notify: 289 -> Average Time to Start Collection: 1068 
 Thread Count: 19 -> Time to Notify: 293 -> Average Time to Start Collection: 1053 
 Thread Count: 18 -> Time to Notify: 300 -> Average Time to Start Collection: 1096 
 Thread Count: 17 -> Time to Notify: 302 -> Average Time to Start Collection: 910 
 Thread Count: 16 -> Time to Notify: 308 -> Average Time to Start Collection: 1010 
 Thread Count: 15 -> Time to Notify: 298 -> Average Time to Start Collection: 968 
 Thread Count: 14 -> Time to Notify: 301 -> Average Time to Start Collection: 1092 
 Thread Count: 13 -> Time to Notify: 301 -> Average Time to Start Collection: 908 
 Thread Count: 12 -> Time to Notify: 299 -> Average Time to Start Collection: 902 
 Thread Count: 11 -> Time to Notify: 299 -> Average Time to Start Collection: 888 
 Thread Count: 10 -> Time to Notify: 296 -> Average Time to Start Collection: 926 
 Thread Count: 9 -> Time to Notify: 291 -> Average Time to Start Collection: 723 
 Thread Count: 8 -> Time to Notify: 290 -> Average Time to Start Collection: 945 
 Thread Count: 7 -> Time to Notify: 296 -> Average Time to Start Collection: 1023 
 Thread Count: 6 -> Time to Notify: 324 -> Average Time to Start Collection: 1102 
 Thread Count: 5 -> Time to Notify: 301 -> Average Time to Start Collection: 993 
 Thread Count: 4 -> Time to Notify: 311 -> Average Time to Start Collection: 569 
 Thread Count: 3 -> Time to Notify: 324 -> Average Time to Start Collection: 1062 
 Thread Count: 2 -> Time to Notify: 310 -> Average Time to Start Collection: 601 
 Thread Count: 1 -> Time to Notify: 308 -> Average Time to Start Collection: 644 

To get around delays in selected threads starting the task (and contention between selected and unselected threads), we can use threads in the order in which they become available (wake up / reacquire the mutex) and forgo selecting threads beforehand. With this, we obtain the following results (times in microseconds):

 Thread Count: 48 -> Time to Notify: 288 -> Average Time to Start Collection: 470 
 Thread Count: 47 -> Time to Notify: 300 -> Average Time to Start Collection: 498 
 Thread Count: 46 -> Time to Notify: 281 -> Average Time to Start Collection: 497 
 Thread Count: 45 -> Time to Notify: 284 -> Average Time to Start Collection: 439 
 Thread Count: 44 -> Time to Notify: 293 -> Average Time to Start Collection: 446 
 Thread Count: 43 -> Time to Notify: 283 -> Average Time to Start Collection: 520 
 Thread Count: 42 -> Time to Notify: 280 -> Average Time to Start Collection: 425 
 Thread Count: 41 -> Time to Notify: 283 -> Average Time to Start Collection: 434 
 Thread Count: 40 -> Time to Notify: 282 -> Average Time to Start Collection: 432 
 Thread Count: 39 -> Time to Notify: 283 -> Average Time to Start Collection: 412 
 Thread Count: 38 -> Time to Notify: 291 -> Average Time to Start Collection: 470 
 Thread Count: 37 -> Time to Notify: 285 -> Average Time to Start Collection: 501 
 Thread Count: 36 -> Time to Notify: 282 -> Average Time to Start Collection: 427 
 Thread Count: 35 -> Time to Notify: 280 -> Average Time to Start Collection: 401 
 Thread Count: 34 -> Time to Notify: 293 -> Average Time to Start Collection: 392 
 Thread Count: 33 -> Time to Notify: 283 -> Average Time to Start Collection: 377 
 Thread Count: 32 -> Time to Notify: 289 -> Average Time to Start Collection: 443 
 Thread Count: 31 -> Time to Notify: 281 -> Average Time to Start Collection: 419 
 Thread Count: 30 -> Time to Notify: 295 -> Average Time to Start Collection: 462 
 Thread Count: 29 -> Time to Notify: 282 -> Average Time to Start Collection: 408 
 Thread Count: 28 -> Time to Notify: 284 -> Average Time to Start Collection: 431 
 Thread Count: 27 -> Time to Notify: 294 -> Average Time to Start Collection: 430 
 Thread Count: 26 -> Time to Notify: 280 -> Average Time to Start Collection: 389 
 Thread Count: 25 -> Time to Notify: 281 -> Average Time to Start Collection: 367 
 Thread Count: 24 -> Time to Notify: 281 -> Average Time to Start Collection: 362 
 Thread Count: 23 -> Time to Notify: 280 -> Average Time to Start Collection: 377 
 Thread Count: 22 -> Time to Notify: 277 -> Average Time to Start Collection: 360 
 Thread Count: 21 -> Time to Notify: 281 -> Average Time to Start Collection: 366 
 Thread Count: 20 -> Time to Notify: 283 -> Average Time to Start Collection: 377 
 Thread Count: 19 -> Time to Notify: 282 -> Average Time to Start Collection: 470 
 Thread Count: 18 -> Time to Notify: 282 -> Average Time to Start Collection: 376 
 Thread Count: 17 -> Time to Notify: 279 -> Average Time to Start Collection: 342 
 Thread Count: 16 -> Time to Notify: 277 -> Average Time to Start Collection: 391 
 Thread Count: 15 -> Time to Notify: 273 -> Average Time to Start Collection: 368 
 Thread Count: 14 -> Time to Notify: 293 -> Average Time to Start Collection: 341 
 Thread Count: 13 -> Time to Notify: 286 -> Average Time to Start Collection: 388 
 Thread Count: 12 -> Time to Notify: 278 -> Average Time to Start Collection: 376 
 Thread Count: 11 -> Time to Notify: 276 -> Average Time to Start Collection: 315 
 Thread Count: 10 -> Time to Notify: 277 -> Average Time to Start Collection: 369 
 Thread Count: 9 -> Time to Notify: 295 -> Average Time to Start Collection: 440 
 Thread Count: 8 -> Time to Notify: 279 -> Average Time to Start Collection: 313 
 Thread Count: 7 -> Time to Notify: 284 -> Average Time to Start Collection: 395 
 Thread Count: 6 -> Time to Notify: 281 -> Average Time to Start Collection: 408 
 Thread Count: 5 -> Time to Notify: 311 -> Average Time to Start Collection: 473 
 Thread Count: 4 -> Time to Notify: 287 -> Average Time to Start Collection: 309 
 Thread Count: 3 -> Time to Notify: 284 -> Average Time to Start Collection: 395 
 Thread Count: 2 -> Time to Notify: 284 -> Average Time to Start Collection: 456 
 Thread Count: 1 -> Time to Notify: 287 -> Average Time to Start Collection: 309 

With this approach, we see significantly lower average times to start collection.
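The "earliest available" idea can be sketched with a counter of remaining slots that woken workers claim atomically, in arrival order. This is again a toy model using std::thread rather than the actual OMR change, and FcfsDispatcher, runFcfs, etc. are made-up names:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Toy model of first-come-first-served dispatch: instead of pre-selecting
// worker IDs, the dispatcher publishes how many workers it needs and each
// woken worker atomically claims one of the remaining slots. Whichever
// workers wake first do the work; late wakers find no slot and stand down.
struct FcfsDispatcher {
    std::mutex mtx;
    std::condition_variable cv;
    int generation = 0;
    std::atomic<int> slotsRemaining{0};
    std::atomic<int> tasksRun{0};

    void workerLoop(int myGeneration) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [&] { return generation > myGeneration; });
        lock.unlock();
        // Claim a slot in arrival order; no pre-selection needed.
        if (slotsRemaining.fetch_sub(1) > 0) {
            tasksRun.fetch_add(1); // stand-in for starting the GC task
        }
    }

    void dispatch(int count) {
        std::lock_guard<std::mutex> lock(mtx);
        slotsRemaining.store(count);
        generation++;
        cv.notify_all();
    }
};

int runFcfs(int poolSize, int count) {
    FcfsDispatcher d;
    std::vector<std::thread> pool;
    for (int i = 0; i < poolSize; i++) {
        pool.emplace_back([&d] { d.workerLoop(0); });
    }
    d.dispatch(count);
    for (auto &t : pool) t.join();
    return d.tasksRun.load();
}
```

Exactly count workers run the task, but which ones is decided by wake-up order rather than up front.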

RSalman commented 4 years ago

fyi @amicic

amicic commented 4 years ago

If we activate the first Worker threads that wake up at initial dispatch, it means that their IDs will not be contiguous?

Not sure if that's important, other than for easier debugging, but if we want them to be contiguous that's probably doable by adding an extra table in the Dispatcher that maps their ID from the pool of Workers into the (contiguous set of) IDs of active Worker threads.

That mapping would be initialized as the threads wake up.
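That mapping table could look something like this (a hypothetical sketch; WorkerIdMap and its members are made-up names, not OMR code). As each pool worker wakes and joins the task, it atomically takes the next contiguous "active" ID, so the active workers are numbered 0..count-1 regardless of their pool IDs or wake-up order:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <atomic>
#include <vector>

// Hypothetical pool-ID -> active-ID mapping, initialized as threads wake up.
struct WorkerIdMap {
    std::atomic<std::size_t> nextActiveId{0};
    std::vector<std::size_t> poolToActive; // indexed by pool ID; SIZE_MAX = inactive

    explicit WorkerIdMap(std::size_t poolSize)
        : poolToActive(poolSize, SIZE_MAX) {}

    // Called by a worker once it wakes up and claims a task slot.
    std::size_t assignActiveId(std::size_t poolId) {
        std::size_t activeId = nextActiveId.fetch_add(1);
        poolToActive[poolId] = activeId;
        return activeId;
    }

    // Called by the dispatcher before the next task is dispatched.
    void reset() {
        nextActiveId.store(0);
        std::fill(poolToActive.begin(), poolToActive.end(), SIZE_MAX);
    }
};
```

With this, the Tgc logs below could report the contiguous active IDs instead of the scattered pool IDs.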

RSalman commented 4 years ago

> If we activate the first Worker threads that wake up at initial dispatch, it means that their IDs will not be contiguous?

This is true, it's dependent on the order they wake up.

RSalman commented 4 years ago

> Not sure if that's important, other than for easier debugging

It's only noticeable for debugging purposes. For instance, Tgc:parallel logs can look something like this (using 8/48 threads):

Scavenger parallel and progress stats, timestamp="2020-06-23T16:16:27.900"
          gc thrID     busy    stall ....
                   (micros) (micros) ....
SCV.T     14     0      308      196 ....
SCV.T     14     3      340      164 ....
SCV.T     14     4      299      205 ....
SCV.T     14     6      299      205 ....
SCV.T     14    43      299      205 ....
SCV.T     14    44      407       97 ....
SCV.T     14    45      311      193 ....
SCV.T     14    47      421       83 ....
RSalman commented 4 years ago

Parallel Dispatcher Optimization Results (Hybrid Notifications and Improved Thread Selection)

Each platform was tested with two different machines: one with a large core count and another with a comparatively small core count. The benefit of these optimizations depends on the machine size; for instance, the overhead of notify_all is a function of the thread pool size, and the number of available hardware threads determines the default pool size.

These tables present only a single sample for each thread count (i.e., only 1 GC is measured at that thread count). They are only intended to show relative magnitude; there can be variations. The full set of logs is attached (dispatcher-opt.zip) with multiple samples for each thread count.

PPC AIX

[image: PPC AIX results]

x86 Linux

[image: x86 Linux results]

z/OS s390

[image: z/OS s390 results]

RSalman commented 4 years ago

XML Validation Benchmark Scores

[image: XML Validation benchmark scores]

Without considering run-to-run variance, the scores look higher for the optimized dispatcher runs.

We can see clear performance gains when comparing collection times. Below are graphs showing the first ~300 collections from a baseline run and an optimized run of the benchmark. We expect to see the most gains when the heap is not fully expanded or when the heap contracts: the number of utilized threads is a function of heap size, and during the initial stages/start-up the heap is still expanding, so the optimization benefits the most there.

[image: collection times for the first ~300 collections, baseline vs. optimized]

Avg. Collection Time Baseline: ~2.9 ms
Avg. Collection Time Opt-Dispatcher: ~1.7 ms