ceres-solver / ceres-solver

A large scale non-linear optimization library
http://ceres-solver.org/

Multithreaded performance for small dense problems has regressed #1016

Closed: sandwichmaker closed this issue 7 months ago

sandwichmaker commented 7 months ago

@rlabbe and @cfneuhaus report a slowdown on small dense problems when using multiple threads.

@rlabbe reported:

Solver Summary (v 2.1.0-eigen-(3.4.90)-no_lapack-eigensparse-no_openmp)

                                     Original                  Reduced
Parameter blocks                            1                        1
Parameters                                 12                       12
Effective parameters                        7                        7
Residual blocks                           127                      127
Residuals                                 254                      254
<snip>
Threads                                     7                        7
<snip>
Time (in seconds):
Preprocessor                         0.000082

  Residual only evaluation           0.000129 (7)
  Jacobian & residual evaluation     0.000386 (7)
  Linear solver                      0.000047 (7)
Minimizer                            0.000629

Postprocessor                        0.000002
Total                                0.000713

vs.

Solver Summary (v 2.2.0-eigen-(3.4.90)-no_lapack-eigensparse-no_custom_blas)

                                     Original                  Reduced
Parameter blocks                            1                        1
Parameters                                 12                       12
Effective parameters                        7                        7
Residual blocks                           124                      124
Residuals                                 248                      248

<snip>
Threads                                     7                        7
<snip>
Time (in seconds):
Preprocessor                         0.000112

  Residual only evaluation           0.000065 (4)
  Jacobian & residual evaluation     0.000798 (4)
  Linear solver                      0.000043 (4)
Minimizer                            0.001163

Postprocessor                        0.000004
Total                                0.001279

But in 2.2 I am seeing a huge performance decrease with more threads. One thread gives the best performance, competitive with 2.1 (I need more study to quantify that); two threads is slower, and so on. If I use as many threads as I have cores (24), solve times are around 50 msec vs 500 usec-1 msec with one thread. In contrast, 2.1's performance degrades a bit if you give it excessive cores, but nothing like this (~395 usec for 1 thread, ~600 usec for 24 threads).

He narrowed it down to commit b158515089a85be8425db66ed43546605f86a00e

In his tests:

Threads   Time
1         976.30 usec
2           1.05 msec
3           2.01 msec
...
23         20.97 msec

Performance degrades substantially with an increasing number of threads.

@cfneuhaus also reports similar problems.

Solver Summary (v 2.1.0-eigen-(3.4.0)-lapack-suitesparse-()-no_openmp)

                                     Original                  Reduced
Parameter blocks                            1                        1
Parameters                                  6                        6
Residual blocks                          1759                     1759
Residuals                                1761                     1761

<SNIP>
Threads                                    24                       24
Linear solver ordering              AUTOMATIC                        1

<SNIP>

Time (in seconds):
Preprocessor                         0.000370

  Residual only evaluation           0.001278 (10)
  Jacobian & residual evaluation     0.001578 (10)
  Linear solver                      0.000251 (10)
Minimizer                            0.003371

Postprocessor                        0.000010
Total                                0.003750

vs.

Solver Summary (v 2.2.0-eigen-(3.4.0)-lapack-suitesparse-())

                                     Original                  Reduced
Parameter blocks                            1                        1
Parameters                                  6                        6
Residual blocks                          1759                     1759
Residuals                                1761                     1761
<SNIP>
Threads                                    24                       24
<SNIP>
Time (in seconds):
Preprocessor                         0.002062

  Residual only evaluation           0.000789 (10)
  Jacobian & residual evaluation     0.032251 (10)
  Linear solver                      0.000408 (10)
Minimizer                            0.037328

Postprocessor                        0.000012
Total                                0.039403

and when compared to single-threaded performance on the same problem:

2.1.0, 24 Threads: Total 0.003750
2.2.0, 24 Threads: Total 0.039403
2.1.0, 1 Thread: Total 0.005006
2.2.0, 1 Thread: Total 0.004828

Single-threaded performance of the two versions seems to be about the same, but multithreaded performance has gotten substantially worse.

sandwichmaker commented 7 months ago

cc: @DmitriyKorchemkin

DmitriyKorchemkin commented 7 months ago

Parallel operations on small vectors are indeed very slow:

dkorchemkin@home:~/oss/ceres-solver/build$ numactl -m0 -N0 ./bin/parallel_vector_operations_benchmark
2023-10-06T16:41:41+00:00                                                                                                                 
Running ./bin/parallel_vector_operations_benchmark                                                                                        
Run on (112 X 3200 MHz CPU s)                                                                                                          
CPU Caches:                                                                                                                               
  L1 Data 32 KiB (x56)                                                                                                                    
  L1 Instruction 32 KiB (x56)                                                                                                             
  L2 Unified 1024 KiB (x56)                                                                                                               
  L3 Unified 39424 KiB (x2)                                                                                                               
Load Average: 1.36, 2.46, 2.52                                                                                                            
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
SetZero/128                         12.8 ns         12.8 ns     54510180
SetZero/256                         22.8 ns         22.8 ns     30651968
SetZero/1024                        51.3 ns         51.3 ns     13566727
SetZero/4096                         187 ns          187 ns      3717838
SetZero/16384                       2118 ns         2118 ns       330561
SetZero/65536                       8445 ns         8445 ns        83189
SetZero/262144                     65333 ns        65325 ns        10897
SetZero/1048576                   339241 ns       339234 ns         2056
SetZero/4194304                  2012458 ns      2012283 ns          398
SetZero/8388608                  9881642 ns      9879515 ns           73
SetZeroParallel/128/1               13.8 ns         13.8 ns     50736474
SetZeroParallel/128/2               3673 ns         2876 ns       245485
SetZeroParallel/128/4              10042 ns         8163 ns        83980
SetZeroParallel/128/8              25209 ns        20324 ns        33505
SetZeroParallel/128/16             52716 ns        43483 ns        15548
SetZeroParallel/256/1               26.4 ns         26.4 ns     26584992
SetZeroParallel/256/2               3675 ns         2892 ns       235985
SetZeroParallel/256/4              10160 ns         8234 ns        85836
SetZeroParallel/256/8              25094 ns        20286 ns        33834
SetZeroParallel/256/16             54485 ns        44815 ns        16205
SetZeroParallel/1024/1              54.1 ns         54.1 ns     12529229
SetZeroParallel/1024/2              4131 ns         3314 ns       209834
SetZeroParallel/1024/4             10590 ns         8644 ns        80252
SetZeroParallel/1024/8             26006 ns        20648 ns        33643
SetZeroParallel/1024/16            53349 ns        43928 ns        15431
SetZeroParallel/4096/1               189 ns          189 ns      3673344
SetZeroParallel/4096/2              6014 ns         5135 ns       135995
SetZeroParallel/4096/4             11459 ns         9486 ns        71060
SetZeroParallel/4096/8             25476 ns        20449 ns        32472
SetZeroParallel/4096/16            53848 ns        44105 ns        15512
SetZeroParallel/16384/1             2127 ns         2127 ns       327985
SetZeroParallel/16384/2            10320 ns         9189 ns        75872
SetZeroParallel/16384/4            13274 ns        11633 ns        58642
SetZeroParallel/16384/8            26719 ns        21349 ns        34300
SetZeroParallel/16384/16           55862 ns        46483 ns        15580
SetZeroParallel/65536/1             8474 ns         8474 ns        81212
SetZeroParallel/65536/2            24117 ns        19293 ns        35435
SetZeroParallel/65536/4            20482 ns        18834 ns        36977
SetZeroParallel/65536/8            26524 ns        21256 ns        32848
SetZeroParallel/65536/16           54782 ns        44951 ns        15322
SetZeroParallel/262144/1           64730 ns        64728 ns        10774
SetZeroParallel/262144/2           73088 ns        64160 ns         9716
SetZeroParallel/262144/4           57310 ns        51761 ns        10000
SetZeroParallel/262144/8           40152 ns        35538 ns        18878
SetZeroParallel/262144/16          59208 ns        47971 ns        14780
SetZeroParallel/1048576/1         340296 ns       340293 ns         2059
SetZeroParallel/1048576/2         271003 ns       250716 ns         2538
SetZeroParallel/1048576/4         183528 ns       168460 ns         4151
SetZeroParallel/1048576/8         112738 ns       102182 ns         6854
SetZeroParallel/1048576/16         83996 ns        73060 ns         9545
SetZeroParallel/4194304/1        1693640 ns      1693623 ns          439
SetZeroParallel/4194304/2         921791 ns       909744 ns          814
SetZeroParallel/4194304/4         772264 ns       701767 ns          998
SetZeroParallel/4194304/8         427478 ns       394644 ns         1785
SetZeroParallel/4194304/16        263210 ns       239946 ns         2869
SetZeroParallel/8388608/1        9869803 ns      9867998 ns           71
SetZeroParallel/8388608/2        5065903 ns      5059889 ns          140
SetZeroParallel/8388608/4        2617067 ns      2541684 ns          272
SetZeroParallel/8388608/8        1348521 ns      1329147 ns          535
SetZeroParallel/8388608/16        877319 ns       852327 ns          816

Note that for small vector sizes (say, 128) the time scales roughly linearly with the number of threads (2, 4, 8, ...), which suggests that this is the overhead of launching tasks via the thread pool.

(The parallel benchmark with 1 thread measures the cost of going through the ParallelFor invocation down to the InvokeSerial call.)

Adding a threshold on the number of vector elements (https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25280) seems to solve this problem; benchmark results are here: https://docs.google.com/spreadsheets/d/1qfxrJ1MhzEFmyhWYWN4RY3EJcoJF6d6nv4oJkdK_EaA/edit#gid=1902184755

For each sub-table
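
For intuition about the threshold approach, here is a minimal standalone sketch of a size-based cutoff that falls back to a serial loop for small ranges. It uses a toy std::thread fan-out and a made-up threshold constant; it is not the actual patch and not the Ceres ParallelFor implementation.

// Illustration only: fall back to a plain serial loop when the range is too
// small for parallelism to pay off. Threshold and helper name are made up.
#include <algorithm>
#include <thread>
#include <vector>

constexpr int kMinElementsForParallelism = 4096;  // arbitrary cutoff

template <typename F>
void MaybeParallelFor(int begin, int end, int num_threads, const F& f) {
  const int n = end - begin;
  // Small ranges (or a single thread): the loop itself is far cheaper than
  // waking even one worker, so run it inline.
  if (num_threads <= 1 || n < kMinElementsForParallelism) {
    for (int i = begin; i < end; ++i) f(i);
    return;
  }
  // Large ranges: split into contiguous blocks, one per thread.
  const int block = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    const int b = begin + t * block;
    const int e = std::min(end, b + block);
    if (b >= e) break;
    workers.emplace_back([b, e, &f] {
      for (int i = b; i < e; ++i) f(i);
    });
  }
  for (auto& w : workers) w.join();
}

int main() {
  // Example: zeroing a small vector stays serial, a large one is split.
  std::vector<double> x(128, 1.0);
  MaybeParallelFor(0, static_cast<int>(x.size()), 8,
                   [&x](int i) { x[i] = 0.0; });
}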

sandwichmaker commented 7 months ago

It seems to me that alloca doesn't really help, or am I missing something?

DmitriyKorchemkin commented 7 months ago

It is only meaningful for Dot and Norm, and for those benchmarks it reduces the overhead in the smallest case:

Dot               Threads:        1          2          4          8         16
128, std::vector            1.973E+00  1.973E+00  1.975E+00  2.094E+00  2.410E+00
128, alloca                 1.559E+00  1.525E+00  1.590E+00  1.719E+00  2.243E+00

Norm              Threads:        1          2          4          8         16
128, std::vector            1.659E+00  1.799E+00  1.867E+00  1.994E+00  2.291E+00
128, alloca                 1.328E+00  1.356E+00  1.407E+00  1.543E+00  1.829E+00
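
For illustration, the difference being measured is where the per-thread partial results live. Here is a minimal standalone sketch of a parallel dot product with alloca-backed partial sums; it uses a toy std::thread fan-out and is not the actual Ceres code.

// Illustration only: parallel dot product with per-thread partial sums kept
// in a stack buffer obtained via alloca instead of a heap-allocated
// std::vector, saving one small allocation per call.
#include <alloca.h>  // glibc; the header providing alloca varies by platform

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

double ParallelDot(const double* x, const double* y, int n, int num_threads) {
  // Stack-allocated partials; the heap alternative would be
  //   std::vector<double> partial(num_threads);
  double* partial = static_cast<double*>(alloca(num_threads * sizeof(double)));

  const int block = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    const int b = t * block;
    const int e = std::min(n, b + block);
    partial[t] = 0.0;
    if (b >= e) continue;
    workers.emplace_back([x, y, b, e, partial, t] {
      partial[t] = std::inner_product(x + b, x + e, y + b, 0.0);
    });
  }
  // The alloca buffer lives in this frame, so all workers must be joined
  // before returning.
  for (auto& w : workers) w.join();
  return std::accumulate(partial, partial + num_threads, 0.0);
}

int main() {
  std::vector<double> x(128, 1.0), y(128, 2.0);
  return ParallelDot(x.data(), y.data(), 128, 4) == 256.0 ? 0 : 1;
}
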
DmitriyKorchemkin commented 7 months ago

Another problem is lock contention in ParallelFor, where num_threads tasks are pushed into the queue of ThreadPool one after another by the main thread (while ThreadPool starts executing tasks, popping them from the queue).

I think we should put the next task into the queue from the task itself.

This way there should almost always be only one thread trying to do something with the ConcurrentQueue inside ThreadPool.
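
A rough standalone sketch of that scheme (a toy pool, not the actual Ceres ThreadPool or ConcurrentQueue): the caller enqueues only the first task, and each task enqueues its successor before doing its own block of work, so at any moment there is usually a single producer touching the queue.

// Illustration only: "chained" task submission on a toy thread pool.
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TinyPool {
 public:
  explicit TinyPool(int n) {
    for (int i = 0; i < n; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
            if (stop_ && queue_.empty()) return;
            task = std::move(queue_.front());
            queue_.pop();
          }
          task();
        }
      });
    }
  }
  ~TinyPool() {
    { std::lock_guard<std::mutex> lock(mu_); stop_ = true; }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
  void Enqueue(std::function<void()> task) {
    { std::lock_guard<std::mutex> lock(mu_); queue_.push(std::move(task)); }
    cv_.notify_one();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> queue_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};

// The old scheme would enqueue all num_tasks closures from the caller here,
// contending with the workers that immediately start popping them. In the
// chained scheme the caller enqueues only task 0; task i enqueues task i + 1
// before doing its own share of the work.
void ParallelInvokeChained(TinyPool& pool, int num_tasks,
                           const std::function<void(int)>& work) {
  std::atomic<int> done(0);
  std::function<void(int)> run = [&](int i) {
    if (i + 1 < num_tasks) pool.Enqueue([&run, i] { run(i + 1); });
    work(i);
    ++done;
  };
  pool.Enqueue([&run] { run(0); });
  // Block until every task has finished so the locals captured by reference
  // above stay alive while the workers use them.
  while (done.load() < num_tasks) std::this_thread::yield();
}

int main() {
  TinyPool pool(4);
  std::vector<double> v(1 << 16, 1.0);
  const int block = static_cast<int>(v.size()) / 4;
  ParallelInvokeChained(pool, 4, [&](int i) {
    std::fill(v.begin() + i * block, v.begin() + (i + 1) * block, 0.0);
  });
}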

I have added a SchedulerBenchmark that sets one value to zero per iteration.
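
For reference, a benchmark of that shape might look roughly like the following with Google Benchmark. The parallel for here is a trivial std::thread stand-in to keep the example self-contained; the real benchmark drives the library's ParallelFor and thread pool, and the exact workload of the actual SchedulerBenchmark may differ.

// Sketch only: measure scheduling overhead by making the per-index work a
// single trivial store. Sizes and thread counts mirror the tables below.
#include <benchmark/benchmark.h>

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for the library's parallel for: split [begin, end) into one
// contiguous block per thread and invoke f(i) for every index.
static void ParallelForStandIn(int begin, int end, int num_threads,
                               const std::function<void(int)>& f) {
  if (num_threads <= 1) {
    for (int i = begin; i < end; ++i) f(i);
    return;
  }
  const int n = end - begin;
  const int block = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    const int b = begin + t * block;
    const int e = std::min(end, b + block);
    if (b >= e) break;
    workers.emplace_back([b, e, &f] {
      for (int i = b; i < e; ++i) f(i);
    });
  }
  for (auto& w : workers) w.join();
}

static void SchedulerBenchmark(benchmark::State& state) {
  const int size = static_cast<int>(state.range(0));
  const int num_threads = static_cast<int>(state.range(1));
  std::vector<double> values(size, 1.0);
  for (auto _ : state) {
    ParallelForStandIn(0, size, num_threads,
                       [&values](int i) { values[i] = 0.0; });
    benchmark::DoNotOptimize(values.data());
  }
}
BENCHMARK(SchedulerBenchmark)
    ->ArgsProduct({{128, 256, 1024, 4096}, {1, 2, 4, 8, 16}});

BENCHMARK_MAIN();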

With the old scheme of launching all tasks from the main thread:

 ---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
SchedulerBenchmark/128/1         14.1 ns         14.1 ns     49496153
SchedulerBenchmark/128/2         3965 ns         3093 ns       240173
SchedulerBenchmark/128/4        13162 ns         9754 ns        71478
SchedulerBenchmark/128/8        30643 ns        23366 ns        29614
SchedulerBenchmark/128/16       63694 ns        50700 ns        10000
SchedulerBenchmark/256/1         24.1 ns         24.1 ns     28943598
SchedulerBenchmark/256/2         3878 ns         3069 ns       227498
SchedulerBenchmark/256/4        13293 ns         9939 ns        69817
SchedulerBenchmark/256/8        31117 ns        23662 ns        32640
SchedulerBenchmark/256/16       59503 ns        47098 ns        14910
SchedulerBenchmark/1024/1        56.7 ns         56.7 ns     12048398
SchedulerBenchmark/1024/2        4346 ns         3492 ns       203140
SchedulerBenchmark/1024/4       13487 ns        10324 ns        66736
SchedulerBenchmark/1024/8       30982 ns        23797 ns        33090
SchedulerBenchmark/1024/16      63199 ns        50200 ns        14762
SchedulerBenchmark/4096/1         189 ns          189 ns      3633540
SchedulerBenchmark/4096/2        5932 ns         5141 ns       131884
SchedulerBenchmark/4096/4       14784 ns        11539 ns        61236
SchedulerBenchmark/4096/8       35857 ns        27114 ns        29276
SchedulerBenchmark/4096/16      63934 ns        50643 ns        10000

With the proposed scheme:

 ---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
SchedulerBenchmark/128/1         14.1 ns         14.1 ns     49323498
SchedulerBenchmark/128/2         2411 ns         1907 ns       362916
SchedulerBenchmark/128/4         3556 ns         2809 ns       243026
SchedulerBenchmark/128/8         4346 ns         3487 ns       200626
SchedulerBenchmark/128/16        5066 ns         4245 ns       169698
SchedulerBenchmark/256/1         24.2 ns         24.2 ns     28960018
SchedulerBenchmark/256/2         2330 ns         1906 ns       388470
SchedulerBenchmark/256/4         3864 ns         3029 ns       219233
SchedulerBenchmark/256/8         4399 ns         3528 ns       195225
SchedulerBenchmark/256/16        5111 ns         4255 ns       161858
SchedulerBenchmark/1024/1        55.9 ns         55.9 ns     12204777
SchedulerBenchmark/1024/2        2541 ns         2083 ns       329807
SchedulerBenchmark/1024/4        3977 ns         3259 ns       222628
SchedulerBenchmark/1024/8        4607 ns         3726 ns       193548
SchedulerBenchmark/1024/16       5031 ns         4175 ns       160285
SchedulerBenchmark/4096/1         188 ns          188 ns      3714433
SchedulerBenchmark/4096/2        4203 ns         3641 ns       188284
SchedulerBenchmark/4096/4        4832 ns         4057 ns       171811
SchedulerBenchmark/4096/8        5605 ns         4469 ns       159093
SchedulerBenchmark/4096/16       6425 ns         5505 ns       126861

This makes too-short parallel tasks and too many threads less painful.

Submitted as a separate CL: https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25300

sandwichmaker commented 7 months ago

@DmitriyKorchemkin shall we get this latter patch in, since it seems like a pure win, and then re-evaluate your first patch?

DmitriyKorchemkin commented 7 months ago

@sandwichmaker yes, I think we should merge the last one first.

It will at least make the growth in runtime with the number of threads less pronounced (but the gap between 1 and 2 threads is still huge).

sandwichmaker commented 7 months ago

Okay, the scheduler change is in; let's look at the performance of https://ceres-solver-review.git.corp.google.com/c/ceres-solver/+/25280 on top of HEAD now.

rlabbe commented 7 months ago

I checked out HEAD; the numbers are now very competitive, with 2.2 better than 2.1 until you overspecify the number of threads, and even then it is only worse by a modest amount:

1 thread performance

Ceres 2.1 mean: 1.15 msec std: 552.40 usec (200 points) Ceres 2.2 mean: 1.07 msec std: 481.22 usec (200 points)

Ceres 2.1 mean: 3.56 msec std: 1.69 msec (1000 points) Ceres 2.2 mean: 3.54 msec std: 1.61 msec (1000 points)

2 thread performance

Ceres 2.1 mean: 921.03 usec std: 359.30 usec (200 points) Ceres 2.2 mean: 913.78 usec std: 367.30 usec (200 points)

Ceres 2.1 mean: 2.58 msec std: 1.16 msec (1000 points) Ceres 2.2 mean: 2.45 msec std: 1.07 msec (1000 points)

4 thread performance

Ceres 2.1 mean: 845.64 usec std: 416.76 usec (200 points) Ceres 2.2 mean: 770.07 usec std: 354.09 usec (200 points)

Ceres 2.1 mean: 1.96 msec std: 998.91 usec (1000 points) Ceres 2.2 mean: 1.73 msec std: 725.61 usec (1000 points)

8 thread performance

Ceres 2.1 mean: 801.78 usec std: 407.24 usec (200 points) Ceres 2.2 mean: 944.09 usec std: 413.59 usec (200 points)

Ceres 2.1 mean: 1.85 msec std: 919.09 usec (1000 points) Ceres 2.2 mean: 1.96 msec std: 858.14 usec (1000 points)

12 thread performance

Ceres 2.1 mean: 814.86 usec std: 337.59 usec (200 points) Ceres 2.2 mean: 1.12 msec std: 554.43 usec (200 points)

Ceres 2.1 mean: 1.65 msec std: 564.98 usec (1000 points) Ceres 2.2 mean: 1.80 msec std: 604.28 usec (1000 points)

16 thread performance

Ceres 2.1 mean: 946.84 usec std: 486.15 usec (200 points) Ceres 2.2 mean: 1.30 msec std: 527.63 usec (200 points)

Ceres 2.1 mean: 1.63 msec std: 605.31 usec (1000 points) Ceres 2.2 mean: 1.89 msec std: 590.79 usec (1000 points)

23 thread performance (don't ask why 23 vs 24, nothing to do with ceres)

Ceres 2.1 mean: 1.10 msec std: 428.13 usec (200 points) Ceres 2.2 mean: 1.60 msec std: 544.80 usec (200 points)

Ceres 2.1 mean: 1.71 msec std: 521.12 usec (1000 points) Ceres 2.2 mean: 2.25 msec std: 698.23 usec (1000 points)

DmitriyKorchemkin commented 7 months ago

@sandwichmaker here are benchmark results with old and new task enqueuing (the new scheme significantly improves worst-case performance): https://docs.google.com/spreadsheets/d/1Mdg6IqvqZd8cRLw69KAf8UM9WcscQmENBay6QpWflJk/edit?usp=sharing

sandwichmaker commented 7 months ago

@DmitriyKorchemkin I am only seeing one sheet.

DmitriyKorchemkin commented 7 months ago

@sandwichmaker Strange, should be 8 sheets. Please find the exported file attached.

2023-10-07 ceres-solver parallel vector ops.ods

sandwichmaker commented 7 months ago

Ah, it was a Google account issue: from my official Google account I only see one sheet, but on sandwichmaker@gmail.com I can see all the sheets.

sandwichmaker commented 7 months ago

Both Roger and Frank have confirmed that @DmitriyKorchemkin's two changes have fixed this issue.