ROCm / HIP-CPU

An implementation of HIP that works on CPUs, across OSes.
MIT License

Remove incorrect kernel optimization. #36

Closed by fodinabor 11 months ago

fodinabor commented 1 year ago

If the first block does not encounter a barrier, that does not guarantee that the following blocks won't encounter one. The barrier semantics only require that all threads within a block reach the same barriers.

The included test showcases a simplistic example of a kernel that does satisfy the standard barrier semantics but crashes with the current HIP-CPU implementation.
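
To illustrate the failure mode described above, here is a minimal sketch in plain C++. It is not the test included in this PR and not HIP-CPU's implementation: `BlockBarrier`, `kernel` and `launch` are hypothetical names, and the barrier is a simple stand-in for `__syncthreads()`. Block 0 never reaches a barrier, while block 1 does, and the branch is uniform within each block, so the usual barrier semantics are satisfied. A runtime that only inspects block 0 would conclude that this kernel contains no barrier.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-block barrier, standing in for __syncthreads().
class BlockBarrier {
public:
    explicit BlockBarrier(std::size_t n) : threshold_(n), count_(n) {}

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        const std::size_t gen = generation_;
        if (--count_ == 0) {
            // Last thread to arrive releases the others and resets the barrier.
            ++generation_;
            count_ = threshold_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t threshold_;
    std::size_t count_;
    std::size_t generation_ = 0;
};

// Kernel body: only odd-numbered blocks synchronise. The condition depends on
// the block index alone, so all threads of a block reach the same barriers.
void kernel(std::size_t block, std::size_t thread, std::size_t block_dim,
            BlockBarrier& barrier, std::vector<int>& scratch,
            std::vector<int>& out) {
    const std::size_t gid = block * block_dim + thread;
    if (block % 2 == 1) {
        scratch[gid] = static_cast<int>(thread);
        barrier.wait();  // block 0 never gets here, block 1 does
        const std::size_t neighbour =
            block * block_dim + (thread + 1) % block_dim;
        out[gid] = scratch[neighbour];  // safe only after the barrier
    } else {
        out[gid] = -1;
    }
}

// Simplified launcher: blocks run one after another, the threads of a block
// run concurrently and share one barrier.
std::vector<int> launch(std::size_t grid_dim, std::size_t block_dim) {
    assert(block_dim > 0);
    std::vector<int> scratch(grid_dim * block_dim, 0);
    std::vector<int> out(grid_dim * block_dim, 0);
    for (std::size_t b = 0; b < grid_dim; ++b) {
        BlockBarrier barrier(block_dim);
        std::vector<std::thread> threads;
        for (std::size_t t = 0; t < block_dim; ++t) {
            threads.emplace_back(
                [&, b, t] { kernel(b, t, block_dim, barrier, scratch, out); });
        }
        for (auto& th : threads) th.join();
    }
    return out;
}
```

Calling `launch(2, 4)` runs block 0 without any synchronisation and block 1 with one barrier per thread. Running the threads of a barrier-using block strictly one after another, as one might do after seeing that block 0 has no barrier, would leave the first thread of block 1 waiting forever in this sketch.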

fodinabor commented 1 year ago

By the way, this actually improves performance as well (at least for all kernels in the performance tests): until now, all other blocks had to wait for the first block to determine whether or not a barrier is present before they could be executed in parallel.

Main:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
performance_tests is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
Monte-Carlo PI
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:70
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     8.30728 s 
                                          76.71 ms    76.0763 ms    77.4913 ms 
                                        3.58548 ms    3.01746 ms    4.36021 ms 

/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:77: FAILED:
  CHECK( PI == (static_cast<double>(n) / niter) * 4.0 )
with expansion:
  Approx( 3.1415926536 ) == 3.1413932

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
HIP-CPU                                        100             1     1.65979 s 
                                        16.5554 ms    16.2211 ms    17.0178 ms 
                                        1.98111 ms    1.50049 ms    2.79446 ms 

/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:96: FAILED:
  CHECK( PI == (static_cast<double>(n) / niter) * 4.0 )
with expansion:
  Approx( 3.1415926536 ) == 3.14065

-------------------------------------------------------------------------------
VADD
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:121
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     34.2286 s 
                                        291.258 ms    289.239 ms    294.018 ms 
                                        11.9337 ms    9.09765 ms    15.9826 ms 

HIP-CPU                                        100             1     29.2442 s 
                                        277.307 ms    276.274 ms    278.515 ms 
                                        5.67372 ms    4.83868 ms    7.02215 ms 

-------------------------------------------------------------------------------
SGEMM
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:278
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     14.6344 s 
                                         125.54 ms    124.342 ms    126.976 ms 
                                        6.64994 ms    5.69296 ms    7.73498 ms 

HIP-CPU                                        100             1     1.24133 m 
                                         772.25 ms    764.184 ms    782.212 ms 
                                        45.2482 ms    37.4027 ms     62.551 ms 

-------------------------------------------------------------------------------
N-Body
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:475
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     1.98585 m 
                                          1.1932 s      1.1826 s     1.21194 s 
                                         70.305 ms    45.3548 ms    106.215 ms 

HIP-CPU 0                                      100             1     25.2277 s 
                                        263.542 ms    259.725 ms    267.309 ms 
                                        19.3825 ms    17.5549 ms    21.6322 ms 

HIP-CPU 1                                      100             1     26.6871 s 
                                         231.13 ms    229.313 ms    233.907 ms 
                                        11.2776 ms    8.36791 ms    19.0644 ms 

HIP-CPU 2                                      100             1     1.15331 m 
                                        696.486 ms    693.559 ms    699.582 ms 
                                        15.3277 ms    13.3719 ms    18.0437 ms 

-------------------------------------------------------------------------------
N-Queens
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:865
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU - Naive                                    100         87679          0 ns 
                                        2.48727 ns    2.47303 ns    2.50961 ns 
                                      0.0892655 ns  0.0643914 ns   0.128073 ns 

CPU - Parallel                                 100             1    53.4986 ms 
                                        520.458 us    508.051 us    534.305 us 
                                        66.9775 us    58.8066 us    78.5709 us 

GPU - Parallel                                 100             1    26.3587 ms 
                                         240.73 us     228.69 us    259.964 us 
                                        76.3305 us    53.3468 us    116.112 us 

GPU - Optimised                                100             1    28.7426 ms 
                                        227.102 us    219.844 us    240.266 us 
                                        48.6692 us    30.4105 us    83.7005 us 

-------------------------------------------------------------------------------
HAXPY
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:910
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
HAXPY                                          100             1    10.8723 ms 
                                        87.6069 us     84.231 us    92.8906 us 
                                        21.1628 us    14.9325 us    29.7977 us 

HAXPY-native                                   100             1    20.4512 ms 
                                        166.409 us    162.241 us    171.686 us 
                                        23.7347 us    19.2929 us    31.2847 us 

===============================================================================
test cases:     6 |     5 passed | 1 failed
assertions: 26956 | 26954 passed | 2 failed

This patch:


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
performance_tests is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
Monte-Carlo PI
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:70
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     7.32708 s 
                                        75.6154 ms    74.9308 ms    76.4694 ms 
                                        3.89684 ms     3.2427 ms    4.67625 ms 

HIP-CPU                                        100             1     1.67327 s 
                                        14.9517 ms    14.5308 ms    15.3304 ms 
                                        2.03369 ms    1.75302 ms    2.46693 ms 

/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:96: FAILED:
  CHECK( PI == (static_cast<double>(n) / niter) * 4.0 )
with expansion:
  Approx( 3.1415926536 ) == 3.1416468

-------------------------------------------------------------------------------
VADD
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:121
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     34.1128 s 
                                        301.186 ms    298.573 ms    304.184 ms 
                                        14.2489 ms    12.3201 ms    16.7271 ms 

HIP-CPU                                        100             1     27.8238 s 
                                          272.3 ms    271.602 ms    273.108 ms 
                                        3.80903 ms    3.21205 ms    4.78626 ms 

-------------------------------------------------------------------------------
SGEMM
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:278
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     14.8577 s 
                                        126.228 ms    125.291 ms    127.483 ms 
                                        5.50916 ms    4.47272 ms    7.54921 ms 

HIP-CPU                                        100             1      1.0842 m 
                                        656.881 ms    651.239 ms    663.306 ms 
                                        30.6308 ms    26.1941 ms    36.3297 ms 

-------------------------------------------------------------------------------
N-Body
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:475
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU                                            100             1     2.03542 m 
                                         1.14952 s     1.14056 s     1.16179 s 
                                          53.18 ms    41.5814 ms    70.4657 ms 

HIP-CPU 0                                      100             1     24.0217 s 
                                        221.189 ms      220.5 ms    222.069 ms 
                                        3.94917 ms    3.30045 ms    5.08226 ms 

HIP-CPU 1                                      100             1     20.8636 s 
                                        210.983 ms    210.021 ms    212.377 ms 
                                        5.84877 ms    4.28386 ms    8.19364 ms 

HIP-CPU 2                                      100             1     1.11553 m 
                                        658.155 ms    654.364 ms    663.956 ms 
                                        23.6414 ms    16.9554 ms    35.3805 ms 

-------------------------------------------------------------------------------
N-Queens
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:865
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
CPU - Naive                                    100         93908          0 ns 
                                        2.15085 ns    2.14657 ns    2.15887 ns 
                                      0.0288525 ns  0.0188323 ns  0.0493083 ns 

CPU - Parallel                                 100             1    53.3928 ms 
                                        486.055 us    473.915 us    503.213 us 
                                        72.6388 us    55.2343 us    107.727 us 

GPU - Parallel                                 100             1    24.9036 ms 
                                        196.201 us    191.615 us    200.837 us 
                                        23.4965 us    20.8985 us    26.7478 us 

GPU - Optimised                                100             1    26.7999 ms 
                                        207.862 us    202.519 us    213.795 us 
                                         28.682 us    24.8298 us    35.9623 us 

-------------------------------------------------------------------------------
HAXPY
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:910
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
HAXPY                                          100             1      7.983 ms 
                                        71.2243 us    67.5569 us    76.9837 us 
                                        23.1313 us    15.8098 us     32.537 us 

HAXPY-native                                   100             1    13.5236 ms 
                                        112.074 us    107.964 us    117.169 us 
                                        23.1151 us    19.0588 us    28.5149 us 

===============================================================================
test cases:     6 |     5 passed | 1 failed
assertions: 26956 | 26955 passed | 1 failed
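
For a quick comparison of the two runs above, the following sketch computes the relative reduction in mean runtime for the HIP-CPU (and "GPU") rows. The means are copied from the two logs; the `Row` struct and function names are only illustrative. Since each log is a single run on one machine, the figures should be read as indicative, not as a rigorous measurement.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// One benchmark row: mean runtime on main and with this patch, in the same
// unit (ms for most rows, us for N-Queens and HAXPY).
struct Row {
    std::string name;
    double main_mean;
    double patch_mean;
};

// Relative reduction of the mean in percent (positive = faster with the patch).
double reduction_percent(const Row& r) {
    assert(r.main_mean > 0.0);
    return (r.main_mean - r.patch_mean) / r.main_mean * 100.0;
}

// Means as reported in the two logs above.
std::vector<Row> reported_means() {
    return {
        {"Monte-Carlo PI / HIP-CPU", 16.5554, 14.9517},
        {"VADD / HIP-CPU", 277.307, 272.3},
        {"SGEMM / HIP-CPU", 772.25, 656.881},
        {"N-Body / HIP-CPU 0", 263.542, 221.189},
        {"N-Body / HIP-CPU 1", 231.13, 210.983},
        {"N-Body / HIP-CPU 2", 696.486, 658.155},
        {"N-Queens / GPU - Parallel", 240.73, 196.201},
        {"N-Queens / GPU - Optimised", 227.102, 207.862},
        {"HAXPY / HAXPY", 87.6069, 71.2243},
        {"HAXPY / HAXPY-native", 166.409, 112.074},
    };
}

// Prints one line per benchmark row.
void print_summary() {
    for (const Row& r : reported_means()) {
        std::printf("%-28s %6.1f%% faster\n", r.name.c_str(),
                    reduction_percent(r));
    }
}
```

With these numbers every listed row is faster with the patch, ranging from about 2% (VADD) to about 33% (HAXPY-native), with SGEMM at roughly 15%.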

AlexVlx commented 1 year ago

@fodinabor very nice, thanks, and apologies for the delayed reply. I'm happy to merge this once you've had a chance to go through the comments. Cheers!

fodinabor commented 1 year ago

Good news :) What comments do you mean?

AlexVlx commented 11 months ago

Merged, thank you ever so much @fodinabor!