acts-project / algebra-plugins

Mozilla Public License 2.0
3 stars 10 forks source link

Turn on array_cmath vectorization #119

Closed beomki-yeo closed 5 months ago

beomki-yeo commented 5 months ago

This PR turns on the CPU vectorization for array_cmath which will be followed by dedicated optimization in #116

Not sure if we need to do this commonly for any other plugins. maybe I also need to add cmake condition that checks if the CPU supoorts avx. And is there any potential risk or downside in enabling vectorization?

At the moment, it seems to be useful as detray CPU benchmark gets faster about 10% by enabling the vectorization.

cmake ../detray -DDETRAY_CUSTOM_SCALARTYPE=float -DCMAKE_BUILD_TYPE=Release -DDETRAY_EIGEN_PLUGIN=OFF -DDETRAY_VC_PLUGIN=OFF -DDETRAY_BUILD_CUDA=OFF
./bin/detray_benchmark_cpu_array

detray main

[beomki@pc-54 detray_build]$ ./bin/detray_benchmark_cpu_array 
2024-04-12T22:47:23+02:00
Running ./bin/detray_benchmark_cpu_array
Run on (12 X 4066.38 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 6.67, 6.82, 5.22
----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_FIND_VOLUMES                               ERROR OCCURRED: 'Benchmark disabled for the toy geometry'
BM_GRID_REGULAR_BIN_CAP1/process_time               6.49 ms         6.35 ms          110
BM_GRID_REGULAR_BIN_CAP4/process_time               9.31 ms         9.30 ms           71
BM_GRID_REGULAR_BIN_CAP25/process_time              21.8 ms         21.8 ms           32
BM_GRID_REGULAR_BIN_CAP100/process_time             52.1 ms         52.0 ms           13
BM_GRID_REGULAR_NEIGHBOR_CAP1/process_time          47.0 ms         46.9 ms           15
BM_GRID_REGULAR_NEIGHBOR_CAP4/process_time           238 ms          237 ms            3
BM_GRID_IRREGULAR_BIN_CAP1/process_time             54.5 ms         54.4 ms           13
BM_GRID_IRREGULAR_NEIGHBOR_CAP1/process_time        93.8 ms         93.6 ms            7
BM_GRID2_REGULAR_BIN_CAP1/process_time              4.07 ms         4.07 ms          173
BM_GRID2_REGULAR_BIN_CAP4/process_time              3.89 ms         3.88 ms          179
BM_GRID2_REGULAR_BIN_CAP25/process_time             11.7 ms         11.6 ms           65
BM_GRID2_REGULAR_BIN_CAP100/process_time            42.2 ms         42.2 ms           17
BM_GRID2_REGULAR_NEIGHBOR_CAP1/process_time         89.1 ms         89.0 ms            8
BM_GRID2_REGULAR_NEIGHBOR_CAP4/process_time          843 ms          841 ms            1
BM_GRID2_IRREGULAR_BIN_CAP1/process_time            51.5 ms         51.4 ms           13
BM_GRID2_IRREGULAR_NEIGHBOR_CAP1/process_time        143 ms          142 ms            5
WARNING: No entries in volume finder
WARNING: No entries in volume finder
BM_INTERSECT_ALL                                     406 ms          405 ms            2
BM_INTERSECT_PLANES                                0.744 ms        0.743 ms          936
BM_INTERSECT_CYLINDERS                              3.49 ms         3.48 ms          201
BM_INTERSECT_PORTAL_CYLINDERS                       3.00 ms         2.99 ms          235
BM_INTERSECT_CONCETRIC_CYLINDERS                    4.22 ms         4.21 ms          167
BM_MASK_CUBOID_3D                                   2604 ms         2600 ms            1
BM_MASK_RECTANGLE_2D                                2272 ms         2268 ms            1
BM_MASK_TRAPEZOID_2D                                3018 ms         3013 ms            1
BM_MASK_DISC_2D                                    18307 ms        18279 ms            1
BM_MASK_RING_2D                                    18281 ms        18211 ms            1
BM_MASK_CYLINDER_3D                                20712 ms        20676 ms            1
BM_MASK_CYLINDER_2D                                18651 ms        18617 ms            1
BM_MASK_CONCENTRIC_CYLINDER_2D                     16388 ms        16358 ms            1
BM_MASK_ANNULUS_2D                                 23672 ms        23628 ms            1
BM_MASK_LINE_CIRCULAR                              21320 ms        21263 ms            1
BM_MASK_LINE_SQUARE                                36643 ms        36561 ms            1

detray with this PR

[beomki@pc-54 detray_build]$ ./bin/detray_benchmark_cpu_array 
2024-04-12T22:57:49+02:00
Running ./bin/detray_benchmark_cpu_array
Run on (12 X 4403.37 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 3.28, 4.77, 4.87
----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_FIND_VOLUMES                               ERROR OCCURRED: 'Benchmark disabled for the toy geometry'
BM_GRID_REGULAR_BIN_CAP1/process_time               5.80 ms         5.80 ms          120
BM_GRID_REGULAR_BIN_CAP4/process_time               8.58 ms         8.57 ms           81
BM_GRID_REGULAR_BIN_CAP25/process_time              29.2 ms         29.1 ms           24
BM_GRID_REGULAR_BIN_CAP100/process_time             49.7 ms         49.6 ms           14
BM_GRID_REGULAR_NEIGHBOR_CAP1/process_time          46.3 ms         46.0 ms           16
BM_GRID_REGULAR_NEIGHBOR_CAP4/process_time           234 ms          234 ms            3
BM_GRID_IRREGULAR_BIN_CAP1/process_time             46.1 ms         46.1 ms           15
BM_GRID_IRREGULAR_NEIGHBOR_CAP1/process_time        78.8 ms         78.7 ms            9
BM_GRID2_REGULAR_BIN_CAP1/process_time              4.19 ms         4.18 ms          164
BM_GRID2_REGULAR_BIN_CAP4/process_time              3.71 ms         3.70 ms          192
BM_GRID2_REGULAR_BIN_CAP25/process_time             11.4 ms         11.4 ms           60
BM_GRID2_REGULAR_BIN_CAP100/process_time            38.2 ms         38.2 ms           18
BM_GRID2_REGULAR_NEIGHBOR_CAP1/process_time         87.2 ms         87.0 ms            8
BM_GRID2_REGULAR_NEIGHBOR_CAP4/process_time          828 ms          827 ms            1
BM_GRID2_IRREGULAR_BIN_CAP1/process_time            47.6 ms         47.6 ms           13
BM_GRID2_IRREGULAR_NEIGHBOR_CAP1/process_time        134 ms          134 ms            5
WARNING: No entries in volume finder
WARNING: No entries in volume finder
BM_INTERSECT_ALL                                     374 ms          374 ms            2
BM_INTERSECT_PLANES                                0.675 ms        0.674 ms         1015
BM_INTERSECT_CYLINDERS                              3.13 ms         3.12 ms          223
BM_INTERSECT_PORTAL_CYLINDERS                       2.81 ms         2.81 ms          247
BM_INTERSECT_CONCETRIC_CYLINDERS                    3.68 ms         3.67 ms          190
BM_MASK_CUBOID_3D                                   2253 ms         2250 ms            1
BM_MASK_RECTANGLE_2D                                1933 ms         1930 ms            1
BM_MASK_TRAPEZOID_2D                                2739 ms         2735 ms            1
BM_MASK_DISC_2D                                    16200 ms        16173 ms            1
BM_MASK_RING_2D                                    16199 ms        16169 ms            1
BM_MASK_CYLINDER_3D                                18466 ms        18437 ms            1
BM_MASK_CYLINDER_2D                                17062 ms        17022 ms            1
BM_MASK_CONCENTRIC_CYLINDER_2D                     15152 ms        15044 ms            1
BM_MASK_ANNULUS_2D                                 22031 ms        21964 ms            1
BM_MASK_LINE_CIRCULAR                              18937 ms        18910 ms            1
BM_MASK_LINE_SQUARE                                33487 ms        33444 ms            1
beomki-yeo commented 5 months ago

Ok then i will take the change to the detray benchmark side

krasznaa commented 5 months ago

Notice also how the ATLAS code skips on setting anything itself if an -march flag is already set. Let's say through the $CXXFLAGS environment variable. (Which is something that we do in some special nightlies of ours.) We may want to look out for such a thing even in the R&D projects. :thinking: (Since one could argue that CMake should never set such things at all. And it would be up to a layer above it, let's say in Spack, where we would request such optimizations. But since I'm not a fan of Spack myself, I'm still okay with CMake making some of these decisions itself...)