ref: Restore explicit vectorization in the Vc AoS plugin

niermann999 commented 5 months ago

This refactors the Vc AoS plugin to use explicitly vectorizing Vc types again. It also adds the plugin to the benchmarks and tests. Since not all functionality is implemented yet, I split the test suite into three blocks so that new plugins can be implemented incrementally with testing enabled. Finally, I harmonized the naming between this plugin and the new SoA plugin.

Also removes the warmup from the benchmarks, since Google Benchmark can be configured to do that for us.

Edit: I refactored the vc_aos::transform3 and matrix44 types so that they are now shared between Vc AoS and SoA.

niermann999 commented 5 months ago

Here are some preliminary results for the vector and transform types (matrix operations like in #116 are not ready yet)

cmath

Running ./bin/algebra_benchmark_array_vector
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 1.41, 1.39, 1.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
vector_add_single/process_time/threads:8            46276 ns       327862 ns         2192
vector_add_double/process_time/threads:8            67400 ns       525078 ns         1320
vector_sub_single/process_time/threads:8            52227 ns       359485 ns         2080
vector_sub_double/process_time/threads:8            88915 ns       652354 ns         1176
vector_dot_single/process_time/threads:8            62364 ns       413675 ns         1672
vector_dot_double/process_time/threads:8            76434 ns       587386 ns         1208
vector_cross_single/process_time/threads:8          93208 ns       658235 ns          800
vector_cross_double/process_time/threads:8         103723 ns       706038 ns          928
vector_normalize_single/process_time/threads:8     113774 ns       770237 ns          800
vector_normalize_double/process_time/threads:8     133942 ns       969495 ns          824

Running ./bin/algebra_benchmark_array_getter
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.46, 0.87, 1.14
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
vector_phi_single/process_time/threads:8       847028 ns      5156875 ns          152
vector_phi_double/process_time/threads:8       709006 ns      5201120 ns          136
vector_theta_single/process_time/threads:8     704841 ns      5464596 ns          144
vector_theta_double/process_time/threads:8     786273 ns      5769861 ns          128
vector_perp_single/process_time/threads:8       41554 ns       306343 ns         2728
vector_perp_double/process_time/threads:8       69659 ns       513130 ns         1496
vector_norm_single/process_time/threads:8       52552 ns       414401 ns         2200
vector_norm_double/process_time/threads:8       71149 ns       521248 ns         1496
vector_eta_single/process_time/threads:8       908717 ns      5875246 ns          128
vector_eta_double/process_time/threads:8      1021660 ns      6661088 ns          112

Running ./bin/algebra_benchmark_array_transform3
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.52, 0.85, 1.13
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
vector_transform3_single/process_time/threads:8     285703 ns      1898714 ns          312
vector_transform3_double/process_time/threads:8     834718 ns      4412951 ns          232

Vc AoS

Running ./bin/algebra_benchmark_vc_aos_vector
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 1.30, 1.36, 1.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
vector_add_single/process_time/threads:8            39951 ns       287446 ns         2600
vector_add_double/process_time/threads:8           102475 ns       728556 ns          952
vector_sub_single/process_time/threads:8            42745 ns       305439 ns         3024
vector_sub_double/process_time/threads:8            95577 ns       685839 ns          800
vector_dot_single/process_time/threads:8            55274 ns       411900 ns         1936
vector_dot_double/process_time/threads:8            99647 ns       684953 ns          800
vector_cross_single/process_time/threads:8          85888 ns       612474 ns         1240
vector_cross_double/process_time/threads:8         207109 ns      1396757 ns          552
vector_normalize_single/process_time/threads:8      92751 ns       639039 ns          800
vector_normalize_double/process_time/threads:8     229359 ns      1673186 ns          448

Running ./bin/algebra_benchmark_vc_aos_getter
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.90, 0.89, 1.13
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
vector_phi_single/process_time/threads:8       679082 ns      4528424 ns          136
vector_phi_double/process_time/threads:8       774058 ns      6094366 ns          128
vector_theta_single/process_time/threads:8     703891 ns      5364503 ns          136
vector_theta_double/process_time/threads:8     771312 ns      5841033 ns          128
vector_perp_single/process_time/threads:8       44301 ns       321502 ns         2368
vector_perp_double/process_time/threads:8       75251 ns       534540 ns         1304
vector_norm_single/process_time/threads:8       57866 ns       431679 ns         1728
vector_norm_double/process_time/threads:8       87483 ns       540028 ns         1080
vector_eta_single/process_time/threads:8       780283 ns      5854399 ns          112
vector_eta_double/process_time/threads:8      1231129 ns      7451987 ns           80

Running ./bin/algebra_benchmark_vc_aos_transform3
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.65, 0.83, 1.10
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
vector_transform3_single/process_time/threads:8     332085 ns      2240918 ns          312
vector_transform3_double/process_time/threads:8     550165 ns      3902846 ns          168

niermann999 commented 5 months ago

Based on #97, so that the storage vector type can be reused and vc_array4 replaced with something more generic. This will be helpful for generalizing the plugin to higher-dimensional types later.

beomki-yeo commented 4 months ago

It seems Vc AoS gets slower with double for some cases, is that normal?

niermann999 commented 4 months ago

> It seems Vc AoS gets slower with double for some cases, is that normal?

It is expected for vectorized code to be about twice as slow in double precision as in single precision (only half as many values fit into the same number of bits in the registers). I believe the fact that the cmath plugin is slower in double precision could be a hint that it is in fact partly autovectorized. Why Vc AoS is so much slower than cmath in double precision I don't know, but this is what I have seen before as well.
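
The register-width argument can be sketched in a few lines. This is only an illustration, not code from the plugin: `lanes_per_register` is a hypothetical helper, and the 256-bit register width is an assumption (e.g. AVX2); the actual width depends on the instruction set the build targets.

```cpp
#include <cstddef>

// Hypothetical helper (not part of algebra-plugins or Vc): number of values
// of type T that fit into one SIMD register of the given width in bits.
template <typename T>
constexpr std::size_t lanes_per_register(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// The lane counts below assume the usual 4-byte float and 8-byte double.
static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "lane counts below assume 4-byte float and 8-byte double");

// Assumption: 256-bit registers. One instruction then processes 8 floats
// but only 4 doubles, i.e. half the throughput per instruction.
static_assert(lanes_per_register<float>(256) == 8, "8 floats per register");
static_assert(lanes_per_register<double>(256) == 4, "4 doubles per register");
```

With half as many lanes per instruction, a fully vectorized loop over the same number of elements needs roughly twice as many instructions in double precision, which is where the factor of two comes from.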

beomki-yeo commented 4 months ago

Yes, I was asking why Vc double is slower than cmath double (not w.r.t. float). Thanks for the answer.
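
To put rough numbers on that comparison, the CPU times from the double-precision rows of the tables above can be divided directly. This is a small sketch in which `slowdown` is a hypothetical helper and the inputs are copied from the posted logs; since those runs were taken with CPU scaling enabled, the ratios are indicative only.

```cpp
// Hypothetical helper: ratio of the Vc AoS CPU time to the cmath CPU time.
// A value above 1 means Vc AoS was slower in that benchmark.
constexpr double slowdown(double vc_aos_ns, double cmath_ns) {
    return vc_aos_ns / cmath_ns;
}

// CPU times (ns) copied from the double-precision rows of the posted tables.
constexpr double cross_double = slowdown(1396757.0, 706038.0);
constexpr double add_double = slowdown(728556.0, 525078.0);
constexpr double transform3_double = slowdown(3902846.0, 4412951.0);

// vector_cross_double: Vc AoS took close to twice the cmath CPU time.
static_assert(cross_double > 1.9 && cross_double < 2.0, "cross ratio");
// vector_add_double: Vc AoS took between 1.3 and 1.4 times the cmath CPU time.
static_assert(add_double > 1.3 && add_double < 1.4, "add ratio");
// vector_transform3_double: here Vc AoS was faster than cmath.
static_assert(transform3_double < 1.0, "transform3 ratio");
```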

niermann999 commented 4 months ago

These benchmarks are a bit outdated though (and also done with CPU scaling, so that e.g. addition and subtraction show different results). It works much better running:

./bin/algebra_benchmark_array_vector --benchmark_min_warmup_time=1000 --benchmark_repetitions=50 --benchmark_time_unit=ms --benchmark_display_aggregates_only=true --benchmark_enable_random_interleaving=true

with CPU scaling disabled

niermann999 commented 4 months ago

The CI failure does not seem to be related to this PR, see #123

acts-project / algebra-plugins

ref: Restore explicit vectorization in the Vc AoS plugin #118