Closed niermann999 closed 2 weeks ago
Here are some preliminary results for the vector and transform types (matrix operations like in #116 are not ready yet)
cmath
Running ./bin/algebra_benchmark_array_vector
Run on (8 X 4400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 1.41, 1.39, 1.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
vector_add_single/process_time/threads:8 46276 ns 327862 ns 2192
vector_add_double/process_time/threads:8 67400 ns 525078 ns 1320
vector_sub_single/process_time/threads:8 52227 ns 359485 ns 2080
vector_sub_double/process_time/threads:8 88915 ns 652354 ns 1176
vector_dot_single/process_time/threads:8 62364 ns 413675 ns 1672
vector_dot_double/process_time/threads:8 76434 ns 587386 ns 1208
vector_cross_single/process_time/threads:8 93208 ns 658235 ns 800
vector_cross_double/process_time/threads:8 103723 ns 706038 ns 928
vector_normalize_single/process_time/threads:8 113774 ns 770237 ns 800
vector_normalize_double/process_time/threads:8 133942 ns 969495 ns 824
Running ./bin/algebra_benchmark_array_getter
Run on (8 X 4400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 0.46, 0.87, 1.14
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------
vector_phi_single/process_time/threads:8 847028 ns 5156875 ns 152
vector_phi_double/process_time/threads:8 709006 ns 5201120 ns 136
vector_theta_single/process_time/threads:8 704841 ns 5464596 ns 144
vector_theta_double/process_time/threads:8 786273 ns 5769861 ns 128
vector_perp_single/process_time/threads:8 41554 ns 306343 ns 2728
vector_perp_double/process_time/threads:8 69659 ns 513130 ns 1496
vector_norm_single/process_time/threads:8 52552 ns 414401 ns 2200
vector_norm_double/process_time/threads:8 71149 ns 521248 ns 1496
vector_eta_single/process_time/threads:8 908717 ns 5875246 ns 128
vector_eta_double/process_time/threads:8 1021660 ns 6661088 ns 112
Running ./bin/algebra_benchmark_array_transform3
Run on (8 X 4400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 0.52, 0.85, 1.13
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
vector_transform3_single/process_time/threads:8 285703 ns 1898714 ns 312
vector_transform3_double/process_time/threads:8 834718 ns 4412951 ns 232
Vc AoS
Running ./bin/algebra_benchmark_vc_aos_vector
Run on (8 X 4400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 1.30, 1.36, 1.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
vector_add_single/process_time/threads:8 39951 ns 287446 ns 2600
vector_add_double/process_time/threads:8 102475 ns 728556 ns 952
vector_sub_single/process_time/threads:8 42745 ns 305439 ns 3024
vector_sub_double/process_time/threads:8 95577 ns 685839 ns 800
vector_dot_single/process_time/threads:8 55274 ns 411900 ns 1936
vector_dot_double/process_time/threads:8 99647 ns 684953 ns 800
vector_cross_single/process_time/threads:8 85888 ns 612474 ns 1240
vector_cross_double/process_time/threads:8 207109 ns 1396757 ns 552
vector_normalize_single/process_time/threads:8 92751 ns 639039 ns 800
vector_normalize_double/process_time/threads:8 229359 ns 1673186 ns 448
Running ./bin/algebra_benchmark_vc_aos_getter
Run on (8 X 4400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 0.90, 0.89, 1.13
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------
vector_phi_single/process_time/threads:8 679082 ns 4528424 ns 136
vector_phi_double/process_time/threads:8 774058 ns 6094366 ns 128
vector_theta_single/process_time/threads:8 703891 ns 5364503 ns 136
vector_theta_double/process_time/threads:8 771312 ns 5841033 ns 128
vector_perp_single/process_time/threads:8 44301 ns 321502 ns 2368
vector_perp_double/process_time/threads:8 75251 ns 534540 ns 1304
vector_norm_single/process_time/threads:8 57866 ns 431679 ns 1728
vector_norm_double/process_time/threads:8 87483 ns 540028 ns 1080
vector_eta_single/process_time/threads:8 780283 ns 5854399 ns 112
vector_eta_double/process_time/threads:8 1231129 ns 7451987 ns 80
Running ./bin/algebra_benchmark_vc_aos_transform3
Run on (8 X 4400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 0.65, 0.83, 1.10
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
vector_transform3_single/process_time/threads:8 332085 ns 2240918 ns 312
vector_transform3_double/process_time/threads:8 550165 ns 3902846 ns 168
Based on #97, so that the storage vector type can be reused and vc_array4 replaced with something more generic. This will be helpful for generalizing the plugin to higher dimensional types later.
It seems Vc AoS gets slower with double for some cases, is that normal?
It is expected for vectorized code to take about twice as long in double precision as in single precision, since only half as many values fit into the same number of bits in the registers. That the cmath plugin is slower in double precision could be a hint that it is in fact partly autovectorized. Why Vc AoS is so much slower than cmath in double precision I don't know, but this is what I have seen before as well.
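To make the register argument concrete, here is a minimal sketch of the lane arithmetic. It assumes 32-bit `float` and 64-bit `double` and uses 128-bit and 256-bit register widths purely as examples; `lanes_per_register` and `vector_ops` are hypothetical helpers for illustration and are not part of Vc or of this repository.

```cpp
#include <cstddef>

// Number of values of type T that fit into one SIMD register of the given
// width in bits. Hypothetical helper, for illustration only.
template <typename T>
constexpr std::size_t lanes_per_register(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// Assumes 32-bit float and 64-bit double, as on x86-64.
static_assert(lanes_per_register<float>(256) == 8, "8 floats per 256 bits");
static_assert(lanes_per_register<double>(256) == 4, "4 doubles per 256 bits");
static_assert(lanes_per_register<float>(128) == 4, "4 floats per 128 bits");
static_assert(lanes_per_register<double>(128) == 2, "2 doubles per 128 bits");

// Number of register-wide operations needed to process n values.
template <typename T>
constexpr std::size_t vector_ops(std::size_t n, std::size_t register_bits) {
    const std::size_t lanes = lanes_per_register<T>(register_bits);
    return (n + lanes - 1) / lanes;  // round up for a partially filled register
}
```

Under these assumptions, processing the same number of values in double precision needs twice as many register-wide operations, which is where the rough factor of two comes from. It says nothing about why one plugin is slower than another at the same precision.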
Yes, I was asking why Vc double is slower than cmath double (not w.r.t. float). Thanks for the answer.
These benchmarks are a bit outdated, though, and were also run with CPU scaling enabled, which is why e.g. addition and subtraction show different results. It works much better when running:
./bin/algebra_benchmark_array_vector --benchmark_min_warmup_time=1000 --benchmark_repetitions=50 --benchmark_time_unit=ms --benchmark_display_aggregates_only=true --benchmark_enable_random_interleaving=true
with CPU scaling disabled
The CI failure does not seem to be related to this PR, see #123
This refactors the Vc AoS plugin to use explicitly vectorizing Vc types again. It also adds the plugin to the benchmarks and tests. Since not all functionality is implemented yet, I split the test suite into three blocks, so that new plugins can be implemented incrementally with testing enabled. Finally, I harmonized the naming between this plugin and the new SoA plugin.
Also removes the warmup from the benchmarks, since Google Benchmark can be configured to do that for us.
Edit: I refactored the vc_aos::transform3 and matrix44 types, so that they are shared between Vc AoS and SoA now.
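For readers unfamiliar with the two layouts, the following is a minimal sketch of the difference between AoS and SoA storage for 3D vectors. It uses plain standard-library containers rather than Vc types; `vec3_aos`, `vec3_soa` and both `dot` overloads are hypothetical names, and the padding to four elements is an assumption modelled on the name vc_array4, not a description of the actual plugin code.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// AoS (array of structures): each 3D vector keeps its own components
// together. The fourth element is padding (assumed here) so that one
// vector fills a 4-lane register.
struct vec3_aos {
    std::array<float, 4> v;  // x, y, z, padding
};

// SoA (structure of arrays): one container per component, so the same
// component of many vectors is contiguous in memory.
struct vec3_soa {
    std::vector<float> x, y, z;
};

// AoS: one dot product per call, working across the components of a vector.
inline float dot(const vec3_aos& a, const vec3_aos& b) {
    return a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2];
}

// SoA: many dot products per call, working across vectors component by
// component. This loop shape is what maps naturally onto SIMD lanes.
inline std::vector<float> dot(const vec3_soa& a, const vec3_soa& b) {
    std::vector<float> out(a.x.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    }
    return out;
}
```

In this sketch both layouts expose the same operation name, which is the kind of shared interface that makes it plausible to reuse types such as a transform or a 4x4 matrix across the two plugins.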