beomki-yeo closed this 4 months ago
Currently I have only optimized the addition and the constant multiplication; as far as I can see, vectorizing the matrix multiplication makes the program slower.
Before Optimization (columns: benchmark, wall time, CPU time, iterations, throughput)
CPU unsync propagation/8 223808 ns 223285 ns 3101 TracksPropagated=286.629k/s
CPU unsync propagation/16 959241 ns 955581 ns 708 TracksPropagated=267.9k/s
CPU unsync propagation/32 3838758 ns 3756516 ns 178 TracksPropagated=272.593k/s
CPU unsync propagation/64 14608613 ns 14524319 ns 45 TracksPropagated=282.01k/s
CPU unsync propagation/128 63197599 ns 62829071 ns 12 TracksPropagated=260.771k/s
CPU unsync propagation/256 245358319 ns 244832250 ns 3 TracksPropagated=267.677k/s
CPU sync propagation/8 252922 ns 251167 ns 2626 TracksPropagated=254.81k/s
CPU sync propagation/16 1202544 ns 1146934 ns 634 TracksPropagated=223.204k/s
CPU sync propagation/32 5024259 ns 4963774 ns 186 TracksPropagated=206.295k/s
CPU sync propagation/64 15572888 ns 15382790 ns 47 TracksPropagated=266.272k/s
CPU sync propagation/128 62399932 ns 61958364 ns 11 TracksPropagated=264.436k/s
CPU sync propagation/256 249225684 ns 248533783 ns 3 TracksPropagated=263.691k/s
CUDA unsync propagation/8 4139261 ns 4125437 ns 160 TracksPropagated=15.5135k/s
CUDA unsync propagation/16 10304838 ns 10280975 ns 65 TracksPropagated=24.9004k/s
CUDA unsync propagation/32 11759391 ns 11545430 ns 60 TracksPropagated=88.6931k/s
CUDA unsync propagation/64 15486088 ns 15454851 ns 44 TracksPropagated=265.03k/s
CUDA unsync propagation/128 42272228 ns 41622470 ns 17 TracksPropagated=393.634k/s
CUDA unsync propagation/256 173225146 ns 172866587 ns 4 TracksPropagated=379.113k/s
CUDA sync propagation/8 3932279 ns 3921049 ns 178 TracksPropagated=16.3222k/s
CUDA sync propagation/16 10153322 ns 10131947 ns 67 TracksPropagated=25.2666k/s
CUDA sync propagation/32 10911805 ns 10888519 ns 62 TracksPropagated=94.044k/s
CUDA sync propagation/64 13669611 ns 13640279 ns 50 TracksPropagated=300.287k/s
CUDA sync propagation/128 39348997 ns 39250588 ns 18 TracksPropagated=417.421k/s
CUDA sync propagation/256 161579596 ns 158291125 ns 4 TracksPropagated=414.022k/s
After Optimization (columns: benchmark, wall time, CPU time, iterations, throughput)
CPU unsync propagation/8 245712 ns 244354 ns 2676 TracksPropagated=261.915k/s
CPU unsync propagation/16 940747 ns 937644 ns 700 TracksPropagated=273.025k/s
CPU unsync propagation/32 4044877 ns 3915183 ns 188 TracksPropagated=261.546k/s
CPU unsync propagation/64 13939578 ns 13733911 ns 48 TracksPropagated=298.24k/s
CPU unsync propagation/128 58935459 ns 56378559 ns 11 TracksPropagated=290.607k/s
CPU unsync propagation/256 229938854 ns 228334281 ns 3 TracksPropagated=287.018k/s
CPU sync propagation/8 261180 ns 260296 ns 2528 TracksPropagated=245.874k/s
CPU sync propagation/16 947878 ns 942067 ns 724 TracksPropagated=271.743k/s
CPU sync propagation/32 5582073 ns 5339018 ns 100 TracksPropagated=191.796k/s
CPU sync propagation/64 14422517 ns 14356729 ns 48 TracksPropagated=285.302k/s
CPU sync propagation/128 68016045 ns 60527760 ns 11 TracksPropagated=270.686k/s
CPU sync propagation/256 243352156 ns 238947192 ns 3 TracksPropagated=274.27k/s
CUDA unsync propagation/8 10978744501 ns 10920950916 ns 1 TracksPropagated=5.8603/s
CUDA unsync propagation/16 10598858 ns 10279684 ns 49 TracksPropagated=24.9035k/s
CUDA unsync propagation/32 11414197 ns 11390716 ns 59 TracksPropagated=89.8978k/s
CUDA unsync propagation/64 15507599 ns 15278148 ns 45 TracksPropagated=268.095k/s
CUDA unsync propagation/128 41922541 ns 41473470 ns 17 TracksPropagated=395.048k/s
CUDA unsync propagation/256 167457081 ns 167079324 ns 4 TracksPropagated=392.245k/s
CUDA sync propagation/8 4023485 ns 4010791 ns 174 TracksPropagated=15.957k/s
CUDA sync propagation/16 9929345 ns 9907102 ns 68 TracksPropagated=25.84k/s
CUDA sync propagation/32 10773593 ns 10748340 ns 63 TracksPropagated=95.2705k/s
CUDA sync propagation/64 13444032 ns 13413441 ns 50 TracksPropagated=305.365k/s
CUDA sync propagation/128 39619134 ns 39513598 ns 18 TracksPropagated=414.642k/s
CUDA sync propagation/256 161714244 ns 161347023 ns 4 TracksPropagated=406.18k/s
VTune analyzer result for the detray integration test with RK propagation & a constant B-field
Before optimization
After optimization
Is this single or double precision? You may not see that much of a speedup in double precision.
It's double. I will check single precision later.
Are you compiling with support for vector extensions enabled?
Maybe not. I thought a Release build would do that. Do you know how to enable it?
Maybe this?
-ftree-vectorize -march=native -mtune=native -mavx2
You can also ask the compiler to report the vectorization optimizations it performed, with -fopt-info-vec-all and similar flags.
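For a CMake-based build, one way to pass those flags is through the CXX flags cache variable; a sketch under the assumption of a standard CMake layout and a GCC/Clang toolchain (the flag names are compiler specific):

```shell
# Assumption: standard CMake build; -fopt-info-vec-missed makes GCC report
# loops it failed to vectorize, which is handy for finding hotspots.
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-march=native -ftree-vectorize -fopt-info-vec-missed"
cmake --build build
```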
Another thing to consider might be data alignment.
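To illustrate the alignment point: one option (purely a sketch, not something this PR does) is to over-align the per-column storage to the AVX2 register width so the compiler can use aligned vector loads and stores. The type and helper names here are our own:

```cpp
#include <cstdint>

// Illustrative only: a 4-double column padded/aligned to a 32-byte
// boundary, matching the width of an AVX2 ymm register.
struct alignas(32) column4 {
    double v[4];
};

// Runtime check that an instance actually sits on a 32-byte boundary.
inline bool is_avx2_aligned(const column4& c) {
    return reinterpret_cast<std::uintptr_t>(&c) % 32 == 0;
}
```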
The Intel Advisor report of algebra_test_array (compiled with "-march=native" "-ftree-vectorize") says that the matrix multiplication is vectorized:
And as expected, it reports a bunch of non-vectorized operations in partial_pivot_lud: https://github.com/acts-project/algebra-plugins/blob/9e684de5661ca9c1c216d0631e24951ffa10d9c9/math/cmath/include/algebra/math/algorithms/matrix/inverse/partial_pivot_lud.hpp#L67
The partial-pivot LU decomposition is written in a totally vectorization-unfriendly manner: each sub-array of the matrix represents a column, but the Gaussian elimination is done row-wise. We will need to rewrite the partial-pivot LUD later with column-wise Gaussian elimination.
I will skip the partial-pivot LUD optimization in this PR, as it is going to require a significant amount of time.
Free lunch time
What's changed?
Just to help with understanding what's going on in the new matrix multiplication.
It should be noted that the cmath matrix is made of multiple arrays, each of which represents a column of the matrix. Imagine we multiply the 4x4 matrices A and B to obtain the matrix C. If we do the matrix multiplication the usual way, as in the following figure, we have to pick out the rows of A, which cannot be vectorized:
So I came up with the idea of doing the matrix multiplication column-wise, by multiplying a column of A with an element of B:
The following figure shows how a column of C is calculated in the new matrix multiplication:
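The column-wise scheme described above can be sketched as follows; this is a standalone illustration with our own type names, not the actual algebra-plugins code:

```cpp
#include <array>
#include <cstddef>

// Column-major 4x4 matrix: mat[j] is column j, mat[j][i] is row i, col j.
using mat4 = std::array<std::array<double, 4>, 4>;

// Column-wise multiplication: column j of C is built as a linear
// combination of the columns of A, weighted by the elements of
// column j of B. The innermost loop runs over one contiguous column,
// which is the access pattern the compiler can vectorize.
mat4 multiply(const mat4& A, const mat4& B) {
    mat4 C{};
    for (std::size_t j = 0; j < 4; ++j) {        // column of C (and of B)
        for (std::size_t k = 0; k < 4; ++k) {    // column of A
            const double b = B[j][k];            // scalar weight
            for (std::size_t i = 0; i < 4; ++i)  // contiguous column access
                C[j][i] += A[k][i] * b;
        }
    }
    return C;
}
```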