acts-project / algebra-plugins

Mozilla Public License 2.0

Vectorize the cmath matrix operations #116

Closed beomki-yeo closed 4 months ago

beomki-yeo commented 5 months ago

Free lunch time

What's changed?

This is just to help with understanding what is going on in the new matrix multiplication.

It should be noted that the cmath matrix is made of multiple arrays, each of which represents one column of the matrix. Imagine that we multiply 4x4 matrices A and B to obtain the matrix C. If we do the matrix multiplication the usual way, as in the following figure, we have to pick out a row of A, which cannot be vectorized:

[image: row-wise matrix multiplication, picking out a row of A]

So I came up with the idea of doing the matrix multiplication column-wise, by multiplying a column of A by an element of B:

    // Inner loop of the column-wise multiplication: column j of C accumulates
    // column i of A scaled by B(i, j) (matrices are stored as X[column][row]).
    for (size_type k = 0; k < M; ++k) {
      C[j][k] += A[i][k] * B[j][i];
    }

The following figure shows how a column of C is calculated in the new matrix multiplication:

[image: column-wise calculation of a column of C]
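
Putting the loops together, a minimal self-contained sketch of the column-wise scheme could look like the following (the plain std::array-of-columns layout and the multiply name are illustrative assumptions, not the plugin's actual types):

    #include <array>
    #include <cstddef>

    // Illustrative column-major layout: matrix[column][row].
    template <typename scalar_t, std::size_t N>
    using matrix = std::array<std::array<scalar_t, N>, N>;

    // C = A * B, computed column by column: column j of C accumulates
    // column i of A scaled by the scalar B(i, j). The innermost loop runs
    // down a contiguous column, so it can be vectorized.
    template <typename scalar_t, std::size_t N>
    matrix<scalar_t, N> multiply(const matrix<scalar_t, N>& A,
                                 const matrix<scalar_t, N>& B) {
      matrix<scalar_t, N> C{};
      for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < N; ++i) {
          for (std::size_t k = 0; k < N; ++k) {
            C[j][k] += A[i][k] * B[j][i];
          }
        }
      }
      return C;
    }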

beomki-yeo commented 5 months ago

Currently I have only optimized the addition and the constant (scalar) multiplication; as far as I can see, vectorizing the matrix multiplication itself makes the program slower.
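
For clarity, the kind of element-wise loops meant by "addition" and "constant multiplication" could look like the sketch below (the column-major std::array layout is an assumption for illustration only; each loop walks contiguous memory, which is easy for the compiler to auto-vectorize):

    #include <array>
    #include <cstddef>

    // Illustrative column-major layout: matrix[column][row].
    template <typename scalar_t, std::size_t N>
    using matrix = std::array<std::array<scalar_t, N>, N>;

    // Element-wise addition: every column is contiguous, so the inner loop
    // is a straightforward candidate for auto-vectorization.
    template <typename scalar_t, std::size_t N>
    matrix<scalar_t, N> add(const matrix<scalar_t, N>& A, const matrix<scalar_t, N>& B) {
      matrix<scalar_t, N> C;
      for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
          C[j][i] = A[j][i] + B[j][i];
      return C;
    }

    // Multiplication by a constant scalar, same access pattern.
    template <typename scalar_t, std::size_t N>
    matrix<scalar_t, N> scale(scalar_t a, const matrix<scalar_t, N>& A) {
      matrix<scalar_t, N> C;
      for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
          C[j][i] = a * A[j][i];
      return C;
    }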

Before Optimization

Benchmark                         Time             CPU   Iterations  Counters
CPU unsync propagation/8        223808 ns       223285 ns         3101 TracksPropagated=286.629k/s
CPU unsync propagation/16       959241 ns       955581 ns          708 TracksPropagated=267.9k/s
CPU unsync propagation/32      3838758 ns      3756516 ns          178 TracksPropagated=272.593k/s
CPU unsync propagation/64     14608613 ns     14524319 ns           45 TracksPropagated=282.01k/s
CPU unsync propagation/128    63197599 ns     62829071 ns           12 TracksPropagated=260.771k/s
CPU unsync propagation/256   245358319 ns    244832250 ns            3 TracksPropagated=267.677k/s
CPU sync propagation/8          252922 ns       251167 ns         2626 TracksPropagated=254.81k/s
CPU sync propagation/16        1202544 ns      1146934 ns          634 TracksPropagated=223.204k/s
CPU sync propagation/32        5024259 ns      4963774 ns          186 TracksPropagated=206.295k/s
CPU sync propagation/64       15572888 ns     15382790 ns           47 TracksPropagated=266.272k/s
CPU sync propagation/128      62399932 ns     61958364 ns           11 TracksPropagated=264.436k/s
CPU sync propagation/256     249225684 ns    248533783 ns            3 TracksPropagated=263.691k/s
CUDA unsync propagation/8      4139261 ns      4125437 ns          160 TracksPropagated=15.5135k/s
CUDA unsync propagation/16    10304838 ns     10280975 ns           65 TracksPropagated=24.9004k/s
CUDA unsync propagation/32    11759391 ns     11545430 ns           60 TracksPropagated=88.6931k/s
CUDA unsync propagation/64    15486088 ns     15454851 ns           44 TracksPropagated=265.03k/s
CUDA unsync propagation/128   42272228 ns     41622470 ns           17 TracksPropagated=393.634k/s
CUDA unsync propagation/256  173225146 ns    172866587 ns            4 TracksPropagated=379.113k/s
CUDA sync propagation/8        3932279 ns      3921049 ns          178 TracksPropagated=16.3222k/s
CUDA sync propagation/16      10153322 ns     10131947 ns           67 TracksPropagated=25.2666k/s
CUDA sync propagation/32      10911805 ns     10888519 ns           62 TracksPropagated=94.044k/s
CUDA sync propagation/64      13669611 ns     13640279 ns           50 TracksPropagated=300.287k/s
CUDA sync propagation/128     39348997 ns     39250588 ns           18 TracksPropagated=417.421k/s
CUDA sync propagation/256    161579596 ns    158291125 ns            4 TracksPropagated=414.022k/s

After Optimization

Benchmark                         Time             CPU   Iterations  Counters
CPU unsync propagation/8        245712 ns       244354 ns         2676 TracksPropagated=261.915k/s
CPU unsync propagation/16       940747 ns       937644 ns          700 TracksPropagated=273.025k/s
CPU unsync propagation/32      4044877 ns      3915183 ns          188 TracksPropagated=261.546k/s
CPU unsync propagation/64     13939578 ns     13733911 ns           48 TracksPropagated=298.24k/s
CPU unsync propagation/128    58935459 ns     56378559 ns           11 TracksPropagated=290.607k/s
CPU unsync propagation/256   229938854 ns    228334281 ns            3 TracksPropagated=287.018k/s
CPU sync propagation/8          261180 ns       260296 ns         2528 TracksPropagated=245.874k/s
CPU sync propagation/16         947878 ns       942067 ns          724 TracksPropagated=271.743k/s
CPU sync propagation/32        5582073 ns      5339018 ns          100 TracksPropagated=191.796k/s
CPU sync propagation/64       14422517 ns     14356729 ns           48 TracksPropagated=285.302k/s
CPU sync propagation/128      68016045 ns     60527760 ns           11 TracksPropagated=270.686k/s
CPU sync propagation/256     243352156 ns    238947192 ns            3 TracksPropagated=274.27k/s
CUDA unsync propagation/8   10978744501 ns   10920950916 ns            1 TracksPropagated=5.8603/s
CUDA unsync propagation/16    10598858 ns     10279684 ns           49 TracksPropagated=24.9035k/s
CUDA unsync propagation/32    11414197 ns     11390716 ns           59 TracksPropagated=89.8978k/s
CUDA unsync propagation/64    15507599 ns     15278148 ns           45 TracksPropagated=268.095k/s
CUDA unsync propagation/128   41922541 ns     41473470 ns           17 TracksPropagated=395.048k/s
CUDA unsync propagation/256  167457081 ns    167079324 ns            4 TracksPropagated=392.245k/s
CUDA sync propagation/8        4023485 ns      4010791 ns          174 TracksPropagated=15.957k/s
CUDA sync propagation/16       9929345 ns      9907102 ns           68 TracksPropagated=25.84k/s
CUDA sync propagation/32      10773593 ns     10748340 ns           63 TracksPropagated=95.2705k/s
CUDA sync propagation/64      13444032 ns     13413441 ns           50 TracksPropagated=305.365k/s
CUDA sync propagation/128     39619134 ns     39513598 ns           18 TracksPropagated=414.642k/s
CUDA sync propagation/256    161714244 ns    161347023 ns            4 TracksPropagated=406.18k/s

VTune analyzer results for the detray integration test with RK propagation and a constant B-field:

Before optimization

[image: VTune result before optimization]

After optimization

[image: VTune result after optimization]

niermann999 commented 5 months ago

Is this single or double precision? You may not see that much of a speedup in double precision

beomki-yeo commented 5 months ago

It's double. I will check single later

stephenswat commented 5 months ago

Are you compiling with support for vector extensions enabled?

beomki-yeo commented 5 months ago

Maybe not. I thought the Release build would do that. Do you know how to enable it?

Maybe this? -ftree-vectorize -march=native -mtune=native -mavx2

niermann999 commented 5 months ago

You can also ask the compiler to tell you about the vectorization optimizations it did, with -fopt-info-vec-all and similar flags.

niermann999 commented 5 months ago

Another thing to consider is data alignment maybe
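
For illustration, over-aligned column storage could look like the sketch below (this is an assumption about how one might do it, not the plugin's current storage); aligning each column to the SIMD register width lets the compiler emit aligned loads and stores:

    #include <array>
    #include <cstddef>

    // Hypothetical aligned column type: 32-byte alignment matches one AVX2 register.
    template <typename scalar_t, std::size_t N>
    struct alignas(32) aligned_column {
      std::array<scalar_t, N> v;
    };

    // A matrix stored as an array of over-aligned columns.
    template <typename scalar_t, std::size_t ROWS, std::size_t COLS>
    using aligned_matrix = std::array<aligned_column<scalar_t, ROWS>, COLS>;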

beomki-yeo commented 5 months ago

The Intel Advisor report for algebra_test_array (compiled with "-march=native" "-ftree-vectorize") says that the matrix multiplication is vectorized:

[image: Intel Advisor report showing the matrix multiplication as vectorized]

And, as expected, it reports a bunch of non-vectorized operations in partial_pivot_lud: https://github.com/acts-project/algebra-plugins/blob/9e684de5661ca9c1c216d0631e24951ffa10d9c9/math/cmath/include/algebra/math/algorithms/matrix/inverse/partial_pivot_lud.hpp#L67

[image: Intel Advisor report showing non-vectorized loops in partial_pivot_lud]

The partial-pivot LU decomposition is written in a totally anti-vectorized manner: each sub-vector of the matrix represents a column, but the Gaussian elimination is done row-wise. We will need to rewrite partial_pivot_lud later with column-wise Gaussian elimination.
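
To make that suggestion concrete, a single elimination step done column-wise could look like the sketch below (pivot search and row swapping are omitted, and the layout is the same illustrative matrix[column][row] assumption as above, not the plugin's actual code):

    #include <array>
    #include <cstddef>

    // One elimination step of an in-place LU factorization on a column-major
    // N x N matrix A (A[column][row]), for pivot index j. Pivoting is omitted.
    template <typename scalar_t, std::size_t N>
    void eliminate_column(std::array<std::array<scalar_t, N>, N>& A, std::size_t j) {
      // Multipliers for the rows below the pivot, read from the contiguous pivot column j.
      std::array<scalar_t, N> l{};
      for (std::size_t i = j + 1; i < N; ++i) {
        l[i] = A[j][i] / A[j][j];
        A[j][i] = l[i];  // store the L factor in place
      }
      // Update the trailing columns one at a time: each inner loop walks a
      // contiguous column, which the compiler can vectorize.
      for (std::size_t k = j + 1; k < N; ++k) {
        for (std::size_t i = j + 1; i < N; ++i) {
          A[k][i] -= l[i] * A[k][j];
        }
      }
    }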

beomki-yeo commented 5 months ago

I will skip the partial_pivot_lud optimization in this PR, as it is going to require a significant amount of time.