fredrik-johansson opened this issue 1 year ago
I would expect the difference to be minimal without unrolling the for loop on k, since the compiler should already vectorize that without a problem. Or am I missing something?
This is just an example; there are lots of permutations of manual unrolling and blocking worth trying. It's not at all clear a priori what works best, and I doubt that the compiler knows either.
Unrolling into separate sums is useful for moduli approaching 32 bits (for ulong arithmetic) or 26 bits (for double arithmetic), because we can support larger N without requiring modular reduction or spilling over to a double word. Actually, my original motivation for this test was to investigate various strategies for moduli around 30 bits.
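As a minimal sketch of what is meant, assuming a prime below 2^30 and plain uint64_t arrays (the function name and interface are illustrative only, not FLINT code): the k-loop of a dot product is split into two accumulators that are only reduced periodically.

```c
#include <stdint.h>

/* Sketch only: dot product of length n modulo a prime p < 2^30.
   Each product fits in 64 bits, but only a bounded number of products
   can be summed before a 64-bit accumulator overflows, so the k-loop
   is split into two independent sums that are reduced periodically.
   This supports larger n without a double-word accumulator. */
uint64_t
dot_mod_delayed(const uint64_t * a, const uint64_t * b, long n, uint64_t p)
{
    uint64_t s0 = 0, s1 = 0;
    long k;

    for (k = 0; k + 1 < n; k += 2)
    {
        s0 += a[k] * b[k];
        s1 += a[k + 1] * b[k + 1];

        /* with p < 2^30, eight products per sum (plus a residue < p)
           stay below 2^64, so reduce every 8 loop iterations */
        if ((k & 15) == 14)
        {
            s0 %= p;
            s1 %= p;
        }
    }

    if (k < n)               /* trailing element when n is odd */
        s0 += a[k] * b[k];

    return (s0 % p + s1 % p) % p;
}
```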
Here are benchmarks for matrix multiplication and Gaussian elimination on word-size prime fields, using FLINT (with/without BLAS), using NTL, and using FFLAS-FFPACK (several implementations depending on the size of the field; however, I did not call the "multi-precision" variants, so this is limited to about 30-bit primes).
For matrix multiplication I included square and rectangular cases, including matrix-vector and vector-matrix products. For Gaussian elimination I also used several dimension profiles, as well as a varying rank to see how good the implementations are in terms of rank sensitivity.
All of this was on a computer with AVX-512; see the beginning of the files for more details on the machine. I'll do the same measurements on a laptop without AVX-512 soon.
This is raw data, which is not so readable, but we can already see a few trends. I'll try to extract some plots out of this, to get a better view of the main points where the different libraries' performances diverge. Examples of visible trends:
Warning: small typo in the matmul data: the columns should be rdim idim cdim (row, inner, column dimensions) instead of rank rdim cdim.
Addition: similar benchmarks, but on a processor without AVX-512. At first sight, some of the observations above seem to remain valid. Surprisingly, this seems to indicate that (on this machine) compiling FLINT with BLAS does not provide much advantage for matrix multiplication when the field prime is beyond 20 bits. matmul_lu_noavx512.zip
The basecases for functions like nmod_mat_mul, fmpz_mat_mul, nmod_poly_mul and fmpz_poly_mul can be sped up significantly for half-word-size entries (and maybe bigger entries with more effort) by writing vectorization-friendly code and compiling with -march=native -O3. Some naive code just for illustration:
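A minimal sketch along these lines, assuming row-major uint64_t matrices and entries small enough that the inner dot products do not overflow a 64-bit word (names and interface are illustrative only, not the exact snippet):

```c
#include <stdint.h>

/* Sketch only: naive, vectorization-friendly matmul basecase.
   C is m x r, A is m x n, B is n x r, all row-major.
   The accumulation is done with no intermediate reductions, so this
   assumes n*(p-1)^2 fits in 64 bits, i.e. in practice a modulus
   somewhat below 32 bits depending on the inner dimension n. */
void
matmul_halfword(uint64_t * C, const uint64_t * A, const uint64_t * B,
                long m, long n, long r, uint64_t p)
{
    long i, j, k;

    for (i = 0; i < m; i++)
    {
        for (j = 0; j < r; j++)
            C[i * r + j] = 0;

        for (k = 0; k < n; k++)
        {
            uint64_t aik = A[i * n + k];

            /* contiguous access over rows of B and C: the compiler can
               auto-vectorize this inner loop with -march=native -O3 */
            for (j = 0; j < r; j++)
                C[i * r + j] += aik * B[k * r + j];
        }

        for (j = 0; j < r; j++)
            C[i * r + j] %= p;
    }
}
```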
Sample results on my machine: at least in some cases, we can get close to a factor 2 speedup with vectorized ulong math, and a factor 4 speedup with double. This ignores conversion and transposition costs, which may not be negligible.
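For the double case, the underlying trick is that products of two integers below the modulus are exactly representable in the 53-bit mantissa, so a bounded number of them can be accumulated before a reduction. A minimal sketch, assuming p is below roughly 2^26 and with illustrative naming (not code from this thread):

```c
#include <math.h>

/* Sketch only: dot product modulo a small prime p stored as a double,
   assuming (p-1)^2 < 2^53 so that every product is exact.
   Up to "block" products plus a residue < p fit exactly in the 53-bit
   mantissa before a reduction by fmod is needed; this requires
   block >= 1, i.e. p below roughly 2^26. */
double
dot_mod_double(const double * a, const double * b, long n, double p)
{
    const double two53 = 9007199254740992.0;  /* 2^53 */
    long block = (long) ((two53 - p) / ((p - 1.0) * (p - 1.0)));
    double s = 0.0;
    long k;

    for (k = 0; k < n; k++)
    {
        s += a[k] * b[k];

        if ((k + 1) % block == 0)
            s = fmod(s, p);
    }

    return fmod(s, p);
}
```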