vneiger opened 2 weeks ago
The current commit does not bring AVX and does not change the "bound on number of limbs" logic. In my experiments it is mostly slightly faster (in the above-mentioned "new" cases), sometimes marginally slower, than the existing code. For some reason, with clang on zen4 it brings a good improvement for small moduli; I did not check carefully, but I suppose the loop unrolling in the "small modulus" case helps clang to auto-vectorize.
Question: for a small modulus (< 32 bits) and an unsigned long (possibly up to full 64 bits), is there something faster than `NMOD_RED`, e.g. using doubles, for modular reduction? I had a look at https://github.com/flintlib/flint/issues/1010 and some other places but I did not reach a conclusion.
I definitely consider `_nmod_vec_dot_bound_limbs` to be an implementation detail. Feel free to refactor the case selection in any way that makes sense.
Apart from `nmod_mat_mul`, some useful functions to benchmark are `_nmod_poly_mul_classical`, `_nmod_poly_inv_series_basecase_preinv1` and `_nmod_poly_exp_series`.
Edit: also `nmod_mat_solve_tril` / `nmod_mat_solve_triu`.
> Question: for a small modulus (< 32 bits) and an unsigned long (possibly up to full 64 bits), is there something faster than `NMOD_RED`, e.g. using doubles, for modular reduction? I had a look at #1010 and some other places but I did not reach a conclusion.
I don't think so. IIRC, I tried at some point to do a 32-bit version of `NMOD_RED` but didn't notice any speedup. Maybe I missed some trick, or that was just my machine. I imagine that you can do several reductions in parallel a bit faster using avx2.
BTW, we definitely want to leverage the generics system to have dedicated 8-bit, 16-bit and 32-bit Z/nZ implementations (some code is already in place). Of course, that goal shouldn't obstruct optimizing the existing `nmod` format for small moduli.
Regarding doubles for modular reduction: I suspect that with `ulong` input and output, the conversions are going to kill the advantage. What we probably want is rather a dedicated `double`-based Z/nZ implementation using the same kind of machinery as in `fft_small`.
Thanks for your input!
Currently in `gr/nmod.c` there is some special code to select an algorithm quickly for short dot products. Maybe such a special case should be in `_nmod_vec_dot_params`.
@fredrik-johansson thanks for your fix on `profile/p-dot.c`.
> Currently in `gr/nmod.c` there is some special code to select an algorithm quickly for short dot products. Maybe such a special case should be in `_nmod_vec_dot_params`.
Ok, I will look into this. This is about lines 864--898 of gr/nmod.c, I suppose. This was among my list of things to clarify before finishing this PR. For short dot products, and for code organization in general, it could be nicer to avoid computing `len * (mod.n - 1)**2` each time we make a new dot product (`_nmod_vec_dot` doesn't, but `__gr_nmod_vec_dot` does, and we also do it implicitly when we repeatedly call, on small inputs, functions that use dot products).
One approach: once the modulus is fixed, we could "precompute" ranges of dot product lengths for each number of limbs. For example, say, 2 limbs up to length 30 and 3 limbs beyond that; or 1 limb up to length 115 and 2 limbs beyond that. If I'm not mistaken, for a given modulus we cannot have lengths (> 0) that are done with 1 limb and others with 3 limbs. So the precomputation could simply yield one length threshold, separating 1->2 limbs (if the modulus is <= 2**32) or 2->3 limbs (if the modulus is > 2**32, which forbids any length > 0 from being done with a single limb).
One would have to store this threshold somewhere. Could it be simply in `_dot_params_t`? Then an element of this type would be initialized from the modulus, and would contain this threshold. Then `_nmod_vec_dot_params` would be "simplified", with no computation of `len * (mod.n - 1)**2` and only some if conditions with ulong comparisons. (One should still try to avoid repeated calls when possible...) Is there any downside to adding a third field to this `_dot_params` struct?
Some early, non-final comparisons (new time / old time) via `p-dot.c`, on an IceLake server. The first row, bits 0, is for the modulus 2**63. Altogether, there is a nice speed-up where expected, and sometimes some regression.
#0 --> vec dot
#bits\len 1 5 10 25 50 100 250 500 1000 2500 5000 10000 100000 1000000
0 1.50 1.89 1.86 2.31 2.59 3.39 3.24 3.58 3.38 3.61 3.70 3.71 1.76 1.18
12 1.05 0.97 0.95 1.07 0.80 1.35 1.29 1.62 1.69 1.63 1.14 1.13 0.89 0.90
28 1.03 0.98 0.96 1.07 0.86 1.33 1.29 1.60 1.68 4.38 3.39 3.40 1.38 1.00
30 1.03 0.97 0.94 1.07 0.80 2.63 2.86 3.91 4.17 4.33 3.39 3.40 1.49 1.00
31 1.02 0.97 0.96 1.39 1.46 2.59 2.87 3.90 4.16 4.37 3.39 3.40 1.38 1.00
32 1.04 0.98 1.09 0.97 0.90 0.95 0.95 0.97 0.98 0.97 0.99 0.99 0.96 0.91
40 1.04 0.98 1.03 0.97 0.94 0.99 0.90 0.88 0.87 0.86 0.95 0.95 0.97 0.94
50 1.03 0.98 1.02 0.97 0.94 0.99 0.89 0.89 0.87 0.85 0.96 0.95 0.97 0.94
60 1.02 0.98 1.04 0.97 0.93 0.99 0.90 0.89 0.87 1.29 1.34 1.35 1.28 1.13
61 1.02 0.99 1.02 0.97 0.94 0.99 0.90 1.28 1.29 1.29 1.34 1.35 1.29 1.13
62 1.02 0.98 1.02 0.97 0.94 1.17 1.22 1.28 1.29 1.29 1.34 1.35 1.30 1.15
63 1.02 0.98 1.03 1.02 1.02 1.17 1.22 1.28 1.29 1.28 1.35 1.35 1.28 1.13
#6 --> mat_mul
#bits\len 1 5 10 25 50 100 250 500 1000 2500 5000 10000 100000 1000000
0 1.01 0.98 0.99 0.98 0.98 0.98 0.95 0.87 0.85
12 1.01 1.00 0.99 0.97 0.98 0.99 1.01 0.98 0.91
28 1.01 1.00 0.99 0.98 0.97 0.99 1.01 0.93 0.92
30 1.02 1.00 0.99 0.98 0.97 2.35 2.87 1.91 1.89
31 1.02 1.00 0.99 1.31 1.54 2.35 2.86 1.94 1.90
32 1.02 1.01 0.97 1.01 0.96 0.96 0.96 0.96 0.95
40 1.05 1.01 0.99 0.99 1.02 1.04 0.97 0.96 0.95
50 1.05 1.01 0.98 0.99 1.02 1.04 0.97 0.96 0.95
60 1.04 1.01 0.98 0.99 1.02 1.04 0.97 0.96 0.95
61 1.05 1.01 0.99 0.98 1.03 1.04 0.97 1.28 1.29
62 1.05 0.99 0.99 0.99 1.02 1.19 1.26 1.30 1.29
63 1.03 1.01 0.99 1.02 1.10 1.19 1.26 1.28 1.29
What about splitting the dot-product into different functions? Regardless, it looks good!
I'm a little bit concerned why we get a slowdown around 1000 in length. Do you know why that is?
> What about splitting the dot-product into different functions? Regardless, it looks good!
Yes, we would still need the general function which calls the specific algorithms, but I agree it could make the code clearer. And it could be better performance-wise for some purposes: e.g. in `_nmod_mat_addmul_transpose_op` my early experiments showed it would be better to have the dot algorithm selection outside the two for loops, and inside these loops to directly call the specific function among those you are suggesting to create (maybe the branching of `_nmod_vec_dot` costs a bit in the current implementation; I have not checked).
> I'm a little bit concerned why we get a slowdown around 1000 in length. Do you know why that is?
Unfortunately not. And it is not a general, easily reproducible fact. On my laptop (zen4) it looks like this:
#0 --> vec dot
#bits\len 1 5 10 25 50 100 250 500 1000 2500 5000 10000 100000 1000000
0 1.18 2.08 2.25 3.54 4.51 6.27 5.55 6.92 6.80 5.02 4.76 4.80 4.07 3.12
12 0.82 0.87 0.93 0.93 1.04 1.05 1.03 1.08 1.06 1.16 1.01 1.08 1.00 0.96
28 0.86 0.88 0.91 0.94 0.99 1.04 1.04 1.04 1.07 3.56 3.20 3.27 2.41 1.97
30 0.82 0.86 0.93 0.92 1.02 3.24 3.74 4.13 4.62 3.55 3.17 3.24 2.48 1.99
31 0.82 0.85 0.93 1.60 2.34 3.28 3.63 4.18 4.58 3.68 3.13 3.27 2.63 1.96
32 0.82 1.04 1.01 1.04 1.09 0.99 1.08 1.00 1.12 1.06 1.08 1.06 1.11 1.05
40 0.90 0.86 1.05 1.12 1.13 1.08 1.09 1.12 1.10 1.13 1.11 1.17 1.16 1.15
50 0.88 0.87 1.05 1.11 1.06 1.10 1.08 1.09 1.10 1.13 1.15 1.17 1.15 1.17
60 0.88 0.87 1.06 1.11 1.08 1.10 1.09 1.09 1.13 1.25 1.22 1.27 1.25 1.27
61 0.87 0.88 1.04 1.11 1.06 1.10 1.09 1.21 1.25 1.25 1.26 1.25 1.27 1.24
62 0.89 0.85 1.04 1.10 1.11 1.20 1.27 1.28 1.22 1.25 1.30 1.24 1.27 1.24
63 0.89 0.87 1.04 1.14 1.21 1.19 1.28 1.23 1.24 1.25 1.26 1.28 1.23 1.28
Both this one and the previous timings are from machines with avx512, which can play a role, in particular for the 12-bit case where auto-vectorization works well. Here is another machine without avx512, which does not show the "length 1000" issue either:
#0 --> vec dot
#bits\len 1 5 10 25 50 100 250 500 1000 2500 5000 10000 100000 1000000
0 1.17 1.96 1.70 2.06 2.21 2.45 2.48 2.51 2.65 2.50 2.51 2.44 1.77 1.59
12 0.82 0.96 1.01 1.17 1.38 1.99 1.99 2.23 2.27 1.82 1.78 1.66 1.17 1.34
28 0.80 0.94 1.02 1.18 1.39 1.98 1.95 2.16 2.25 2.78 2.81 2.77 1.58 1.65
30 0.79 0.95 1.02 1.20 1.39 2.38 2.66 3.35 3.73 2.81 2.82 2.58 1.68 1.58
31 0.79 0.94 1.03 1.25 1.57 2.33 2.67 3.40 3.72 2.82 2.84 2.68 1.68 1.65
32 0.80 1.01 1.01 0.89 0.93 0.95 0.94 0.93 1.02 0.98 1.01 1.01 0.92 0.98
40 0.95 1.01 1.04 1.08 1.10 1.08 1.07 1.07 1.07 1.12 1.06 1.07 1.13 1.15
50 0.95 1.02 1.03 1.09 1.10 1.08 1.07 1.06 1.07 1.11 1.09 1.11 1.14 1.13
60 0.95 1.02 1.02 1.08 1.09 1.08 1.08 1.07 1.06 1.39 1.39 1.38 1.32 1.32
61 0.94 1.04 1.03 1.07 1.10 1.10 1.08 1.33 1.33 1.41 1.38 1.35 1.39 1.29
62 0.95 1.03 1.03 1.09 1.10 1.24 1.32 1.35 1.36 1.39 1.39 1.39 1.33 1.31
63 0.94 1.04 1.03 1.07 1.20 1.25 1.31 1.31 1.34 1.39 1.38 1.39 1.34 1.32
> Here, on another machine without avx512, which does not have the "length 1000" issue either
> ...
Looks like it still has the same problem, just that it is faster than the previous version and does not slow down as much. It is not important IMO, it could be a profiler thing. Just curious.
Update on the state of this: the reworked `dot_params` selection was beneficial, as well as handling small lengths explicitly. This feels close to complete for me, so if you have comments/suggestions, please let me know.
@fredrik-johansson
> Currently in `gr/nmod.c` there is some special code to select an algorithm quickly for short dot products. Maybe such a special case should be in `_nmod_vec_dot_params`.
I have reworked the parameter selection a bit, so that it does less work than it used to for moduli of 32 bits or less. For the small length case, I have made several attempts, measuring what this could gain, and it did not help as much as I expected. For example, I thought it would speed up many 2 x 2 matrix multiplications done in succession, since this calls `_nmod_vec_dot_params` for each product... but it already remains a negligible part of the computation there. It does help a bit (20-30%) for computing a single `dot_params` + `dot` at a small length (say < 10), so I will still have a last attempt at this before finishing this PR.
@albinahlback
> we get a slowdown around 1000 in length. Do you know why that is?
Maybe I misunderstood your point: I thought you meant length 1000 specifically, and the slowdown we observed compared to the previous version (mat_mul has some < 1.00 ratios at length 1000 in the first timings above). If your remark was rather that for lengths around 1000 and more the ratio new version / old version becomes less interesting, then yes, this looks sensible, due to memory accesses. The dot product --- in particular in single limb cases, when the modulus is < 32 bits --- is quite memory intensive: for each product you load two coefficients, and you never re-use them (if doing avx2, for each 4 products via a single `_mm256_mul_epu32` you have to load 2 * 256 bits from memory). So a (non-verified) explanation could be: from lengths around 1000-2000, we need to go beyond the L1 cache, and things become really dominated by the memory accesses. Or something of this kind :-) .
Current comparisons between this branch and the main branch (ratio new / old), on 3 machines. Using gcc in all cases. The last two machines have avx512, the first one does not and it seems that gcc does not vectorize well the current dot product on that first machine (this is about the single limb case only: gcc does vectorize it reasonably well on the avx512 machines).
Intel(R) Xeon(R) CPU E7-4820 v4 dot_posso_cmp.txt
Intel(R) Xeon(R) Gold 6354 dot_groebner_cmp.txt
AMD Ryzen 7 PRO 7840U dot_znver4_cmp.txt
Comparison after the last changes (ratios of new times vs. times for the main branch). The better handling of short dot products does help for polynomial routines. I also inserted comparisons of dot product times including the call to `_nmod_vec_dot_params`.
Intel(R) Xeon(R) CPU E7-4820 v4 dot_posso_cmp.txt
Intel(R) Xeon(R) Gold 6354 dot_groebner_cmp.txt
AMD Ryzen 7 PRO 7840U dot_znver4_cmp.txt
Perhaps we should keep `_nmod_vec_dot_bound_limbs` for uses like `lu_classical_delayed`, where we want to bound the size of a dot product but don't actually call `_nmod_vec_dot` (or one of its variants)? Though maybe we want to add more versions of `lu_classical_delayed` as well...
For the documentation, does it sound reasonable to not go into the details of the implementation (specific algorithms, choices, ...), and only mention the main interfaces such as `_nmod_vec_dot` and `NMOD_VEC_DOT`?
In `gr/nmod.c`, the short dot algorithm selection should not be needed anymore, since it is incorporated in the general parameter selection. So I removed it.
Currently this `gr/nmod.c` file uses the macro `NMOD_VEC_DOT`; could we instead use the function `_nmod_vec_dot`? (I'm not very familiar with the whole `gr` framework, so I want to avoid breaking anything.) The reason is that the avx-vectorization is not done in the macro, as it does not play nicely with it. The macro accepts a general expression as a function of the vector index i, but for loading data into avx registers we need to know a bit more about this function of i. Cf. the different implementations of `_nmod_vec_dot2_split`, `_nmod_vec_dot2_rev_split` and `_nmod_vec_dot2_ptr_split` in `nmod_vec/dot.c`: the latter is less efficient due to the non-contiguous loads, and this is what would happen to the general macro if we tried to put a general avx version therein.
(I suppose the documentation should still contain a word about the types `dot_method_t` and `dot_params_t`.)
Updated exp_series comparisons (for that one, and for small len, some timings are quite off in one direction or the other, not consistently... but apart from that, things look similar to before, or faster):
Ryzen 7 7840U
#8 --> poly_exp_series
#bits\len 1 2 3 4 5 7 10 15 25 35 50 100 250 500 1000 2500 5000 10000 100000 1000000
12 1.10 0.55 1.27 1.19 1.07 1.11 1.06 1.09 1.14 1.13 1.08 1.11 1.07 0.98 0.94 0.92 0.96 0.98
28 1.04 1.03 7.11 0.60 1.38 1.20 1.22 1.09 1.02 1.10 1.09 1.10 1.06 1.00 0.94 3.65 3.78 3.84
30 1.02 0.99 0.98 1.11 0.16 0.39 2.51 1.13 1.10 1.10 1.05 1.54 2.54 3.15 3.60 4.54 3.94 3.69
31 1.08 0.38 1.02 1.80 1.11 1.12 1.05 1.10 1.10 1.18 1.28 1.58 2.78 3.03 3.90 3.99 3.91 3.88
32 1.02 2.30 1.43 0.57 0.62 0.89 1.03 0.99 1.03 1.08 1.12 1.24 1.52 1.31 1.47 1.40 1.45 1.38
40 1.03 0.44 0.67 1.62 0.81 0.90 1.12 0.91 0.97 0.93 1.07 0.94 1.02 1.02 1.03 1.04 1.04 1.07
50 1.06 1.00 7.18 0.99 1.02 1.03 1.01 0.97 0.94 0.97 0.98 0.99 1.12 1.03 1.04 1.10 1.05 1.10
60 1.01 1.07 0.96 0.55 14.35 1.01 1.14 0.90 0.95 0.93 1.04 0.94 1.03 1.02 1.02 1.27 1.30 1.35
61 1.06 0.37 0.14 1.81 0.07 0.99 1.13 0.89 0.97 0.78 1.08 0.99 1.02 1.22 1.30 1.40 1.30 1.40
62 0.98 1.01 7.62 0.98 1.14 2.46 0.98 0.97 0.96 0.83 0.96 1.12 1.26 1.26 1.30 1.43 1.31 1.41
63 0.99 1.03 1.03 0.92 0.45 1.02 1.11 1.13 1.06 1.06 1.11 1.18 1.23 1.32 1.23 1.34 1.35 1.31
64 1.03 1.01 0.18 2.01 2.29 1.06 1.07 1.00 1.00 1.03 1.11 1.11 1.18 1.16 1.30 1.27 1.28 1.30
Xeon E7-4820
#8 --> poly_exp_series
#bits\len 1 2 3 4 5 7 10 15 25 35 50 100 250 500 1000 2500 5000 10000 100000 1000000
12 1.00 2.40 1.01 0.95 1.04 0.98 1.00 0.98 1.02 1.05 1.04 1.09 1.02 1.02 0.99 0.98 0.96 0.95
28 1.01 2.40 0.84 0.77 1.03 0.03 0.98 1.04 1.01 1.03 1.05 1.07 1.03 1.02 0.99 1.45 1.31 1.27
30 1.00 1.00 1.62 0.99 0.30 0.97 1.04 1.04 1.04 1.06 1.03 1.19 1.38 1.55 1.66 1.60 1.33 1.28
31 1.00 0.42 10.81 1.01 0.76 0.22 1.04 1.03 1.03 1.09 1.12 1.27 1.43 1.55 1.66 1.60 1.34 1.28
32 0.99 2.42 0.11 2.62 1.01 1.02 0.99 0.99 0.99 0.97 0.93 0.89 0.83 0.73 0.70 0.68 0.67 0.66
40 1.00 2.41 0.99 2.05 0.95 0.88 0.98 1.00 1.03 1.05 1.06 1.11 1.14 1.07 1.07 1.05 1.00 0.98
50 1.00 0.42 0.51 6.01 0.97 1.01 0.91 1.00 1.04 1.06 1.07 1.11 1.14 1.07 1.07 1.04 0.99 0.98
60 0.99 1.02 1.00 0.98 1.09 1.01 1.02 1.00 1.02 1.04 1.06 1.10 1.13 1.07 1.07 1.72 1.76 1.78
61 0.99 0.42 10.89 0.98 0.91 1.00 1.03 1.01 1.01 1.04 1.05 1.11 1.13 1.55 1.77 1.84 1.80 1.78
62 1.00 1.01 1.64 2.72 1.30 0.99 1.04 1.00 1.02 1.05 1.07 1.28 1.57 1.71 1.79 1.85 1.79 1.78
63 1.00 2.41 1.62 1.04 0.26 1.04 1.00 1.01 1.08 1.14 1.21 1.39 1.61 1.72 1.81 1.85 1.78 1.78
64 1.00 0.99 1.01 12.94 3.55 5.19 1.03 1.14 1.02 1.05 1.03 1.04 1.02 0.98 0.98 0.98 0.98 0.98
This now seems ready for review.
It looks very good! Would be nice to get more consistent timings for the small cases though.
I will give the code a careful review soon.
> I will give the code a careful review soon.
Thanks!
> It looks very good! Would be nice to get more consistent timings for the small cases though.
Yes, I agree... fortunately this is only the case for the exp_series bench (all other functions are pretty consistent, including for the smallest lengths). This has been consistently inconsistent since the beginning of my work on this PR. You can see below the timings for the current main branch; these are already inconsistent:
#8 --> poly_exp_series
#bits\len 1 2 3 4 5
12 6.94e-09 6.93e-09 6.82e-09 3.61e-08 8.05e-08
28 5.17e-09 5.22e-09 3.81e-08 3.13e-08 6.01e-08
30 5.17e-09 5.31e-09 2.75e-08 5.67e-08 5.18e-09
31 5.44e-09 5.39e-09 5.18e-09 5.71e-08 7.56e-08
32 5.18e-09 1.22e-08 3.99e-08 3.06e-08 4.12e-08
40 5.17e-09 5.31e-09 2.75e-08 5.01e-08 6.11e-08
50 5.33e-09 1.22e-08 3.67e-08 5.68e-08 7.43e-08
60 5.32e-09 5.33e-09 3.83e-08 3.05e-08 7.55e-08
61 5.33e-09 5.23e-09 5.20e-09 5.65e-08 5.18e-09
62 5.19e-09 1.22e-08 3.94e-08 4.83e-08 7.23e-08
63 5.19e-09 1.22e-08 2.89e-08 3.06e-08 3.34e-08
64 5.18e-09 5.31e-09 5.18e-09 6.17e-08 8.15e-08
By the way, here are the timings (before taking ratios), in case that can help, along with the script I used to compute the ratios: dot_bench.zip
The goal is to introduce the following, for nmod vec's:
- Good avx2 performance when the modulus is "small". This roughly means moduli with at most 32 bits, but for efficiency reasons the limit would be slightly lower, a bit above 31 bits. Based on draft code, the target of this PR is roughly up to 2**30.5 (trying to go beyond, it is difficult to maintain the same efficiency).
- Better performance when the dot product fits on 3 limbs but we still have enough room to accumulate a few 2-limb products before using add_sssaaaaaa. Basically the target is to avoid any performance reduction when we are in the region of 50- to 62-bit moduli (depending on the length).
One foreseeable issue is that we do not want more "complex" code (e.g. more branching) to penalize small instances, since this code is actually used a lot on small instances. For example, an indirect target is to accelerate matrix multiplication or matrix-vector products (when the modulus is close to 32 bits or close to 64 bits): there, one computes many dot products of the same type (same length, same modulus), and the new code should avoid any performance decrease for small-ish matrices (e.g. 50x50).
Questions, written here for help and suggestions:
- Regarding benchmarks: measuring the performance of dot products for small lengths is not that trivial, and in fact maybe not that meaningful: what matters most is the performance within code that uses such products. Apart from matrix multiplication, could someone suggest a few additional flint routines that could be targeted, to get confident that no performance regression is introduced?
- The current flint NMOD_VEC_DOT macro already has 4 or 5 cases (based on the number of limbs of the product, but also other things). The new code has a couple more. The plan would be to slightly change the logic, which is currently to compute the number of limbs, use it to separate cases, and then separate some additional cases within these. One could rather directly find a "case number" ("one limb", "small modulus", "two limbs", "three limbs but room for accumulation", etc.). In my draft experiments, this helps performance for small dot products. This means some code changes here and there. Something that is not clear to me: should a change in `_nmod_vec_dot_bound_limbs` be done via some deprecation? (I'm wondering whether this helper function is considered as, say, "for internal use, subject to change".)
- In addition, for a "small" modulus, the new code would require(*) computing some integer like 2**56 mod n, i.e. one which depends only on the modulus n. Recomputing this for every new dot product seems wasteful for repeated dot products of the same length/modulus, so this computation may as well be incorporated into the new "case distinguishing" approach: if we see we are in the "small modulus" case, then compute this number and return it somehow.

(*) the non-avx version should not really require it, but the avx version does.