flintlib / flint

FLINT (Fast Library for Number Theory)
http://www.flintlib.org
GNU Lesser General Public License v3.0

nmod_vec_dot_product enhancements (including avx2 for small moduli) #2018

Open vneiger opened 2 weeks ago

vneiger commented 2 weeks ago

The goal is to introduce the following, for nmod vec's:

  1. Good avx2 performance when the modulus is "small". This roughly means moduli with at most 32 bits, but for efficiency reasons the limit is slightly lower, a bit above 31 bits. Based on draft code, the target of this PR is roughly up to 2**30.5 (going beyond that, it is difficult to maintain the same efficiency).

  2. Better performance when the dot product fits in 3 limbs but we still have enough room to accumulate a few 2-limb products before using add_sssaaaaaa. Basically the target is to avoid any performance reduction in the region of 50- to 62-bit moduli (depending on the length).

One foreseeable issue is that we do not want more "complex" code (e.g. more branching) to penalize small instances, since this is code which is actually used a lot for small instances. For example an indirect target is to accelerate matrix multiplication or matrix-vector products (when the modulus is close to 32 bits or close to 64 bits): there one computes many dot products of the same type (same length, same modulus), and the new code should avoid any performance decrease for small-ish matrices (e.g. 50x50).
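The split-accumulation idea behind goal 1 can be sketched in scalar C (a rough illustration only, not the PR's actual kernel; `dot_mod_small` is a hypothetical name): when the modulus n is below roughly 2**30.5, each product is below 2**61, so eight products can be accumulated in a 64-bit word before any reduction is needed.

```c
#include <stdint.h>

/* Hypothetical sketch, not FLINT API: dot product modulo n for n < 2^30.5.
   Each product a[i]*b[i] is < 2^61, so eight of them fit in a 64-bit
   accumulator; a reduction is only needed once per block of eight. */
static uint64_t
dot_mod_small(const uint64_t * a, const uint64_t * b, long len, uint64_t n)
{
    uint64_t acc = 0;
    long i = 0;
    for (; i + 8 <= len; i += 8)
    {
        uint64_t s = 0;
        for (long j = 0; j < 8; j++)
            s += a[i + j] * b[i + j];   /* each term < 2^61, no overflow */
        acc = (acc + s % n) % n;
    }
    for (; i < len; i++)                /* tail */
        acc = (acc + a[i] * b[i] % n) % n;
    return acc;
}
```

The unrolled inner loop is also the shape that auto-vectorizers (or an explicit avx2 version) can exploit, since the eight independent products map naturally onto vector lanes.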

Questions, written here for help and suggestions:

(*) the non-avx version should not really require it, but the avx version does.

vneiger commented 2 weeks ago

The current commit does not bring AVX and does not make any change to the "bound on number of limbs" logic. In my experiments it is mostly slightly faster (in the above-mentioned "new" cases), sometimes marginally slower, than the existing code. For some reason, with clang on zen4 it brings a good improvement for small moduli; I did not check carefully, but I suppose the loop unrolling in the "small modulus" case helps clang auto-vectorize.

Question: for a small modulus (< 32 bits) and an unsigned long (possibly up to full 64 bits), is there something faster than NMOD_RED, e.g. using doubles, for modular reduction? I had a look at https://github.com/flintlib/flint/issues/1010 and some other places but I did not reach a conclusion.

fredrik-johansson commented 2 weeks ago

I definitely consider _nmod_vec_dot_bound_limbs to be an implementation detail. Feel free to refactor the case selection in any way that makes sense.

Apart from nmod_mat_mul, some useful functions to benchmark are _nmod_poly_mul_classical, _nmod_poly_inv_series_basecase_preinv1 and _nmod_poly_exp_series.

Edit: also nmod_mat_solve_tril / nmod_mat_solve_triu.

fredrik-johansson commented 2 weeks ago

Question: for a small modulus (< 32 bits) and an unsigned long (possibly up to full 64 bits), is there something faster than NMOD_RED, e.g. using doubles, for modular reduction? I had a look at #1010 and some other places but I did not reach a conclusion.

I don't think so. IIRC, I tried at some point to do a 32-bit version of NMOD_RED but didn't notice any speedup. Maybe I missed some trick or that was just my machine though. I imagine that you can do several reductions in parallel a bit faster using avx2.
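For context on the NMOD_RED question, the classical Barrett-style reduction of a full 64-bit value modulo n < 2**32 looks roughly like this (a sketch with hypothetical names, not FLINT's NMOD_RED; it relies on the compiler's unsigned __int128 and costs one widening multiplication plus a small correction):

```c
#include <stdint.h>

typedef struct { uint64_t n; uint64_t ninv; } barrett_t;  /* hypothetical */

/* Precompute ninv ~= floor(2^64 / n) once per modulus. */
static barrett_t barrett_init(uint64_t n)
{
    barrett_t ctx = { n, UINT64_MAX / n };
    return ctx;
}

/* Reduce any 64-bit x modulo n: the quotient estimate q is off by at most
   a couple of units, so a short correction loop suffices. */
static uint64_t barrett_red(uint64_t x, barrett_t ctx)
{
    uint64_t q = (uint64_t)(((__uint128_t)x * ctx.ninv) >> 64);
    uint64_t r = x - q * ctx.n;
    while (r >= ctx.n)
        r -= ctx.n;
    return r;
}
```

Whether this beats NMOD_RED in practice is exactly the open question; the structure is the same kind of precomputed-inverse reduction, just specialized to a < 2**32 modulus.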

BTW, we definitely want to leverage the generics system to have dedicated 8-bit, 16-bit and 32-bit Z/nZ implementations (some code is already in place). Of course, that goal shouldn't obstruct optimizing the existing nmod format for small moduli.

Regarding doubles for modular reduction: I suspect with ulong input and output, the conversions are going to kill the advantage. What we probably want is rather a dedicated double based Z/nZ implementation using the same kind of machinery as in fft_small.

vneiger commented 2 weeks ago

Thanks for your input!

fredrik-johansson commented 1 week ago

Currently in gr/nmod.c there is some special code to select an algorithm quickly for short dot products. Maybe such a special case should be in _nmod_vec_dot_params.

vneiger commented 1 week ago

@fredrik-johansson thanks for your fix on profile/p-dot.c .

Currently in gr/nmod.c there is some special code to select an algorithm quickly for short dot products. Maybe such a special case should be in _nmod_vec_dot_params.

Ok, I will look into this. This is about lines 864--898 of gr/nmod.c, I suppose. This was among my list of things to clarify before finishing this PR. For short dot products, and for code organization in general, it could be nicer to avoid computing len * (mod.n - 1)**2 each time we make a new dot product (_nmod_vec_dot doesn't, but __gr_nmod_vec_dot does, and we also do it implicitly when calling, many times on small input, functions that use dot products).

One approach: once the modulus is fixed, we could "precompute" ranges of dot product lengths for each number of limbs. For example, say up to length 30 then 2 limbs, and beyond that, 3 limbs; or up to length 115 then 1 limb, and beyond that, 2 limbs. If I'm not mistaken, for a given modulus we cannot have lengths (> 0) that are done with 1 limb and others with 3 limbs. So the precomputation could simply yield one length threshold, separating 1->2 limbs (if the modulus is <= 2**32) or 2->3 limbs (if the modulus is > 2**32, forbidding any length > 0 to be done with a single limb).

One would have to store this threshold somewhere. Could it be simply in _dot_params_t? Then an element of this type would be initialized from the modulus, and would contain this threshold. Then _nmod_vec_dot_params would be "simplified", with no computation of len * (mod.n - 1)**2 and only some if conditions with ulong comparisons. (One should still try to avoid repeated calls when possible...)

Is there any downside about adding a third field to this _dot_params struct?
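The single-threshold precomputation described above could look roughly like this (an illustrative sketch; `one_limb_len_threshold` is a hypothetical name, not _nmod_vec_dot_params): for n <= 2**32 the unreduced dot product of length len is at most len * (n-1)**2, so the largest single-limb length follows from one division done once per modulus.

```c
#include <stdint.h>

/* Hypothetical sketch of the proposed precomputation (not FLINT API):
   for a modulus n <= 2^32, the unreduced dot product of length len is at
   most len * (n-1)^2, so the largest length fitting in a single limb is
   floor(UINT64_MAX / (n-1)^2).  Shorter dot products use 1 limb, longer
   ones 2 limbs, so one threshold fully describes the selection. */
static uint64_t one_limb_len_threshold(uint64_t n)
{
    __uint128_t max_term = (__uint128_t)(n - 1) * (n - 1);
    if (max_term == 0)           /* n == 1: everything reduces to 0 */
        return UINT64_MAX;
    if (max_term > UINT64_MAX)   /* modulus above 2^32: 1 limb never fits */
        return 0;
    return UINT64_MAX / (uint64_t)max_term;
}
```

At dot-product time, the limb count would then follow from a single ulong comparison of len against the stored threshold, with no per-call 128-bit arithmetic.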

vneiger commented 1 week ago

Some early, non-final comparisons (new time / old time) via p-dot.c, on an IceLake server. The first line, 0, is for the modulus 2**63. Altogether, there is a nice speed-up where expected, and sometimes some regression.

#0  --> vec dot           
#bits\len        1        5       10       25       50      100      250      500     1000     2500     5000    10000   100000  1000000
 0            1.50     1.89     1.86     2.31     2.59     3.39     3.24     3.58     3.38     3.61     3.70     3.71     1.76     1.18
12            1.05     0.97     0.95     1.07     0.80     1.35     1.29     1.62     1.69     1.63     1.14     1.13     0.89     0.90
28            1.03     0.98     0.96     1.07     0.86     1.33     1.29     1.60     1.68     4.38     3.39     3.40     1.38     1.00
30            1.03     0.97     0.94     1.07     0.80     2.63     2.86     3.91     4.17     4.33     3.39     3.40     1.49     1.00
31            1.02     0.97     0.96     1.39     1.46     2.59     2.87     3.90     4.16     4.37     3.39     3.40     1.38     1.00
32            1.04     0.98     1.09     0.97     0.90     0.95     0.95     0.97     0.98     0.97     0.99     0.99     0.96     0.91
40            1.04     0.98     1.03     0.97     0.94     0.99     0.90     0.88     0.87     0.86     0.95     0.95     0.97     0.94
50            1.03     0.98     1.02     0.97     0.94     0.99     0.89     0.89     0.87     0.85     0.96     0.95     0.97     0.94
60            1.02     0.98     1.04     0.97     0.93     0.99     0.90     0.89     0.87     1.29     1.34     1.35     1.28     1.13
61            1.02     0.99     1.02     0.97     0.94     0.99     0.90     1.28     1.29     1.29     1.34     1.35     1.29     1.13
62            1.02     0.98     1.02     0.97     0.94     1.17     1.22     1.28     1.29     1.29     1.34     1.35     1.30     1.15
63            1.02     0.98     1.03     1.02     1.02     1.17     1.22     1.28     1.29     1.28     1.35     1.35     1.28     1.13

#6  --> mat_mul           
#bits\len        1        5       10       25       50      100      250      500     1000     2500     5000    10000   100000  1000000
 0            1.01     0.98     0.99     0.98     0.98     0.98     0.95     0.87     0.85
12            1.01     1.00     0.99     0.97     0.98     0.99     1.01     0.98     0.91
28            1.01     1.00     0.99     0.98     0.97     0.99     1.01     0.93     0.92
30            1.02     1.00     0.99     0.98     0.97     2.35     2.87     1.91     1.89
31            1.02     1.00     0.99     1.31     1.54     2.35     2.86     1.94     1.90
32            1.02     1.01     0.97     1.01     0.96     0.96     0.96     0.96     0.95
40            1.05     1.01     0.99     0.99     1.02     1.04     0.97     0.96     0.95
50            1.05     1.01     0.98     0.99     1.02     1.04     0.97     0.96     0.95
60            1.04     1.01     0.98     0.99     1.02     1.04     0.97     0.96     0.95
61            1.05     1.01     0.99     0.98     1.03     1.04     0.97     1.28     1.29
62            1.05     0.99     0.99     0.99     1.02     1.19     1.26     1.30     1.29
63            1.03     1.01     0.99     1.02     1.10     1.19     1.26     1.28     1.29

albinahlback commented 1 week ago

What about splitting the dot-product into different functions? Regardless, it looks good!

I'm a little bit concerned why we get a slowdown around 1000 in length. Do you know why that is?

vneiger commented 1 week ago

What about splitting the dot-product into different functions? Regardless, it looks good!

Yes, we would still need the general function which calls the specific algorithms, but I agree it could make the code clearer. And it could be better performance-wise for some purposes: e.g. in _nmod_mat_addmul_transpose_op my early experiments showed it would be better to have the dot algorithm selection outside the two for loops, and inside these loops to call directly the specific function among those you are suggesting to create (maybe the branching of _nmod_vec_dot costs a bit in the current implementation; I have not checked).
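The hoisted-selection idea can be sketched as follows (illustrative names, not FLINT API; assumes operands already reduced modulo n <= 2**32): the method is chosen once per matrix product, and the inner loops call the chosen kernel through a function pointer with no per-entry branching.

```c
#include <stdint.h>

typedef uint64_t (*dot_fn)(const uint64_t *, const uint64_t *, long, uint64_t);

/* Single-limb kernel: valid when len * (n-1)^2 fits in 64 bits. */
static uint64_t dot_1limb(const uint64_t * a, const uint64_t * b, long len, uint64_t n)
{
    uint64_t s = 0;
    for (long i = 0; i < len; i++)
        s += a[i] * b[i];
    return s % n;
}

/* Fallback kernel: reduce after every product (assumes n <= 2^32). */
static uint64_t dot_generic(const uint64_t * a, const uint64_t * b, long len, uint64_t n)
{
    uint64_t s = 0;
    for (long i = 0; i < len; i++)
        s = (s + a[i] * b[i] % n) % n;
    return s;
}

/* Selected once, outside the two matrix-multiplication loops. */
static dot_fn select_dot(uint64_t n, long len)
{
    __uint128_t bound = (__uint128_t)(uint64_t)len * (n - 1) * (n - 1);
    return (bound <= UINT64_MAX) ? dot_1limb : dot_generic;
}
```

Each entry of the product then costs one indirect call instead of re-running the selection logic, which is the kind of saving that matters for small matrices.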

I'm a little bit concerned why we get a slowdown around 1000 in length. Do you know why that is?

Unfortunately not. And it is not a general, easily reproducible fact. On my laptop (zen4) it looks like this:

#0  --> vec dot           
#bits\len        1        5       10       25       50      100      250      500     1000     2500     5000    10000   100000  1000000
 0            1.18     2.08     2.25     3.54     4.51     6.27     5.55     6.92     6.80     5.02     4.76     4.80     4.07     3.12
12            0.82     0.87     0.93     0.93     1.04     1.05     1.03     1.08     1.06     1.16     1.01     1.08     1.00     0.96
28            0.86     0.88     0.91     0.94     0.99     1.04     1.04     1.04     1.07     3.56     3.20     3.27     2.41     1.97
30            0.82     0.86     0.93     0.92     1.02     3.24     3.74     4.13     4.62     3.55     3.17     3.24     2.48     1.99
31            0.82     0.85     0.93     1.60     2.34     3.28     3.63     4.18     4.58     3.68     3.13     3.27     2.63     1.96
32            0.82     1.04     1.01     1.04     1.09     0.99     1.08     1.00     1.12     1.06     1.08     1.06     1.11     1.05
40            0.90     0.86     1.05     1.12     1.13     1.08     1.09     1.12     1.10     1.13     1.11     1.17     1.16     1.15
50            0.88     0.87     1.05     1.11     1.06     1.10     1.08     1.09     1.10     1.13     1.15     1.17     1.15     1.17
60            0.88     0.87     1.06     1.11     1.08     1.10     1.09     1.09     1.13     1.25     1.22     1.27     1.25     1.27
61            0.87     0.88     1.04     1.11     1.06     1.10     1.09     1.21     1.25     1.25     1.26     1.25     1.27     1.24
62            0.89     0.85     1.04     1.10     1.11     1.20     1.27     1.28     1.22     1.25     1.30     1.24     1.27     1.24
63            0.89     0.87     1.04     1.14     1.21     1.19     1.28     1.23     1.24     1.25     1.26     1.28     1.23     1.28

Both this one and the previous timings are on machines with avx512, which can play a role in particular for the 12-bit case where auto-vectorization works well. Here, on another machine without avx512, which does not have the "length 1000" issue either:


#0  --> vec dot           
#bits\len        1        5       10       25       50      100      250      500     1000     2500     5000    10000   100000  1000000
 0            1.17     1.96     1.70     2.06     2.21     2.45     2.48     2.51     2.65     2.50     2.51     2.44     1.77     1.59
12            0.82     0.96     1.01     1.17     1.38     1.99     1.99     2.23     2.27     1.82     1.78     1.66     1.17     1.34
28            0.80     0.94     1.02     1.18     1.39     1.98     1.95     2.16     2.25     2.78     2.81     2.77     1.58     1.65
30            0.79     0.95     1.02     1.20     1.39     2.38     2.66     3.35     3.73     2.81     2.82     2.58     1.68     1.58
31            0.79     0.94     1.03     1.25     1.57     2.33     2.67     3.40     3.72     2.82     2.84     2.68     1.68     1.65
32            0.80     1.01     1.01     0.89     0.93     0.95     0.94     0.93     1.02     0.98     1.01     1.01     0.92     0.98
40            0.95     1.01     1.04     1.08     1.10     1.08     1.07     1.07     1.07     1.12     1.06     1.07     1.13     1.15
50            0.95     1.02     1.03     1.09     1.10     1.08     1.07     1.06     1.07     1.11     1.09     1.11     1.14     1.13
60            0.95     1.02     1.02     1.08     1.09     1.08     1.08     1.07     1.06     1.39     1.39     1.38     1.32     1.32
61            0.94     1.04     1.03     1.07     1.10     1.10     1.08     1.33     1.33     1.41     1.38     1.35     1.39     1.29
62            0.95     1.03     1.03     1.09     1.10     1.24     1.32     1.35     1.36     1.39     1.39     1.39     1.33     1.31
63            0.94     1.04     1.03     1.07     1.20     1.25     1.31     1.31     1.34     1.39     1.38     1.39     1.34     1.32

albinahlback commented 1 week ago

Here, on another machine without avx512, which does not have the "length 1000" issue either

...

Looks like it still has the same problem, just that it is faster than the previous version and does not slow down as much. It is not important IMO, it could be a profiler thing. Just curious.

vneiger commented 6 days ago

Update on the state of this:

This feels close to complete for me, so if you have comments/suggestions, please let me know.

@fredrik-johansson

Currently in gr/nmod.c there is some special code to select an algorithm quickly for short dot products. Maybe such a special case should be in _nmod_vec_dot_params.

I have reworked the parameter selection a bit, so that it does less work than it used to for moduli of 32 bits or less. For the small-length case, I made several attempts, measuring what this could gain, and it did not help as much as I expected. For example I thought it would speed up many 2 x 2 matrix multiplications done successively, since this calls _nmod_vec_dot_params for each product... but that call is already a negligible part of the computation there. It does help a bit (20-30%) for computing a single dot_params + dot at a small length (say < 10), so I will still make a last attempt at this before finishing this PR.

@albinahlback

we get a slowdown around 1000 in length. Do you know why that is?

Maybe I misunderstood your point: I thought you mentioned length 1000 specifically, and the slowdown we observed compared to the previous version (mat_mul has some < 1.00 ratios at length 1000 in the first timings above). If your remark was rather that for lengths around 1000 and more the ratio new version / old version becomes less interesting, then yes, this looks sensible, due to memory accesses. The dot product --- in particular in single-limb cases, when the modulus is < 32 bits --- is quite memory intensive: for each product you load two coefficients, and you never re-use them (if doing avx2, for each 4 products via a single _mm256_mul_epu32 you have to load 2 * 256 bits from memory). So a (non-verified) explanation could be: from lengths around 1000-2000, we need to go beyond the L1 cache, and things become really dominated by the memory accesses. Or something of this kind :-) .
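The cache hypothesis is easy to sanity-check with one line of arithmetic (assuming a 32 KiB L1 data cache, which is typical but machine-dependent; `l1_spill_length` is a made-up helper):

```c
/* Back-of-envelope check of the cache hypothesis: a single-limb dot product
   streams two ulong vectors with no reuse, so its working set is
   2 * len * 8 bytes.  This returns the length at which that working set
   fills a cache of the given size. */
static long l1_spill_length(long cache_bytes)
{
    return cache_bytes / (2 * 8);
}
```

With a 32 KiB L1 the two vectors spill at length 2048, which is at least consistent with the ratios changing somewhere in the 1000-2500 range.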

vneiger commented 6 days ago

Current comparisons between this branch and the main branch (ratio new / old), on 3 machines. Using gcc in all cases. The last two machines have avx512, the first one does not and it seems that gcc does not vectorize well the current dot product on that first machine (this is about the single limb case only: gcc does vectorize it reasonably well on the avx512 machines).

vneiger commented 4 days ago

Comparison after the latest changes (ratios of new times vs. main branch times). The better handling of short dot products does help for polynomial routines. I also inserted comparisons of dot product times including the call to _nmod_vec_dot_params.

fredrik-johansson commented 4 days ago

Perhaps we should keep _nmod_vec_dot_bound_limbs for uses like lu_classical_delayed where we want to bound the size of a dot product but where we don't actually call _nmod_vec_dot (or one of its variants)? Though maybe we want to add more versions of lu_classical_delayed as well...

vneiger commented 4 days ago

For the documentation, does it sound reasonable to not go into the details of the implementation (specific algorithms, choices, ...), and only mention the main interfaces such as _nmod_vec_dot and NMOD_VEC_DOT?

In gr/nmod.c, the short dot algorithm selection should not be needed anymore, since it is incorporated in the general parameters selection. So I removed it.

Currently this gr/nmod.c file uses the macro NMOD_VEC_DOT; could we instead use the function _nmod_vec_dot? (I'm not very familiar with the whole gr framework, so I want to avoid breaking anything.) The reason is that the avx vectorization is not done in the macro, as it does not play nicely with it. The macro accepts a general expression as a function of the vector index i, but to load data into avx registers we need to know a bit more about this function of i. Cf. the differences between the implementations of _nmod_vec_dot2_split, _nmod_vec_dot2_rev_split and _nmod_vec_dot2_ptr_split in nmod_vec/dot.c: the latter is less efficient due to non-contiguous loads, and this is what would happen to the general macro if we tried to put a general avx version in it.

vneiger commented 4 days ago

(I suppose the documentation should still contain a word about the types dot_method_t and dot_params_t.)

vneiger commented 3 days ago

Updated exp_series comparisons: (for that one, and for small len, some timings are quite off in one direction or the other, not consistently... but apart from that, things look similar to before or faster)

Ryzen 7 7840U

#8  --> poly_exp_series        
#bits\len        1        2        3        4        5        7       10       15       25       35       50      100      250      500     1000     2500     5000    10000   100000  1000000
 12           1.10     0.55     1.27     1.19     1.07     1.11     1.06     1.09     1.14     1.13     1.08     1.11     1.07     0.98     0.94     0.92     0.96     0.98
 28           1.04     1.03     7.11     0.60     1.38     1.20     1.22     1.09     1.02     1.10     1.09     1.10     1.06     1.00     0.94     3.65     3.78     3.84
 30           1.02     0.99     0.98     1.11     0.16     0.39     2.51     1.13     1.10     1.10     1.05     1.54     2.54     3.15     3.60     4.54     3.94     3.69
 31           1.08     0.38     1.02     1.80     1.11     1.12     1.05     1.10     1.10     1.18     1.28     1.58     2.78     3.03     3.90     3.99     3.91     3.88
 32           1.02     2.30     1.43     0.57     0.62     0.89     1.03     0.99     1.03     1.08     1.12     1.24     1.52     1.31     1.47     1.40     1.45     1.38
 40           1.03     0.44     0.67     1.62     0.81     0.90     1.12     0.91     0.97     0.93     1.07     0.94     1.02     1.02     1.03     1.04     1.04     1.07
 50           1.06     1.00     7.18     0.99     1.02     1.03     1.01     0.97     0.94     0.97     0.98     0.99     1.12     1.03     1.04     1.10     1.05     1.10
 60           1.01     1.07     0.96     0.55     14.35     1.01     1.14     0.90     0.95     0.93     1.04     0.94     1.03     1.02     1.02     1.27     1.30     1.35
 61           1.06     0.37     0.14     1.81     0.07     0.99     1.13     0.89     0.97     0.78     1.08     0.99     1.02     1.22     1.30     1.40     1.30     1.40
 62           0.98     1.01     7.62     0.98     1.14     2.46     0.98     0.97     0.96     0.83     0.96     1.12     1.26     1.26     1.30     1.43     1.31     1.41
 63           0.99     1.03     1.03     0.92     0.45     1.02     1.11     1.13     1.06     1.06     1.11     1.18     1.23     1.32     1.23     1.34     1.35     1.31
 64           1.03     1.01     0.18     2.01     2.29     1.06     1.07     1.00     1.00     1.03     1.11     1.11     1.18     1.16     1.30     1.27     1.28     1.30

Xeon E7-4820

#8  --> poly_exp_series   
#bits\len        1        2        3        4        5        7       10       15       25       35       50      100      250      500     1000     2500     5000    10000   100000  1000000
 12           1.00     2.40     1.01     0.95     1.04     0.98     1.00     0.98     1.02     1.05     1.04     1.09     1.02     1.02     0.99     0.98     0.96     0.95
 28           1.01     2.40     0.84     0.77     1.03     0.03     0.98     1.04     1.01     1.03     1.05     1.07     1.03     1.02     0.99     1.45     1.31     1.27
 30           1.00     1.00     1.62     0.99     0.30     0.97     1.04     1.04     1.04     1.06     1.03     1.19     1.38     1.55     1.66     1.60     1.33     1.28
 31           1.00     0.42     10.81     1.01     0.76     0.22     1.04     1.03     1.03     1.09     1.12     1.27     1.43     1.55     1.66     1.60     1.34     1.28
 32           0.99     2.42     0.11     2.62     1.01     1.02     0.99     0.99     0.99     0.97     0.93     0.89     0.83     0.73     0.70     0.68     0.67     0.66
 40           1.00     2.41     0.99     2.05     0.95     0.88     0.98     1.00     1.03     1.05     1.06     1.11     1.14     1.07     1.07     1.05     1.00     0.98
 50           1.00     0.42     0.51     6.01     0.97     1.01     0.91     1.00     1.04     1.06     1.07     1.11     1.14     1.07     1.07     1.04     0.99     0.98
 60           0.99     1.02     1.00     0.98     1.09     1.01     1.02     1.00     1.02     1.04     1.06     1.10     1.13     1.07     1.07     1.72     1.76     1.78
 61           0.99     0.42     10.89     0.98     0.91     1.00     1.03     1.01     1.01     1.04     1.05     1.11     1.13     1.55     1.77     1.84     1.80     1.78
 62           1.00     1.01     1.64     2.72     1.30     0.99     1.04     1.00     1.02     1.05     1.07     1.28     1.57     1.71     1.79     1.85     1.79     1.78
 63           1.00     2.41     1.62     1.04     0.26     1.04     1.00     1.01     1.08     1.14     1.21     1.39     1.61     1.72     1.81     1.85     1.78     1.78
 64           1.00     0.99     1.01     12.94     3.55     5.19     1.03     1.14     1.02     1.05     1.03     1.04     1.02     0.98     0.98     0.98     0.98     0.98

vneiger commented 3 days ago

This now seems ready for review.

fredrik-johansson commented 3 days ago

It looks very good! Would be nice to get more consistent timings for the small cases though.

I will give the code a careful review soon.

vneiger commented 3 days ago

I will give the code a careful review soon.

Thanks!

It looks very good! Would be nice to get more consistent timings for the small cases though.

Yes, I agree... fortunately this is only for the exp_series bench (all other functions are pretty consistent, including for the smallest lengths). This has been consistently inconsistent since the beginning of my work on this PR. You can see below the timings for the current main branch; they are already inconsistent:

#8  --> poly_exp_series        
#bits\len        1        2        3        4        5
12        6.94e-09 6.93e-09 6.82e-09 3.61e-08 8.05e-08
28        5.17e-09 5.22e-09 3.81e-08 3.13e-08 6.01e-08
30        5.17e-09 5.31e-09 2.75e-08 5.67e-08 5.18e-09
31        5.44e-09 5.39e-09 5.18e-09 5.71e-08 7.56e-08
32        5.18e-09 1.22e-08 3.99e-08 3.06e-08 4.12e-08
40        5.17e-09 5.31e-09 2.75e-08 5.01e-08 6.11e-08
50        5.33e-09 1.22e-08 3.67e-08 5.68e-08 7.43e-08
60        5.32e-09 5.33e-09 3.83e-08 3.05e-08 7.55e-08
61        5.33e-09 5.23e-09 5.20e-09 5.65e-08 5.18e-09
62        5.19e-09 1.22e-08 3.94e-08 4.83e-08 7.23e-08
63        5.19e-09 1.22e-08 2.89e-08 3.06e-08 3.34e-08
64        5.18e-09 5.31e-09 5.18e-09 6.17e-08 8.15e-08
vneiger commented 3 days ago

By the way, here are the timings (before taking ratios), in case that can help, along with the script I used to compute the ratios: dot_bench.zip