fredrik-johansson opened 2 years ago
Very surprising. I can't imagine for the life of me what we could have overlooked there. So much care went into making that efficient. Is NTL using doubles?
We could do the following, where it makes sense (conversion costs and representation need to be considered):
If n is odd and small enough and we use a balanced representation -n/2...n/2, this does a correctly reduced multiplication:
```c
double dmod_mul(double a, double b, double n, double ninv)
{
    /* magic = 1.5 * 2^52; adding then subtracting it rounds r * ninv
       to the nearest integer (in round-to-nearest mode) */
    double magic = 6755399441055744.0;
    double r = a * b;
    return r - ((r * ninv + magic) - magic) * n;
}
```
This should be good for doing lots of multiplications in parallel with SIMD. How quickly can we add in this representation?
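For the addition question: in the balanced representation it is just an add followed by a conditional fold back into [-n/2, n/2], which compilers can turn into branch-free selects for SIMD. A minimal sketch (`dmod_add` is a hypothetical name, not an existing FLINT function):

```c
#include <assert.h>

/* Balanced-representation modular add: inputs and output in [-n/2, n/2]
   for odd n.  Hypothetical companion to dmod_mul; the ternaries should
   compile to branchless selects, so this vectorizes like the mul. */
double dmod_add(double a, double b, double n)
{
    double r = a + b;                 /* r lies in [-n, n] */
    r -= (r > 0.5 * n) ? n : 0.0;     /* fold down if above  n/2 */
    r += (r < -0.5 * n) ? n : 0.0;    /* fold up   if below -n/2 */
    return r;
}
```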
Another idea: in multimodular algorithms, we generally use primes of the form 2^n + c or 2^n - c where c is small. For multi-word moduli, this can certainly be exploited, but what about nmods?
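The exploit for p = 2^k - c is the usual folding identity 2^k ≡ c (mod p): replace the high part of a double-word value by c times itself and iterate. A sketch under assumed parameters (`red_2kc`, k = 62, and the bound on c are all illustrative choices, not existing FLINT code; requires a compiler with `unsigned __int128`, e.g. GCC/Clang):

```c
#include <stdint.h>

/* Reduce v modulo p = 2^62 - c by folding: since 2^62 = c (mod p),
   v = hi*2^62 + lo is congruent to hi*c + lo.  Two folds plus a short
   correction loop suffice when c is small (say c < 2^10). */
uint64_t red_2kc(unsigned __int128 v, uint64_t c)
{
    const int k = 62;
    const uint64_t p = ((uint64_t) 1 << k) - c;
    const uint64_t mask = ((uint64_t) 1 << k) - 1;

    /* each fold: v -> (v >> k) * c + (v mod 2^k), congruent mod p */
    v = (v >> k) * c + (uint64_t) (v & mask);
    v = (v >> k) * c + (uint64_t) (v & mask);

    uint64_t r = (uint64_t) v;
    while (r >= p)   /* at most a couple of iterations */
        r -= p;
    return r;
}
```

The folds are multiplications by a small constant instead of a full preinverse reduction, which is where the saving over a generic nmod reduction would come from.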
I have at least part of the answer: n_mulmod_preinv requires the inputs to be reduced and does something fast, n_mulmod2_preinv does not require the inputs to be reduced and does something slow.
Our nmod_mul is stupidly doing the same thing as n_mulmod2_preinv. We should basically just change it to do an n_mulmod_preinv instead; this is 2x faster on my machine.
We can also replace many other uses of n_mulmod2_preinv with n_mulmod_preinv throughout Flint.
Some shifts can be avoided when the modulus has exactly FLINT_BITS bits; maybe this is worth optimizing for in various places.
Ditto for nmod_addmul / NMOD_ADDMUL.
> If n is odd and small enough and we use a balanced representation -n/2...n/2, this does a correctly reduced multiplication
how small is small enough? This is assuming no fmadd/fmsub? What about with fmadd/fmsub?
Up to sqrt(2^53) I guess, but I did not check or prove this. You might want to design an entirely different algorithm around fma.
FWIW, flint does better if one uses nmod_mul and n_mulmod_shoup.
After that, the remaining difference seems to be due to n_divrem2_preinv being much slower than MulDivRem.
Several functions for modular arithmetic like n_mulmod2_preinv, n_mod2_preinv, n_ll_mod_preinv, n_lll_mod_preinv don't take a norm as input and therefore need a flint_clz operation, which is redundant in situations where we already have an nmod_t containing this data. There are also probably many places (though not all) where these operations should actually be inlined. Replacing them with nmod_mul, NMOD2_RED2 etc. would be an improvement.
We should think about ways to redesign these interfaces so that they are more obvious. It would be nice to separate those that need normalization from those that do not; not sure how to do that in a user-friendly way, though.
Another point of comparison should be https://math.mit.edu/~drew/ffpoly.html. I think @andrewvsutherland (the author) did a comparison at some point...
Here is a simple function (bernsum_powg) taken from David Harvey's bernmm library (using NTL), and two FLINT versions using the n_precomp and n_preinvert interfaces.
Timings on my machine:
So our n_precomp arithmetic is 1.6x slower than NTL, and our n_preinvert arithmetic is 3x slower. I even cheated here -- NTL has a muldivrem function which we don't, so I put in a plain multiplication which of course will overflow if p is large.