flintlib / flint

FLINT (Fast Library for Number Theory)
http://www.flintlib.org
GNU Lesser General Public License v3.0
401 stars 235 forks source link

Start on tuning suite and start on draft for `n_mod` #1991

Closed albinahlback closed 1 month ago

albinahlback commented 1 month ago

Very preliminary, but more to generate a discussion. I'd be open to changing the name.

I would like to avoid modulo 1 and moduli having the highest bit set in this module. Thus, one can assume that $1 \neq 0$ and that things needs to be normalized when doing arithmetic.

The context structures contains modulus, normalized modulus, inverse of modulus and normalization count.

fredrik-johansson commented 1 month ago

I would like to avoid modulo 1 and moduli having the highest bit set in this module. Thus, one can assume that and that things needs to be normalized when doing arithmetic.

IMHO these restrictions are not useful. It is not much trouble to support mod 1 (it requires special code in very few places), and it is easy to dispatch to optimized functions for different cases (as in _nmod_vec). There will be such branches anyway, e.g. for sums and dot products, where number of terms that can be added before overflowing two limbs varies with the norm.

Personally I think I would try to do this module using inlined pass-by-value underscore methods which can come in versions for restricted moduli, and with non-inlined pass-by-pointer gr methods wrapping those.

albinahlback commented 1 month ago

Hmm, I think I agree.

What about this:

albinahlback commented 1 month ago

At least at first, a base ring interface doesn't give us anything (apart from a coherent interface) since it is not what is usually performance-critical.

albinahlback commented 1 month ago

@fredrik-johansson this is a pretty good draft for tuning, I believe. I think I will remove the n_mod stuff from Makefile.in, but keep them in this commit. Perhaps I could start to implement more multiplication routines and try to tune them.

I still have to find out how to tune better, but one can run it now for _n_mod_vec_(add|sub), which confirms on my computer that the original method actually is slightly faster. Perhaps I need to up the number of runs or something.

albinahlback commented 1 month ago

You can now run make tune and then ./build/tuneup. This will print what should be pushed into flint-mparam.h.

I don't know if the current timer is the best for Arm or not. Preferably, one would use the clock counter, but that requires privilege. Currently, I just default to POSIX clock_gettime.

thofma commented 1 month ago

Thanks for allowing modulus 1!

fredrik-johansson commented 1 month ago

Why the new src/tune instead of following the existing convention of using src/MODULE/tune? The command make tune already exists, and it is annoying to have directories in src that are not modules.

(Though one might ask why we don't use the paths test/MODULE, profile/MODULE, tune/MODULE instead of src/MODULE/test, src/MODULE/profile, src/MODULE/tune...)

fredrik-johansson commented 1 month ago

For n_mod, there needs to be a plan to migrate existing nmod functions and types instead of having duplicate implementations.

I'd consider doing 8-bit, 16-bit and 32-bit versions simultaneously since this is something we want anyway.

albinahlback commented 1 month ago

Why the new src/tune instead of following the existing convention of using src/MODULE/tune? The command make tune already exists, and it is annoying to have directories in src that are not modules.

(Though one might ask why we don't use the paths test/MODULE, profile/MODULE, tune/MODULE instead of src/MODULE/test, src/MODULE/profile, src/MODULE/tune...)

Because the code is very local. What is in src/tune/tune.c and src/tune/ulong_extras/tune_xgcd.c are very much dependent on each other. Would therefore prefer to keep under the same tree.

albinahlback commented 1 month ago

For n_mod, there needs to be a plan to migrate existing nmod functions and types instead of having duplicate implementations.

I'd consider doing 8-bit, 16-bit and 32-bit versions simultaneously since this is something we want anyway.

Agreed. I believe we have to fully look into how code is translated into machine code for this module. I believe the function where it matters the most is the dot-product.

The main goal of this PR shifted a bit, and I'm very fine with removing the current status of the n_mod modules.

math-gout commented 1 month ago

I don't know if the current timer is the best for Arm or not. Preferably, one would use the clock counter, but that requires privilege. Currently, I just default to POSIX clock_gettime.

Looking online I found this : https://github.com/google/benchmark/blob/v1.1.0/src/cycleclock.h

On my M2 Mac with the current implementation I get :

#define N_GCDEXT_METHOD                         1 /* 0.14% faster than 0 */
#define N_MOD_VEC_ADD_METHOD                    0 /* nan% faster than 1 */
#define N_MOD_VEC_SUB_METHOD                    0 /* nan% faster than 1 */

And with mach_absolute_time() I get :

#define N_GCDEXT_METHOD                         1 /* 0.29% faster than 0 */
#define N_MOD_VEC_ADD_METHOD                    0 /* 0.20% faster than 1 */
#define N_MOD_VEC_SUB_METHOD                    1 /* 0.25% faster than 0 */
albinahlback commented 1 month ago

But why does it yield NaN though? Perhaps one should stick with the unsecure option for Arm64, and instruct the user on how to minimize the risk while doing so.

math-gout commented 1 month ago

I dit a printf on tv.tv_nsec and it ends in 3 zeroes, so precisions must be up to microseconds and maybe the tests take less then 1us and it results in a division by zero when calculating delta time in percentage ?

albinahlback commented 1 month ago

I see. I intend to support running tests longer in order to improve precision, but of those architectures where we really intend to optimize (Aarch64 and x86), it is nice to get exact measurements to see where we are at.

albinahlback commented 1 month ago

On my M2 Mac with the current implementation I get :

#define N_GCDEXT_METHOD                         1 /* 0.14% faster than 0 */
#define N_MOD_VEC_ADD_METHOD                    0 /* nan% faster than 1 */
#define N_MOD_VEC_SUB_METHOD                    0 /* nan% faster than 1 */

And with mach_absolute_time() I get :

#define N_GCDEXT_METHOD                         1 /* 0.29% faster than 0 */
#define N_MOD_VEC_ADD_METHOD                    0 /* 0.20% faster than 1 */
#define N_MOD_VEC_SUB_METHOD                    1 /* 0.25% faster than 0 */

Okay, seems like this only affects macOS, and not Linux. Tried cfarm103 vs cfarm104 (both M1 running Debian vs macOS), and this problem only arise on macOS.

albinahlback commented 1 month ago

Perhaps it is better to try to utilize

__asm__ volatile ("mrs\t%0, CNTVCT_EL0" : "=r" (cc));

as that is what FFTW does.

The granularity is pretty rough though (I believe the frequency for most of Aarch64 pieces are around 10-50 MHz, as to compared to the clock cycle at around 3 GHz), but it should be fine for measuring functions that takes a longer time. Not sure if macOS and Linux are using this one.