Closed albinahlback closed 1 month ago
I would like to avoid modulo 1 and moduli having the highest bit set in this module. Thus, one can assume that and that things needs to be normalized when doing arithmetic.
IMHO these restrictions are not useful. It is not much trouble to support mod 1 (it requires special code in very few places), and it is easy to dispatch to optimized functions for different cases (as in _nmod_vec
). There will be such branches anyway, e.g. for sums and dot products, where number of terms that can be added before overflowing two limbs varies with the norm.
Personally I think I would try to do this module using inlined pass-by-value underscore methods which can come in versions for restricted moduli, and with non-inlined pass-by-pointer gr
methods wrapping those.
Hmm, I think I agree.
What about this:
fmpz_mod
).At least at first, a base ring interface doesn't give us anything (apart from a coherent interface) since it is not what is usually performance-critical.
@fredrik-johansson this is a pretty good draft for tuning, I believe. I think I will remove the n_mod
stuff from Makefile.in
, but keep them in this commit. Perhaps I could start to implement more multiplication routines and try to tune them.
I still have to find out how to tune better, but one can run it now for _n_mod_vec_(add|sub)
, which confirms on my computer that the original method actually is slightly faster. Perhaps I need to up the number of runs or something.
You can now run make tune
and then ./build/tuneup
. This will print what should be pushed into flint-mparam.h
.
I don't know if the current timer is the best for Arm or not. Preferably, one would use the clock counter, but that requires privilege. Currently, I just default to POSIX clock_gettime
.
Thanks for allowing modulus 1!
Why the new src/tune
instead of following the existing convention of using src/MODULE/tune
? The command make tune
already exists, and it is annoying to have directories in src
that are not modules.
(Though one might ask why we don't use the paths test/MODULE
, profile/MODULE
, tune/MODULE
instead of src/MODULE/test
, src/MODULE/profile
, src/MODULE/tune
...)
For n_mod
, there needs to be a plan to migrate existing nmod
functions and types instead of having duplicate implementations.
I'd consider doing 8-bit, 16-bit and 32-bit versions simultaneously since this is something we want anyway.
Why the new
src/tune
instead of following the existing convention of usingsrc/MODULE/tune
? The commandmake tune
already exists, and it is annoying to have directories insrc
that are not modules.(Though one might ask why we don't use the paths
test/MODULE
,profile/MODULE
,tune/MODULE
instead ofsrc/MODULE/test
,src/MODULE/profile
,src/MODULE/tune
...)
Because the code is very local. What is in src/tune/tune.c
and src/tune/ulong_extras/tune_xgcd.c
are very much dependent on each other. Would therefore prefer to keep under the same tree.
For
n_mod
, there needs to be a plan to migrate existingnmod
functions and types instead of having duplicate implementations.I'd consider doing 8-bit, 16-bit and 32-bit versions simultaneously since this is something we want anyway.
Agreed. I believe we have to fully look into how code is translated into machine code for this module. I believe the function where it matters the most is the dot-product.
The main goal of this PR shifted a bit, and I'm very fine with removing the current status of the n_mod
modules.
I don't know if the current timer is the best for Arm or not. Preferably, one would use the clock counter, but that requires privilege. Currently, I just default to POSIX
clock_gettime
.
Looking online I found this : https://github.com/google/benchmark/blob/v1.1.0/src/cycleclock.h
On my M2 Mac with the current implementation I get :
#define N_GCDEXT_METHOD 1 /* 0.14% faster than 0 */
#define N_MOD_VEC_ADD_METHOD 0 /* nan% faster than 1 */
#define N_MOD_VEC_SUB_METHOD 0 /* nan% faster than 1 */
And with mach_absolute_time()
I get :
#define N_GCDEXT_METHOD 1 /* 0.29% faster than 0 */
#define N_MOD_VEC_ADD_METHOD 0 /* 0.20% faster than 1 */
#define N_MOD_VEC_SUB_METHOD 1 /* 0.25% faster than 0 */
But why does it yield NaN though? Perhaps one should stick with the unsecure option for Arm64, and instruct the user on how to minimize the risk while doing so.
I dit a printf on tv.tv_nsec
and it ends in 3 zeroes, so precisions must be up to microseconds and maybe the tests take less then 1us
and it results in a division by zero when calculating delta time in percentage ?
I see. I intend to support running tests longer in order to improve precision, but of those architectures where we really intend to optimize (Aarch64 and x86), it is nice to get exact measurements to see where we are at.
On my M2 Mac with the current implementation I get :
#define N_GCDEXT_METHOD 1 /* 0.14% faster than 0 */ #define N_MOD_VEC_ADD_METHOD 0 /* nan% faster than 1 */ #define N_MOD_VEC_SUB_METHOD 0 /* nan% faster than 1 */
And with
mach_absolute_time()
I get :#define N_GCDEXT_METHOD 1 /* 0.29% faster than 0 */ #define N_MOD_VEC_ADD_METHOD 0 /* 0.20% faster than 1 */ #define N_MOD_VEC_SUB_METHOD 1 /* 0.25% faster than 0 */
Okay, seems like this only affects macOS, and not Linux. Tried cfarm103 vs cfarm104 (both M1 running Debian vs macOS), and this problem only arise on macOS.
Perhaps it is better to try to utilize
__asm__ volatile ("mrs\t%0, CNTVCT_EL0" : "=r" (cc));
as that is what FFTW does.
The granularity is pretty rough though (I believe the frequency for most of Aarch64 pieces are around 10-50 MHz, as to compared to the clock cycle at around 3 GHz), but it should be fine for measuring functions that takes a longer time. Not sure if macOS and Linux are using this one.
Very preliminary, but more to generate a discussion. I'd be open to changing the name.
I would like to avoid modulo 1 and moduli having the highest bit set in this module. Thus, one can assume that $1 \neq 0$ and that things needs to be normalized when doing arithmetic.
The context structures contains modulus, normalized modulus, inverse of modulus and normalization count.