Closed burlen closed 1 year ago
Compiled with gcc -O3 -march=native -mtune=native on an Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz. This is an older CPU.
The AESNI and ARS runs used the 4x32 types, not the 1xm128i type, because the latter does not work with r123::boxmuller or r123::uneg11. It is possible that ARS with the 1xm128i type may be the fastest.
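To illustrate why plain 32-bit words are convenient here, the following is a minimal sketch of a Box-Muller transform that consumes two 32-bit integers and returns two normally distributed values. It is not the Random123 implementation: `to_unit` and `box_muller` are hypothetical names, and the integer-to-float mappings are assumptions chosen for clarity. Presumably the 1xm128i type fails with such helpers because its single 128-bit element is not a plain integer word that can be converted this way.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper: map a 32-bit word to (0, 1]. Zero is excluded so
// that the logarithm below stays finite.
inline double to_unit(std::uint32_t u)
{
    const double x = (static_cast<double>(u) + 1.0) / 4294967296.0; // divide by 2^32
    assert(x > 0.0 && x <= 1.0);
    return x;
}

// Hypothetical helper: Box-Muller on two 32-bit words.
// u0 selects an angle in (0, 2*pi), u1 selects the radius.
inline std::pair<double, double> box_muller(std::uint32_t u0, std::uint32_t u1)
{
    const double two_pi = 6.283185307179586;
    const double theta = two_pi * (static_cast<double>(u0) + 0.5) / 4294967296.0;
    const double r = std::sqrt(-2.0 * std::log(to_unit(u1)));
    return { r * std::cos(theta), r * std::sin(theta) };
}
```

With a 4x32 generator, each call would yield four such words, enough for two invocations of a helper like this without any unpacking.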
The OpenMP parallel implementation shows perfect strong scaling. It was tested on a system with 10 physical cores.
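One plausible reason for the good scaling is that counter-based generators are stateless: each value depends only on a key and a counter, so threads share nothing. The sketch below shows the pattern under that assumption. It is not the benchmarked code; `mix` is a hypothetical stand-in (the SplitMix64 finalizer) rather than one of the Random123 generators, and `fill` is an illustrative name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a counter-based generator: a stateless
// function of (key, counter). This is the SplitMix64 finalizer, used
// only for illustration.
inline std::uint64_t mix(std::uint64_t key, std::uint64_t counter)
{
    std::uint64_t z = key + (counter + 1) * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Fill a vector with n values. Iteration i depends only on (key, i), so
// no state is carried between iterations and the loop can be split across
// threads. Without -fopenmp the pragma is ignored and the loop runs
// serially, producing the same values.
inline std::vector<std::uint64_t> fill(std::uint64_t key, std::size_t n)
{
    std::vector<std::uint64_t> out(n);
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < static_cast<long long>(n); ++i)
        out[static_cast<std::size_t>(i)] = mix(key, static_cast<std::uint64_t>(i));
    return out;
}
```

Because the result is independent of how iterations are assigned to threads, the output is identical for any thread count, which also makes parallel runs reproducible.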
32-bit types are faster than 64-bit types. This makes sense because of vectorization: a SIMD register holds twice as many 32-bit lanes as 64-bit lanes. The 2x types perform the same as the 4x types, possibly because inlining and loop unrolling result in exactly the same code for either.
This was merged in #28.
Some experiments with Random123. These have been included in #28. Note: in #28 I squashed the commits marked squash here.