@tdimitri why did you close this? Did you figure out what the problem is?
I meant to merge and pull, so my apologies about that. I put your change in. Perhaps the code should always call init() at load time, and then we could add a new switch, pn.enable() / pn.disable(), that just enables or disables everything globally. I think that is most likely how others will use it: once "import pnumpy" is done, it will be enabled by default. Rough usage of the proposed switch is sketched below.
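A minimal usage sketch of that proposed switch; pn.enable() / pn.disable() are the names proposed above, not an existing pnumpy API:

```python
import pnumpy as pn  # proposed: hooks initialized and enabled at import

pn.disable()  # proposed switch: fall back to stock numpy loops everywhere
# ... run something with plain numpy behavior ...
pn.enable()   # proposed switch: turn all the hooks back on
```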
As for why the times you posted are always the same: Jack mentioned something about Linux behaving differently, but I have not investigated it myself.
Checking in a faster np.isinf routine for float32. This was 50% of the problem, but there are still other issues with the benchmark. After the check-in, the np.isinf() routine for float32 should be faster than numpy's (at least it is on my Intel chip).
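For context, the standard bit-level test for float32 infinity (exponent all ones, mantissa zero) vectorizes well, which is why a hand-rolled routine can beat a generic loop. This is a minimal numpy sketch of that test, not the actual atop implementation:

```python
import numpy as np

def isinf_f32(a: np.ndarray) -> np.ndarray:
    # +/-inf in IEEE-754 binary32 is exponent all ones, mantissa zero,
    # so masking off the sign bit and comparing against 0x7f800000
    # detects both infinities with one integer compare per element.
    bits = a.view(np.int32) & 0x7fffffff
    return bits == np.int32(0x7f800000)

x = np.array([1.0, np.inf, -np.inf, np.nan], dtype=np.float32)
print(isinf_f32(x))  # [False  True  True False]
```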
So all I will look at are the float32/float64 times. When I do, I see this:
rank  dtype    nthreads  atop=False   atop=True
1     float32  0         125±0.06μs   125±0.1μs
1     float32  2         128±0.1μs    634±0.3μs
1     float32  4         128±0.1μs    634±0.5μs
This looks wrong: when there are 0 extra threads, it looks like atop is not enabled, because back then atop should have produced a slower time, yet the times are identical for 0 threads.
Next question: does the benchmark code initialize the array with np.ones or np.zeros? If so, that is an oversight, since we don't have ones/zeros hooked yet, and thus the main thread will own the cache lines of the entire array. A possible way to fix this is to add += 0 or something like that, which should call our add routine and fix the cache lines. Also make sure the SAME output buffer is used (as opposed to creating a new output buffer every time). What I usually do is NOT time the first run: the first untimed dry run warms up the code cache and data cache lines. Then, after the first dry run, start timing. A rough sketch of this setup is below.
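A minimal sketch of that setup, assuming a 1M-element float32 array and np.isinf as the target (the sizes and the target here are illustrative, not the actual benchmark code):

```python
import timeit
import numpy as np
import pnumpy  # importing enables the hooks

N = 1_000_000
a = np.ones(N, dtype=np.float32)
a += 0                         # touch every element via the hooked add,
                               # spreading cache-line ownership across threads
out = np.empty(N, dtype=bool)  # one output buffer, reused on every call

np.isinf(a, out=out)           # untimed dry run: warms code and data caches

t = timeit.timeit(lambda: np.isinf(a, out=out), number=100) / 100
print(f"isinf float32: {t * 1e6:.1f} us per call")
```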
The benchmark questions are for @jack-pappas
@mattip The benchmarks cover the same set of functions as the numpy benchmarks; we're not yet overriding/threading all of the ufuncs, and I'm not aware of an easy way to tell whether pnumpy is overriding a given (ufunc, dtype) pair. So the benchmarks just perform a simple parameter sweep over all ufuncs (roughly as in the sketch below), and you'll see that reflected in the numbers wherever the "threaded" version of some function shows no improvement. At the moment, we're not even attempting to accelerate calculations for float16 or any of the complex dtypes; they're only included in the benchmarks for comparison against regular numpy.
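Conceptually, the sweep looks something like this sketch (the function list and array size here are placeholders, not the actual benchmark code):

```python
import timeit
import numpy as np

dtypes = [np.int16, np.float16, np.int32, np.float32,
          np.int64, np.float64, np.complex64, np.complex128]

# Sweep a few unary ufuncs over every dtype, accelerated or not.
for name in ("isinf", "isnan", "sqrt"):
    ufunc = getattr(np, name)
    for dt in dtypes:
        a = np.ones(100_000, dtype=dt)
        try:
            t = timeit.timeit(lambda: ufunc(a), number=20) / 20
        except TypeError:  # this (ufunc, dtype) pair is unsupported
            continue
        print(f"{name:8s} {np.dtype(dt).name:12s} {t * 1e6:8.1f} us")
```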
The ufuncs for which you should expect at least some threading speedup for some (but likely not all) dtypes in the current version of pnumpy:
abs, absolute, add, arccos, arccosh, arcsin, arcsinh, arctan, arctanh, bitwise_and, bitwise_not, bitwise_or, bitwise_xor, cbrt, ceil, cos, cosh, divide, equal, exp, exp2, expm1, fmax, fmin, greater, greater_equal, invert, isfinite, isinf, isnan, left_shift, less, less_equal, log, log10, log1p, log2, logical_not, maximum, minimum, multiply, negative, not_equal, right_shift, rint, sign, signbit (I'm not 100% sure about this one), sin, sinh, sqrt, subtract, tan, tanh, true_divide, trunc
These functions aren't accelerated yet, for any dtypes, but could be in the future:
arctan2, conj, conjugate, copysign, deg2rad, degrees, divmod, fabs, float_power, floor_divide, fmod, frexp, gcd, heaviside, hypot, isnat, lcm, ldexp, logaddexp, logaddexp2, logical_and, logical_or, logical_xor, matmul, mod, modf, nextafter, positive, power, rad2deg, radians, reciprocal, remainder, spacing, square
See below for the full ufunc results from my run a few weeks ago (against 05a12715):
@mattip It looks like asv will use (by default) the forkserver "spawner" when running benchmarks on Linux, and the spawn "spawner" when running on Windows or Mac.
Can you try running the same benchmark you did above, but adding --launch-method spawn to your asv invocation? That should force asv to run the benchmarks the same way on Linux as it does on Windows/Mac. If you see good results that way, we'll know it's something specific to asv and/or something we're doing in pnumpy that's not playing nicely with forking (maybe we need to use pthread_atfork, or the Python equivalent os.register_at_fork(), to re-initialize pnumpy/atop in the child process, as sketched below).
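Something along these lines, where pn.init() stands in for whatever re-initialization entry point pnumpy exposes (an assumption, per the init-at-load discussion above); os.register_at_fork itself is standard library:

```python
import os
import pnumpy as pn

# Hypothetical: re-run pnumpy/atop initialization in any forked child,
# so fork/forkserver launches behave like spawn. pn.init() is an
# assumed entry point, not a confirmed pnumpy API.
os.register_at_fork(after_in_child=pn.init)
```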
It seems --launch-method spawn does the trick. I will submit a PR to move rank to be the last parameter, which moves dtype and nthreads to the first two dimensions of the output.
$ asv run --bench isinf --launch-method spawn
· Creating environments
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pnumpy commit aad20e60 <main>:
[ 0.00%] ·· Benchmarking virtualenv-py3.8-numpy
[ 50.00%] ··· Running (bench_ufunc.UFunc_isinf.time_ufunc_types--).
[100.00%] ··· bench_ufunc.UFunc_isinf.time_ufunc_types
ok
[100.00%] ··· ============= ========== ============= ============= ============= =============
-- atop / rank
------------------------ -------------------------------------------------------
input_dtype nthreads False / 1 False / 2 True / 1 True / 2
============= ========== ============= ============= ============= =============
int16 0 13.5±0.03μs 13.6±0.1μs 13.5±0.05μs 13.5±0.03μs
int16 2 34.2±0.3μs 32.9±1μs 26.0±6μs 26.7±8μs
int16 4 18.0±2μs 18.0±1μs 20.2±1μs 17.2±2μs
float16 0 2.11±0.01ms 2.09±0ms 2.09±0ms 2.09±0ms
float16 2 2.09±0ms 2.09±0ms 2.10±0.01ms 2.09±0ms
float16 4 2.09±0ms 2.09±0ms 2.09±0ms 2.09±0ms
int32 0 13.6±0.02μs 13.6±0.03μs 13.5±0.02μs 13.6±0.03μs
int32 2 34.2±2μs 36.4±0.6μs 35.0±2μs 37.0±0.5μs
int32 4 18.1±0.1μs 17.6±0.6μs 21.3±1μs 18.2±0.5μs
float32 0 126±0.5μs 125±0.08μs 88.0±0.1μs 88.3±0.5μs
float32 2 77.0±0.6μs 76.9±1μs 58.9±1μs 58.3±1μs
float32 4 43.1±0.4μs 43.4±2μs 33.1±2μs 34.6±3μs
int64 0 13.6±0.04μs 13.6±0.06μs 13.7±0.1μs 13.6±0.03μs
int64 2 33.4±5μs 25.8±6μs 34.4±2μs 32.5±0.6μs
int64 4 16.7±2μs 18.8±0.6μs 18.2±0.4μs 20.0±2μs
float64 0 255±2μs 255±0.8μs 180±2μs 178±1μs
float64 2 141±0.5μs 142±0.7μs 110±4μs 114±1μs
float64 4 76.9±0.3μs 75.9±1μs 55.3±0.4μs 57.5±4μs
complex64 0 805±0.3μs 806±0.9μs 806±0.6μs 806±0.8μs
complex64 2 806±2μs 806±8μs 805±0.7μs 806±0.9μs
complex64 4 812±7μs 807±3μs 810±5μs 805±0.2μs
longfloat 0 6.02±0ms 6.03±0ms 6.03±0.02ms 6.03±0ms
longfloat 2 6.04±0.01ms 6.04±0.04ms 6.03±0ms 6.06±0.03ms
longfloat 4 6.03±0.01ms 6.02±0.01ms 6.04±0.01ms 6.03±0.01ms
complex128 0 1.02±0.02ms 1.09±0.04ms 1.07±0.03ms 1.04±0.01ms
complex128 2 1.03±0.01ms 1.04±0.01ms 1.04±0.01ms 1.05±0.01ms
complex128 4 1.05±0.01ms 1.03±0.01ms 1.00±0ms 1.12±0.1ms
complex256 0 11.6±0.01ms 11.6±0.02ms 11.6±0.02ms 11.6±0.01ms
complex256 2 11.6±0.01ms 11.6±0.01ms 11.6±0.01ms 11.6±0.01ms
complex256 4 11.6±0.03ms 11.6±0.01ms 11.6±0.04ms 11.7±0.1ms
============= ========== ============= ============= ============= =============
Related to the earlier comment about initializing pnumpy: it is now initialized just before the call to thread_setworkers(num_threads), as sketched below. Please try running the benchmarks from the top directory with this changeset. Note how the benchmarks do not change with the number of threads.
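For clarity, the ordering described is roughly this; pn.init() is a stand-in name for the initialization step (an assumption), while thread_setworkers is the call named above, assumed here to be exposed at module level:

```python
import pnumpy as pn

# Ordering described above: initialize right before setting the worker
# count. pn.init() is a stand-in name; thread_setworkers(num_threads)
# is the call named in the comment.
num_threads = 4
pn.init()
pn.thread_setworkers(num_threads)
```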