Open jack-pappas opened 3 years ago
I ran the benchmarks on an Intel machine after running sudo pyperf system tune, but did not see any improvement when activating multiple threads. Here are the machine.json and the compressed .asv/results directory.
{
"arch": "x86_64",
"cpu": "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz",
"machine": "benchmarker",
"num_cpu": "8",
"os": "Linux 4.15.0-74-generic",
"ram": "65748452",
"version": 1
}
The benchmarks ran for 2 hours on this machine.
@jack-pappas @tdimitri: any thoughts why I do not see an improvement?
Matti, did you do...
pn.init()
pn.benchmark()
What numbers does it return? Also, there is now a parallel lexsort and a parallel sort.
No, I followed the instructions in the benchmarks README:
asv run
Here is my result for pn.benchmark():
>>> pn.benchmark()
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 0.99, 1.00, 1.00, 1.15, 1.01, 1.15, 1.02,
a==5 , 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.02,
a+b, 1.01, 1.00, 1.00, 1.06, 1.01, 0.97, 1.00,
a+5, 1.13, 1.00, 1.01, 1.00, 1.07, 1.02, 1.05,
a/5, 1.00, 1.00, 1.00, 0.99, 1.00, 1.00, 1.00,
abs, 1.00, 1.00, 1.00, 0.93, 0.98, 1.00, 1.08,
isnan, 1.00, 1.01, 1.01, 1.00, 1.01, 1.02, 0.99,
sin, 1.00, 0.99, 1.00, 1.00, 1.00, 0.98, 1.00,
log, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
sum, 1.00, 1.00, 1.00, 1.00, 1.02, 1.00, 1.02,
min, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
Ahh, hang on, after pn.init() it gets better:
>>> pn.init()
>>> pn.benchmark()
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 6.79, 2.58, 2.59, 3.29, 6.67, 2.45, 6.14,
a==5 , 4.71, 1.81, 1.87, 3.00, 4.69, 1.97, 2.64,
a+b, 9.37, 2.31, 2.46, 3.14, 9.44, 2.89, 9.20,
a+5, 4.12, 2.33, 2.16, 2.75, 4.23, 1.85, 4.78,
a/5, 0.72, 0.86, 0.87, 0.91, 0.70, 4.08, 6.99,
abs, 4.02, 5.83, 6.53, 3.16, 4.00, 9.85,11.18,
isnan, 0.79, 0.70, 0.80, 0.74, 0.80, 1.96, 2.73,
sin, 4.30, 3.88, 3.95, 8.81, 5.32,21.15,60.16,
log, 1.25, 2.13, 2.17, 1.30, 1.58, 6.39, 3.05,
sum, 8.28, 1.01, 1.04, 1.00, 9.61, 6.45, 5.44,
min, 3.65,41.85,41.73,31.00, 3.66, 1.93, 2.64,
Why isn't that reflected in the ASV results?
I will check with Jack and review his benchmark. I did not work with him on it, and I apologize for any confusion. The benchmarks are hard because we have not yet hooked the "initialization" functions (ones, zeros, arange, etc.). We also have not hooked the copy functions (copy, copy with mask, etc.) or the conversion functions. I spent the last 10 hours trying to figure out how to hook the conversion functions by calling PyArray_RegisterCastFunc, but that does not seem to work yet.
Your numbers above look good and are as expected. The one dip is in integer division, because it converts from int to float64 and does so in the main thread, so the other cores cannot help with that step. That is why I am trying to hook more functions.
Ideally divide would "convert and divide" on the fly, but we cannot hook that right now either.
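To make the integer-division point concrete, here is a minimal sketch using plain NumPy only. It shows the dtype promotion that is visible from Python: true division of an integer array produces float64, while floor division keeps the integer dtype. It does not show where pnumpy performs the conversion internally; the array contents and variable names are chosen purely for illustration.

```python
import numpy as np

# A small int32 array; the values are arbitrary illustration data.
a = np.arange(1, 7, dtype=np.int32)

# True division of integers yields float64, so an int -> float64
# conversion has to happen somewhere before or during the divide.
quotient = a / 5

# Floor division stays in the integer dtype, so no float conversion occurs.
floored = a // 5

# Converting explicitly first gives the same values as the true division.
explicit = a.astype(np.float64) / 5

print(quotient.dtype, floored.dtype)
```

This is consistent with the a/5 row above, where the integer columns are the ones that fall below 1.00 while float32 and float64 speed up.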
On a good note, there is pn.getitem(), which acts like a[b] when a is an array and b is a boolean or fancy index array. It runs in parallel. On another good note, I have reviewed so much low-level numpy internal code that I understand it better and can at least suggest hooks.
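For reference, the two a[b] forms mentioned above look like this in plain NumPy. This is only a sketch of the indexing semantics; it deliberately does not call pn.getitem(), since its exact signature is not given in this thread, and the array values are made up for illustration.

```python
import numpy as np

# Illustration data: 0, 10, 20, ..., 90
a = np.arange(10) * 10

# Boolean indexing: b is a boolean array with the same shape as a.
mask = a > 50
selected_by_mask = a[mask]

# Fancy indexing: b is an integer array of positions (repeats are allowed).
positions = np.array([0, 3, 3, 9])
selected_by_position = a[positions]

print(selected_by_mask, selected_by_position)
```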
We're calling pn.initialize() within the ASV benchmarks: https://github.com/Quansight/numpy-threading-extensions/blob/97c60ed86fa105e18e1b5d2373576694863787be/benchmarks/bench_ufunc.py#L19
The current version of pn.initialize() just calls pn.init(): https://github.com/Quansight/numpy-threading-extensions/blob/97c60ed86fa105e18e1b5d2373576694863787be/src/pnumpy/__init__.py#L56
@mattip One thing that could be causing this: I ran the latest benchmark code on Windows, and you're running it on Linux. asv supports running benchmarks in individual subprocesses, and (I'm speculating) it may be doing that by default on Linux but not on Windows, or asv is defaulting to a different approach for it on Windows vs. Linux. If that's the case, maybe we need to move the pn.initialize() call at the top of the bench_ufunc.py file, or e.g. have pnumpy auto-initialize when imported, or detect when it's been forked (after pn.initialize() has been called) and re-initialize.
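One way the "detect a fork and re-initialize" idea could look is sketched below. This is not pnumpy's code: ForkAwareInitializer and ensure_initialized are hypothetical names, and the init callable stands in for something like pn.init(). The sketch assumes that comparing the current process ID with the one recorded at initialization is enough to notice that the code is now running in a child process.

```python
import os


class ForkAwareInitializer:
    """Hypothetical helper: runs init_func once per process ID."""

    def __init__(self, init_func):
        self._init_func = init_func
        # PID of the process in which init_func last ran; None means never.
        self._initialized_pid = None

    def ensure_initialized(self):
        """Run init_func if this process has not run it yet.

        Returns True if init_func was called, False if it was skipped.
        """
        current_pid = os.getpid()
        if self._initialized_pid == current_pid:
            return False
        # Either never initialized, or we are in a forked child whose
        # worker threads were not carried over from the parent.
        self._init_func()
        self._initialized_pid = current_pid
        return True


# Stand-in for the real initialization call; it only records the PID.
calls = []
guard = ForkAwareInitializer(lambda: calls.append(os.getpid()))

first = guard.ensure_initialized()   # initializes in this process
second = guard.ensure_initialized()  # same PID, so it is skipped
```

In a benchmark suite, a guard like this could be called from each benchmark's setup so that it runs in whichever process actually executes the timing code. On Unix, os.register_at_fork is an alternative way to hook the same event.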
Per @mattip, create a 'benchmarking' page in the documentation. The page should include the following information: