Quansight / pnumpy

Parallel NumPy seamlessly speeds up NumPy for large arrays (64K+ elements) with no change required to existing code.
https://quansight.github.io/pnumpy/
MIT License

Create 'benchmarking' section of documentation #110

Open jack-pappas opened 3 years ago

jack-pappas commented 3 years ago

Per @mattip, create a 'benchmarking' page in the documentation. The page should include the following information:

mattip commented 3 years ago

I ran the benchmarks on an Intel machine after running sudo pyperf system tune, but did not see any improvement when activating multiple threads. Here is the machine.json and the compressed .asv/results directory.

{
    "arch": "x86_64",
    "cpu": "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz",
    "machine": "benchmarker",
    "num_cpu": "8",
    "os": "Linux 4.15.0-74-generic",
    "ram": "65748452",
    "version": 1
}

benchmarker.tar.gz

mattip commented 3 years ago

The benchmarks ran for 2 hours on this machine

mattip commented 3 years ago

@jack-pappas @tdimitri: any thoughts why I do not see an improvement?

tdimitri commented 3 years ago

Matti, did you do...

pn.init()
pn.benchmark()

What numbers are returned? Also, there is now a parallel lexsort and a parallel sort.

mattip commented 3 years ago

No, I followed the instructions in the benchmarks README:

asv run

Here is my result for pn.benchmark():

>>> pn.benchmark()
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 0.99, 1.00, 1.00, 1.15, 1.01, 1.15, 1.02,
a==5 , 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.02,
a+b, 1.01, 1.00, 1.00, 1.06, 1.01, 0.97, 1.00,
a+5, 1.13, 1.00, 1.01, 1.00, 1.07, 1.02, 1.05,
a/5, 1.00, 1.00, 1.00, 0.99, 1.00, 1.00, 1.00,
abs, 1.00, 1.00, 1.00, 0.93, 0.98, 1.00, 1.08,
isnan, 1.00, 1.01, 1.01, 1.00, 1.01, 1.02, 0.99,
sin, 1.00, 0.99, 1.00, 1.00, 1.00, 0.98, 1.00,
log, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
sum, 1.00, 1.00, 1.00, 1.00, 1.02, 1.00, 1.02,
min, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
mattip commented 3 years ago

Ahh, hang on, after pn.init() it gets better:

>>> pn.init()
>>> pn.benchmark()
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 6.79, 2.58, 2.59, 3.29, 6.67, 2.45, 6.14,
a==5 , 4.71, 1.81, 1.87, 3.00, 4.69, 1.97, 2.64,
a+b, 9.37, 2.31, 2.46, 3.14, 9.44, 2.89, 9.20,
a+5, 4.12, 2.33, 2.16, 2.75, 4.23, 1.85, 4.78,
a/5, 0.72, 0.86, 0.87, 0.91, 0.70, 4.08, 6.99,
abs, 4.02, 5.83, 6.53, 3.16, 4.00, 9.85,11.18,
isnan, 0.79, 0.70, 0.80, 0.74, 0.80, 1.96, 2.73,
sin, 4.30, 3.88, 3.95, 8.81, 5.32,21.15,60.16,
log, 1.25, 2.13, 2.17, 1.30, 1.58, 6.39, 3.05,
sum, 8.28, 1.01, 1.04, 1.00, 9.61, 6.45, 5.44,
min, 3.65,41.85,41.73,31.00, 3.66, 1.93, 2.64,
mattip commented 3 years ago

Why isn't that reflected in the ASV results?

tdimitri commented 3 years ago

I will check with Jack and review his benchmark; I did not work with him on it, and I apologize for any confusion. The benchmarks are hard because we have not yet hooked the "initialization" functions (like ones, zeros, arange, etc.). We also have not hooked the copy functions (copy with mask, etc.) or the conversion functions. I spent the last 10 hours trying to figure out how to hook the conversion functions by calling PyArray_RegisterCastFunc, but it does not seem to work yet.

Your numbers above look good and as expected. One dip is division of integers: it converts from int to float64, and that conversion runs in the main thread, so the other cores sit idle... which is why I am trying to hook more functions.

Ideally divide would "convert and divide" on the fly... but we also cannot hook that right now.
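
As a small illustration of the int → float64 conversion described above (the claim that the conversion runs single-threaded is Todd's explanation, not something this snippet measures):

import numpy as np

a = np.arange(1_000_000, dtype=np.int64)

# True division of integers produces float64 output, so NumPy first has to
# convert the int64 input before the divide loop can run.
print((a / 5).dtype)        # float64

# Pre-converting the input lets the float64 divide loop do all the work.
b = a.astype(np.float64)
print((b / 5).dtype)        # float64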

On a good note... there is pn.getitem(), which acts like a[b] when a is an array and b is a boolean or fancy-index array, and it runs in parallel. On another good note... I have reviewed so much NumPy internal low-level code that I understand it better and can at least suggest hooks.
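
A rough sketch of how pn.getitem() might be used, based only on the description above; the two-argument call form is an assumption about its signature:

import numpy as np
import pnumpy as pn

pn.init()

a = np.arange(1_000_000, dtype=np.float64)
mask = a % 3 == 0                  # boolean index
idx = np.arange(0, 1_000_000, 7)   # fancy (integer) index

# Described above as behaving like a[b], but running in parallel.
out_mask = pn.getitem(a, mask)     # equivalent of a[mask]
out_idx = pn.getitem(a, idx)       # equivalent of a[idx]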

jack-pappas commented 3 years ago

We're calling pn.initialize() within the ASV benchmarks: https://github.com/Quansight/numpy-threading-extensions/blob/97c60ed86fa105e18e1b5d2373576694863787be/benchmarks/bench_ufunc.py#L19

The current version of pn.initialize() just calls pn.init(): https://github.com/Quansight/numpy-threading-extensions/blob/97c60ed86fa105e18e1b5d2373576694863787be/src/pnumpy/__init__.py#L56
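
For context, a minimal sketch of that pattern, following asv's convention that setup() runs before each timed method; the class name and array sizes here are invented for illustration and are not copied from bench_ufunc.py:

import numpy as np
import pnumpy as pn

class TimeUFuncs:
    # asv calls setup() before timing each time_* method in this class.
    def setup(self):
        pn.initialize()            # enable pnumpy's threaded loops
        self.a = np.ones(1_000_000, dtype=np.float64)
        self.b = np.ones(1_000_000, dtype=np.float64)

    def time_add(self):
        self.a + self.b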

jack-pappas commented 3 years ago

@mattip One thing that could be causing this: I ran the latest benchmark code on Windows, and you're running it on Linux. asv supports running benchmarks in individual subprocesses, and (I'm speculating) it may do that by default on Linux but not on Windows, or it may default to a different approach on each platform. If that's the case, maybe we need to move the pn.initialize() call to the top of the bench_ufunc.py file, or e.g. have pnumpy auto-initialize when imported, or have it detect when it's been forked (after pn.initialize() has been called) and re-initialize.
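
A minimal sketch of what the "detect a fork and re-initialize" option could look like, using the standard-library os.register_at_fork hook; whether pn.init() can safely be called again in a forked child is an assumption here:

import os
import pnumpy as pn

def _reinit_after_fork():
    # Worker threads do not survive fork(), so restart pnumpy in the child.
    pn.init()   # assumption: init() may be called again in the child process

pn.init()                                            # initialize in the parent
os.register_at_fork(after_in_child=_reinit_after_fork)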