Quansight / pnumpy

Parallel NumPy seamlessly speeds up NumPy for large arrays (64K+ elements) with no change required to existing code.
https://quansight.github.io/pnumpy/
MIT License

something off with bitwise_or #19

Open tdimitri opened 4 years ago

tdimitri commented 4 years ago

The speed of bitwise_or seems off compared to what I see when I test with riptable.

On my computer, in riptable, the bitwise_or of 1 million int32 takes 59 microseconds. When I use the same code to speed it up for numpy (we want numpy as fast as riptable or faster, and that's where I am putting my energy), I get 1.51 ms.

Something is off, but I am not sure what. I suspect something internal to numpy. It could be that the hook messed it up, but I tried without hooking and got the same slower speed.

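import numpy as np

a = np.arange(1_000_000)     # default integer dtype (platform dependent)
b = a.copy()
np.bitwise_or(a, a)          # allocates a new output array
np.bitwise_or(a, a, out=b)   # writes into a preallocated output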
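To time the snippet above outside IPython, something like the following could be used. It is only a rough sketch: the helper name `time_bitwise_or` and the repeat counts are my own choices, not part of pnumpy or riptable, and the numbers it prints will differ from machine to machine.

```python
import timeit

import numpy as np


def time_bitwise_or(dtype, n=1_000_000, repeat=3, number=20):
    """Best per-call time in microseconds for np.bitwise_or on n elements."""
    # astype() wraps values that do not fit in narrow dtypes such as int8.
    a = np.arange(n).astype(dtype)
    out = np.empty_like(a)
    timer = timeit.Timer(lambda: np.bitwise_or(a, a, out=out))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


for dtype in (np.int8, np.int16, np.int32, np.uint32, np.int64, np.uint64):
    print(f"{np.dtype(dtype).name:>6}: {time_bitwise_or(dtype):8.1f} us")
```

Running it once before and once after the hook is installed gives a per-dtype comparison without relying on `%timeit`.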

More information on this: it works fine for int8, int16, and int64, but int32 and uint32 are somehow different. I get called back for int8/16/64, as I can see from the speeds below, but I do not get called back for int32. I am also not sure why numpy takes 503 µs for int8 and 589 µs for int64, which is 8 times larger (and so should take about 8 times longer), but I am not sure we care, since this will be solved once the loops are properly taken over.
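One thing that might be worth checking, and this is a guess rather than anything confirmed in this thread: NumPy keeps separate type codes for C `int` and C `long`, and on platforms where both are 32 bits wide an int32 array can be dispatched to a different inner loop than the one a hook replaced. The sketch below only uses public NumPy attributes (`dtype.char`, `dtype.num`, `ufunc.types`) to show which loop signature each dtype maps to; the helper name `loop_report` is made up for illustration.

```python
import numpy as np


def loop_report(ufunc=np.bitwise_or):
    """List (name, type char, type number, itemsize, has matching loop) per dtype."""
    rows = []
    for name in ("int8", "int16", "int32", "uint32", "int64", "uint64"):
        dt = np.dtype(name)
        # Binary ufunc loops are registered under signatures such as 'ii->i'.
        sig = f"{dt.char}{dt.char}->{dt.char}"
        rows.append((name, dt.char, dt.num, dt.itemsize, sig in ufunc.types))
    return rows


for row in loop_report():
    print(row)
```

If the type char printed for int32 differs from the one the hook was registered for, that would be consistent with the callback never firing for that dtype.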

In [1]: import numpy as np

In [2]: a=np.arange(1_000_000, dtype=np.int8)

In [3]: %timeit np.bitwise_or(a,a, out=a)
503 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: a=np.arange(1_000_000, dtype=np.int16)

In [5]: %timeit np.bitwise_or(a,a, out=a)
510 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: a=np.arange(1_000_000, dtype=np.int32)

In [7]: %timeit np.bitwise_or(a,a, out=a)
511 µs ± 7.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: a=np.arange(1_000_000, dtype=np.int64)

In [9]: %timeit np.bitwise_or(a,a, out=a)
589 µs ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: a=np.arange(1_000_000, dtype=np.uint32)

In [11]: %timeit np.bitwise_or(a,a, out=a)
506 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [12]: import numpy as np; import _fast_numpy_loops as fa

In [13]: fa.initialize()
taking over func add
taking over func subtract
taking over func multiply
taking over func true_divide
taking over func floor_divide
taking over func power
taking over func remainder
taking over func logical_and
taking over func logical_or
taking over func bitwise_and
taking over func bitwise_or
taking over func bitwise_xor

In [14]: %timeit np.bitwise_or(a,a, out=a)
--> 514 µs ± 7.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
** I don't think the hook is getting called here (a is still the uint32 array from In [10])

In [15]: a=np.arange(1_000_000, dtype=np.uint64)

In [16]: %timeit np.bitwise_or(a,a, out=a)
82.1 µs ± 2.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [17]: a=np.arange(1_000_000, dtype=np.int8)

In [18]: %timeit np.bitwise_or(a,a, out=a)
30.5 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
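To put the numbers from the session above side by side, here is a small arithmetic sketch. The timings are copied from the log (means only, in microseconds); only the two dtypes measured both before and after `fa.initialize()` are included, and the dictionary names are just for illustration.

```python
# Mean timings in microseconds, copied from the IPython session above.
before = {"int8": 503.0, "uint32": 506.0}   # stock NumPy loops
after = {"int8": 30.5, "uint32": 514.0}     # after fa.initialize()

speedup = {name: before[name] / after[name] for name in before}

for name, factor in speedup.items():
    status = "taken over" if factor > 2 else "apparently not taken over"
    print(f"{name:>6}: {factor:5.2f}x ({status})")
```

The factor-of-2 threshold is arbitrary. It is only there to separate the int8 case, which speeds up by more than an order of magnitude, from the uint32 case, which does not change.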

mattip commented 3 years ago

Is this slowdown still relevant?