Hi,

I've been testing dpnp on CPU with some standard NaN functions (nan_to_num and nansum), and my results show that dpnp is quite slow single-threaded compared to NumPy. Do you have any insight into why this might be the case (e.g., something specific dpnp does when handling NaNs), and/or whether there is a fix for it?
Here are scalability plots (number of threads vs. running time) comparing dpnp with NumPy (and Numba), for nan_to_num and nansum respectively:
While dpnp's scaling looks good, its single-threaded performance is almost an order of magnitude worse. The test environment is an Intel Xeon Platinum 8380 (80 threads). Both tests were run on arrays of 6.4e9 float32 elements; each timing is the median over 10 runs, with the first run discarded so the cache is warm.
Here is the code that generates the input array for all of these tests:
import numpy as np
from numpy.random import default_rng

N = 80000
rng = default_rng(42)
# 6.4e9 uniform float32 values in [0, 1000)
array_1 = rng.uniform(0, 1000, size=(N * N,)).astype(np.float32)
# overwrite N // 10 = 8000 distinct positions with NaN,
# drawing from the seeded generator so runs are reproducible
N_nan = N // 10
nan_choice = rng.choice(N * N, size=N_nan, replace=False)
array_1[nan_choice] = np.nan
array_1 = array_1.reshape((N, N))
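As a quick sanity check (not part of the timings), the same generation code at a smaller N confirms the array contains exactly N // 10 NaNs; this small-N version is illustrative only:

```python
import numpy as np
from numpy.random import default_rng

# Smaller N than the benchmark (80000) so this runs quickly.
N = 1000
rng = default_rng(42)
array_1 = rng.uniform(0, 1000, size=(N * N,)).astype(np.float32)
N_nan = N // 10
# replace=False guarantees distinct indices, so the NaN count is exact
nan_choice = rng.choice(N * N, size=N_nan, replace=False)
array_1[nan_choice] = np.nan
array_1 = array_1.reshape((N, N))

assert int(np.isnan(array_1).sum()) == N_nan
```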
For dpnp, I ran array_1 = dpnp.asarray(array_1, device="cpu") before starting the tests (this conversion is not included in the timings). The timings measured only array_out = np.nan_to_num(array_1) or array_out = dpnp.nan_to_num(array_1) (and similarly for nansum).
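For reference, the measurement procedure (warm-up run discarded, then median of 10 runs) can be sketched as below; the bench helper and the small array size are illustrative, and the dpnp variant is obtained by timing dpnp.nan_to_num / dpnp.nansum on the converted array instead:

```python
import time
import numpy as np

def bench(fn, arr, runs=10):
    """Time fn(arr) runs+1 times, discard the first (warm-up) call,
    and return the median of the remaining times in seconds."""
    times = []
    for _ in range(runs + 1):
        t0 = time.perf_counter()
        fn(arr)
        times.append(time.perf_counter() - t0)
    return float(np.median(times[1:]))  # drop the warm-up run

# Illustrative small array; the real benchmark used N = 80000.
arr = np.ones((1000, 1000), dtype=np.float32)
arr[::7, ::7] = np.nan

t_nan_to_num = bench(np.nan_to_num, arr)
t_nansum = bench(np.nansum, arr)
```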
Any help is much appreciated -- thanks!
Best, Jessica