oleksandr-pavlyk closed this 1 month ago
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. :crossed_fingers:
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_72 ran successfully. Passed: 894 Failed: 1 Skipped: 119
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_73 ran successfully. Passed: 894 Failed: 1 Skipped: 119
Examples:
import dpctl.tensor as dpt
x = dpt.ones((3, 10, 10), order='F')
y = dpt.empty_like(x, order='C')
# now uses generic kernel to copy to contiguous destination
y[:] = x
x2 = dpt.moveaxis(dpt.ones((10, 10, 3), order='F'), 2, 0)
# x2 has shape (3, 10, 10) and strides (100, 1, 10), so it is a batch of
# F-contig square matrices, and the following code uses the faster
# dedicated kernel for copying
y2 = dpt.asarray(x2, order='C')
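The stride claim in the comment above can be checked by hand. The following is a minimal pure-Python sketch that does not use dpctl: the helpers `f_contig_strides`, `moveaxis`, and `is_batch_of_f_contig_square` are hypothetical names introduced for illustration, and the predicate is only one way to express "batch of F-contig square matrices", not the dispatch logic used in this PR.

```python
def f_contig_strides(shape):
    """Element strides of an F-contiguous array with the given shape."""
    strides, step = [], 1
    for extent in shape:
        strides.append(step)
        step *= extent
    return tuple(strides)


def moveaxis(seq, source, destination):
    """Reorder a per-axis tuple the way moving a single axis would."""
    items = list(seq)
    items.insert(destination, items.pop(source))
    return tuple(items)


def is_batch_of_f_contig_square(shape, strides):
    """True if a 3D array is a stack (along axis 0) of F-contiguous square matrices."""
    if len(shape) != 3:
        return False
    _, n, m = shape
    return n == m and strides == (n * m, 1, n)


base_shape = (10, 10, 3)
base_strides = f_contig_strides(base_shape)   # strides of dpt.ones(base_shape, order='F')

# Same permutation as dpt.moveaxis(..., 2, 0), applied to shape and strides
x2_shape = moveaxis(base_shape, 2, 0)
x2_strides = moveaxis(base_strides, 2, 0)

print(x2_shape, x2_strides)
print(is_batch_of_f_contig_square(x2_shape, x2_strides))
```

By contrast, `dpt.ones((3, 10, 10), order='F')` has strides `(1, 3, 30)`, so the first example above is not a batch of F-contig matrices even though its shape is the same.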
Here is a demonstration on a laptop with an Iris Xe integrated GPU:
Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import dpctl.tensor as dpt
In [2]: x = dpt.ones((3, 1000, 1000), order='F');
In [3]: y = dpt.empty_like(x, order='C');
In [4]: %timeit y[:] = x; y.sycl_queue.wait()
2.23 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit y[:] = x; y.sycl_queue.wait()
2.24 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: x2 = dpt.moveaxis(dpt.ones((1000, 1000, 3), order='F'), 2, 0)
In [7]: y2 = dpt.empty_like(x2, order='C')
In [8]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.32 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.3 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: x3 = dpt.ones((3, 1000, 1000), order='F', dtype="i4")
In [11]: y3 = dpt.empty_like(x3, order='C', dtype="u4")
In [12]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.31 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [13]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.33 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On GPU Max, the difference between the timings in In[12]/In[13] (which are about the same as the legacy timing before this PR) and In[4]/In[5] is more pronounced (25%), as is the difference between In[12]/In[13] and In[8]/In[9].
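For reference, the Iris Xe numbers from the session above can be reduced to ratios against the In[12]/In[13] timing. This is only arithmetic on the reported `%timeit` means; the dictionary keys are labels chosen here for illustration, and the figures come from a single laptop run.

```python
from statistics import mean

# Mean times in milliseconds, copied from the %timeit output above
timings_ms = {
    "contig_dst": [2.23, 2.24],    # In[4], In[5]: same dtype, C-contiguous destination
    "square_batch": [1.32, 1.30],  # In[8], In[9]: batch of F-contig square matrices
    "cast": [2.31, 2.33],          # In[12], In[13]: i4 to u4, used as the baseline
}

avg = {name: mean(vals) for name, vals in timings_ms.items()}
baseline = avg["cast"]
ratio = {name: baseline / t for name, t in avg.items()}

for name in timings_ms:
    print(f"{name}: {avg[name]:.3f} ms, baseline/time = {ratio[name]:.2f}")
```

On this laptop the baseline copy takes about 4% longer than the same-dtype copy to a contiguous destination, and about 77% longer than the square-batch copy.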
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_74 ran successfully. Passed: 894 Failed: 1 Skipped: 119
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully. Passed: 895 Failed: 0 Skipped: 119
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully. Passed: 895 Failed: 0 Skipped: 119
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully. Passed: 894 Failed: 1 Skipped: 119
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_77 ran successfully. Passed: 895 Failed: 0 Skipped: 119
All dpnp tests passed using this branch.
This PR adds specialized kernels to copy usm_ndarray to C-/F-contiguous destinations of the same shape and the same dtype. It also adds dedicated kernels to copy batches of square matrices that are views of F-contig matrices to C-contiguous destinations, and batches of square matrices that are views of C-contig matrices to F-contiguous destinations. The intended use is to speed up conversion of a C-contig batch of square matrices to an F-contig batch of square matrices.
Tests are added.
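To make the batch-of-square-matrices case concrete, here is a pure-Python sketch of the index mapping such a copy has to perform. It is a reference model on flat lists only: `copy_batch_f_to_c` is a hypothetical helper, and it says nothing about how the device kernels in this PR are actually implemented.

```python
def copy_batch_f_to_c(src, batch, n):
    """Copy `batch` F-contiguous n-by-n matrices stored back to back in `src`
    into a new flat list holding the same matrices in C-contiguous layout."""
    dst = [None] * (batch * n * n)
    for b in range(batch):
        base = b * n * n  # offset of matrix b is the same in both layouts
        for i in range(n):
            for j in range(n):
                # Element (i, j): F layout offset is i + j*n, C layout offset is i*n + j
                dst[base + i * n + j] = src[base + i + j * n]
    return dst


batch, n = 2, 3
src = list(range(batch * n * n))
dst = copy_batch_f_to_c(src, batch, n)
print(dst)

# For square matrices the same index mapping also converts C layout back to F layout
print(copy_batch_f_to_c(dst, batch, n) == src)
```

Because the matrices are square, the mapping is its own inverse, which is why the two directions described above (F-contig source to C-contiguous destination, and C-contig source to F-contiguous destination) are mirror images of each other.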