Dedicated code to copy array to C-contig/F-contig destinations

oleksandr-pavlyk commented 1 month ago

This PR adds specialized kernels to copy a usm_ndarray to a C- or F-contiguous destination of the same shape and the same dtype.

It also adds dedicated kernels to copy batches of square matrices that are views of F-contig matrices to C-contiguous destinations, and batches of square matrices that are views of C-contig matrices to F-contiguous destinations. The intended use is to speed up conversion of a C-contig batch of square matrices to an F-contig batch of square matrices.

Tests are added.

[x] Have you provided a meaningful PR description?
[x] Have you added a test, reproducer or referred to an issue with a reproducer?
[x] Have you tested your changes locally for CPU and GPU devices?
[x] Have you made sure that new changes do not introduce compiler warnings?
[x] Have you checked performance impact of proposed changes?
[ ] Have you added documentation for your changes, if necessary?
[x] Have you added your changes to the changelog?
[ ] If this PR is a work in progress, are you opening the PR as a draft?

github-actions[bot] commented 1 month ago

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. :crossed_fingers:

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_72 ran successfully. Passed: 894 Failed: 1 Skipped: 119

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_73 ran successfully. Passed: 894 Failed: 1 Skipped: 119

coveralls commented 1 month ago

coverage: 87.907%. remained the same when pulling d0882278dcf7fb6c0e2ebfbce567cb2d6e69e65a on add-as-contig-specialization into 4d3ddf9de6359a69576551f7f74e9cf682d03201 on master.

oleksandr-pavlyk commented 1 month ago

Examples:

import dpctl.tensor as dpt

x = dpt.ones((3, 10, 10), order='F')
y = dpt.empty_like(x, order='C')
# now uses the generic kernel for copying to a contiguous destination
y[:] = x

x2 = dpt.moveaxis(dpt.ones((10, 10, 3), order='F'), 2, 0)
# Because x2 has shape (3, 10, 10) and strides (100, 1, 10),
# x2 is a batch of F-contig square matrices, and the following code
# uses the faster kernel for copying
y2 = dpt.asarray(x2, order='C')
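The layout claim in the comment above can be checked without a SYCL device. The following is a small sketch that uses NumPy as a stand-in for dpctl.tensor; it assumes NumPy's `moveaxis` produces the same view layout, and it converts NumPy's byte strides to element strides, which is the unit usm_ndarray reports.

```python
import numpy as np

# Same construction as x2 above, with NumPy standing in for dpctl.tensor.
x2 = np.moveaxis(np.ones((10, 10, 3), order="F"), 2, 0)

# NumPy reports strides in bytes; divide by itemsize to get element strides.
elem_strides = tuple(s // x2.itemsize for s in x2.strides)

# Each matrix in the batch is itself F-contiguous.
each_matrix_f_contig = all(
    x2[i].flags.f_contiguous for i in range(x2.shape[0])
)

# Converting to C order yields a contiguous copy with identical values.
y2 = np.asarray(x2, order="C")

print(x2.shape, elem_strides, each_matrix_f_contig, y2.flags.c_contiguous)
```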

Here is a demonstration on a laptop with an Iris Xe integrated GPU:

Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((3, 1000, 1000), order='F');

In [3]: y = dpt.empty_like(x, order='C');

In [4]: %timeit y[:] = x; y.sycl_queue.wait()
2.23 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit y[:] = x; y.sycl_queue.wait()
2.24 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: x2 = dpt.moveaxis(dpt.ones((1000, 1000, 3), order='F'), 2, 0)

In [7]: y2 = dpt.empty_like(x2, order='C')

In [8]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.32 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.3 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: x3 = dpt.ones((3, 1000, 1000), order='F', dtype="i4")

In [11]: y3 = dpt.empty_like(x3, order='C', dtype="u4")

In [12]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.31 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.33 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On GPU Max, the difference between the timings in In[12]/In[13] (about the same as the legacy timings before this PR) and In[4]/In[5] is more pronounced (25%), as is the difference between In[12]/In[13] and In[8]/In[9].
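For reproducing such comparisons outside IPython, here is a rough sketch of a timing helper. The names `time_copy` and `relative_gain` are hypothetical, and the helper reports the best per-loop time rather than the mean that `%timeit` prints. The last two lines apply `relative_gain` to the Iris Xe means quoted in the session above.

```python
import time


def time_copy(copy_fn, sync_fn, repeats=7, loops=100):
    """Return the best per-loop wall time (seconds) of copy_fn plus sync_fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            copy_fn()
            sync_fn()  # wait for the device so the copy is actually finished
        best = min(best, (time.perf_counter() - start) / loops)
    return best


def relative_gain(baseline, candidate):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline - candidate) / baseline


# With dpctl arrays x and y this could be invoked as (not executed here):
#   time_copy(lambda: y.__setitem__(slice(None), x), y.sycl_queue.wait)

# Iris Xe means from the session above, in milliseconds.
gain_contig = relative_gain(2.31, 2.23)  # In[12] vs In[4]
gain_square = relative_gain(2.31, 1.32)  # In[12] vs In[8]
print(f"{gain_contig:.1%} {gain_square:.1%}")
```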

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_74 ran successfully. Passed: 894 Failed: 1 Skipped: 119

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully. Passed: 895 Failed: 0 Skipped: 119

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully. Passed: 895 Failed: 0 Skipped: 119

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully. Passed: 895 Failed: 0 Skipped: 119

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully. Passed: 894 Failed: 1 Skipped: 119

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_77 ran successfully. Passed: 895 Failed: 0 Skipped: 119

vtavana commented 1 month ago

All dpnp tests passed using this branch.

IntelPython / dpctl

Dedicated code to copy array to C-contig/F-contig destinations #1850