It also improves performance in some special cases, such as multiplying two large Fortran-ordered (column-major) matrices:
>>> import dpnp
>>> size = 4096
>>> device = "gpu"
>>> a = dpnp.ones((size, size), order="F", device=device)
>>> b = dpnp.ones((size, size), order="F", device=device)
>>> %timeit dpnp.matmul(a, b)
New implementation:
Iris Xe: 142 ms ± 6.03 ms
Intel Core: 1.81 s ± 383 ms
Old dpnp:
Iris Xe: 156 ms ± 3.38 ms
Intel Core: 2.07 s ± 69.2 ms
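To make the size of the improvement easier to read, the mean timings above can be turned into a speedup ratio. The following is only a rough sketch: the `timings` dictionary and the `speedup` helper are illustrative names, not part of dpnp, and only the mean values are used, so the reported spread is ignored.

```python
# Mean timings copied from the results above, converted to seconds.
# The "±" spread from %timeit is deliberately left out of this sketch.
timings = {
    "Iris Xe": {"old": 156e-3, "new": 142e-3},
    "Intel Core": {"old": 2.07, "new": 1.81},
}


def speedup(old, new):
    """Return how many times faster the new mean timing is than the old one."""
    return old / new


for device, t in timings.items():
    ratio = speedup(t["old"], t["new"])
    saved = (1 - t["new"] / t["old"]) * 100  # share of the old run time saved
    print(f"{device}: {ratio:.2f}x speedup, {saved:.1f}% less time")
```

On these means the new implementation is roughly 1.10x faster on Iris Xe and roughly 1.14x faster on Intel Core. The Intel Core difference (0.26 s) is smaller than the ±383 ms spread of the new run, so that figure should be read with caution.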
[x] Have you provided a meaningful PR description?
[x] Have you added a test, reproducer or referred to issue with a reproducer?
[x] Have you tested your changes locally for CPU and GPU devices?
[x] Have you made sure that new changes do not introduce compiler warnings?
[x] Have you checked performance impact of proposed changes?
[ ] If this PR is a work in progress, are you filing the PR as a draft?
This PR resolves issue #1871.