resolve gh-1871

This PR resolves issue #1871.

It also improves performance for some special cases, for example `dpnp.matmul` on two F-contiguous square matrices:

>>> import dpnp
>>> size = 4096
>>> device = "gpu"
>>> a = dpnp.ones((size, size), order="F", device=device)
>>> b = dpnp.ones((size, size), order="F", device=device)
>>> %timeit dpnp.matmul(a, b)

New implementation
Iris Xe: 142 ms ± 6.03 ms 
Intel Core: 1.81 s ± 383 ms 

Old dpnp
Iris Xe: 156 ms ± 3.38 ms 
Intel Core: 2.07 s ± 69.2 ms
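
The timings above were collected with IPython's `%timeit`. For anyone who wants to repeat the measurement from a plain script, here is a minimal sketch using the standard `timeit` module. The helper name `time_matmul` and its parameters are hypothetical and not part of dpnp. NumPy is used as a stand-in so the script runs without a GPU; passing `dpnp` together with `device="gpu"` should mirror the snippet above, but that path is an assumption and was not verified here.

```python
import timeit

import numpy as np


def time_matmul(xp, size=256, order="F", repeat=3, number=1, **array_kwargs):
    """Return the best per-call wall time (seconds) of xp.matmul.

    xp is an array module exposing ones() and matmul(), e.g. numpy.
    array_kwargs are forwarded to xp.ones (for dpnp this could be
    device="gpu"; that usage is an assumption, not tested here).
    """
    a = xp.ones((size, size), order=order, **array_kwargs)
    b = xp.ones((size, size), order=order, **array_kwargs)
    # timeit.repeat returns one total time per repetition.
    timings = timeit.repeat(lambda: xp.matmul(a, b), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    best = time_matmul(np, size=256)
    print(f"matmul on F-ordered 256x256 arrays: {best * 1e3:.3f} ms")
```

Taking the minimum over repetitions reduces noise from other processes. If the array library executes work asynchronously, an explicit synchronization step may be needed for the numbers to be meaningful.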

[x] Have you provided a meaningful PR description?
[x] Have you added a test, reproducer or referred to issue with a reproducer?
[x] Have you tested your changes locally for CPU and GPU devices?
[x] Have you made sure that new changes do not introduce compiler warnings?
[x] Have you checked performance impact of proposed changes?
[ ] If this PR is a work in progress, are you filing the PR as a draft?

IntelPython / dpnp

resolve gh-1871 #1872