IntelPython / dpctl

Python SYCL bindings and SYCL-based Python Array API library
https://intelpython.github.io/dpctl/
Apache License 2.0
97 stars 29 forks source link

Reshape improvements #1677

Closed oleksandr-pavlyk closed 1 month ago

oleksandr-pavlyk commented 1 month ago

Closes gh-1664.

Improves performance of dpt.reshape(X, new_shape, order="F") when copy is needed.

github-actions[bot] commented 1 month ago

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. :crossed_fingers:

github-actions[bot] commented 1 month ago

Array API standard conformance tests for dpctl=0.17.0dev0=py310h15de555_355 ran successfully. Passed: 887 Failed: 18 Skipped: 91

coveralls commented 1 month ago

Coverage Status

coverage: 87.951% (+0.003%) from 87.948% when pulling 9d2633fb14f33a5dcad762fa56023e7c6b34708d on reshape-improvements into 7bc3124e879029b2defe99b6bfb2166505048aa3 on master.

antonwolfy commented 1 month ago

53 tests affecting call of gemm function are failed in dpnp with that PR. But I believe this might be due to some unproper internal logic in dpnp.

The first issue I've found is incorrect result returned below:

import dpnp
from dpnp.dpnp_utils.dpnp_utils_linearalgebra import _define_contig_flag

a = dpnp.arange(3).reshape((1, 3, 1))
_define_contig_flag(a)
# Out: False

but it's expected to return True, because a is both C-contiguous or F-contiguous along last two dimensions. It will result in unexpected extra memory allocation to copy array a to temporary C-contiguous array.

The second one is somewhere in _gemm_batch implementation:

import dpctl, dpctl.tensor as dpt
import dpnp.backend.extensions.blas._blas_impl as bi

a = dpt.reshape(dpt.arange(60, dtype='f4'), (5, 4, 3))
b = dpt.reshape(dpt.arange(3, dtype='f4'), (1, 3, 1))
b = dpt.copy(b)
c = dpt.zeros((5, 4, 1))

a.strides, b.strides, c.strides
# Out: ((12, 3, 1), (3, 1, 1), (4, 1, 1))

ev,  _, _ = bi._gemm_batch(a.sycl_queue, a, b, c)
ev.wait()

c
# Out:
# usm_ndarray([[[ 5.],
#               [14.],
#               [23.],
#               [32.]],
# 
#              [[ 0.],
#               [ 0.],
#               [ 0.],
#               [ 0.]],
# 
#              [[ 0.],
#               [ 0.],
#               [ 0.],
#               [ 0.]],
# 
#              [[ 0.],
#               [ 0.],
#               [ 0.],
#               [ 0.]],
# 
#              [[ 0.],
#               [ 0.],
#               [ 0.],
#               [ 0.]]], dtype=float32)

@vtavana , could you please look on that?

vtavana commented 1 month ago

53 tests affecting call of gemm function are failed in dpnp with that PR. But I believe this might be due to some unproper internal logic in dpnp.

The necessary changes are implemented in dpnp-gh-1828. Relevant tests in dpnp are now passed with both master branch of dpctl and this branch.

ndgrigorian commented 1 month ago

The first issue I've found is incorrect result returned below:

import dpnp
from dpnp.dpnp_utils.dpnp_utils_linearalgebra import _define_contig_flag

a = dpnp.arange(3).reshape((1, 3, 1))
_define_contig_flag(a)
# Out: False

but it's expected to return True, because a is both C-contiguous or F-contiguous along last two dimensions. It will result in unexpected extra memory allocation to copy array a to temporary C-contiguous array.

First case is working correctly in dpctl

In [1]: import dpctl.tensor as dpt, dpctl, numpy as np

In [2]: x = dpt.reshape(dpt.arange(10), (1, 10, 1))

In [3]: x.flags.contiguous
Out[3]: True

In [4]: x.flags
Out[4]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  WRITABLE : True

@vtavana does dpnp-gh-1828 address this case too?

vtavana commented 1 month ago

First case is working correctly in dpctl

In [1]: import dpctl.tensor as dpt, dpctl, numpy as np

In [2]: x = dpt.reshape(dpt.arange(10), (1, 10, 1))

In [3]: x.flags.contiguous
Out[3]: True

In [4]: x.flags
Out[4]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  WRITABLE : True

@vtavana does dpnp-gh-1828 address this case too?

Yes, both examples provided by @antonwolfy also work fine in dpnp-gh-1828.

As a side note, _define_contig_flag from dpnp is used in batch calculation. And the goal is to check if each 2D array that forms the N-D array is f-contiguous or c-contiguous. So, we do not use built-in flag there.