`floor_divide` returns different result for arrays of floating dtype on GPU and CPU devices

antonwolfy commented 6 months ago

In below example the behavior is different between CPU and GPU devices:

import numpy, dpctl, dpctl.tensor as dpt

dpctl.__version__
# Out: '0.17.0dev0+300.g7757857466'

a = dpt.arange(1, 10, dtype='f', device='gpu')
b = dpt.arange(1, 10, dtype='f', device='gpu')
dpt.floor_divide(a, b)
# Out: usm_ndarray([1., 1., 1., 1., 1., 1., 0., 1., 1.], dtype=float32)

a = dpt.arange(1, 10, dtype='f', device='cpu')
b = dpt.arange(1, 10, dtype='f', device='cpu')
dpt.floor_divide(a, b)
# Out: usm_ndarray([1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)

na = numpy.arange(1, 10, dtype='f')
nb = numpy.arange(1, 10, dtype='f')
numpy.floor_divide(na, nb)
# Out: array([1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)

So we have 0 as 7th element of the result array for GPU device and 1 on CPU and in numpy.

If we look into divide function output:

a = dpt.arange(1, 10, dtype='f')
b = dpt.arange(1, 10, dtype='f')

dpt.divide(a, b)
# Out:
# usm_ndarray([1.        , 1.        , 1.        , 1.        , 1.        ,
#             1.        , 0.99999994, 1.        , 1.        ], dtype=float32)

there will be the value 0.99999994 < 1. for GPU device. Based on the code:

auto div = in1 / in2;
return (div == resT(0)) ? div : resT(sycl::floor(div));

dpctl uses sycl::floor() function, which is intended to return

The value x rounded to an integral value using the round to negative infinity rounding mode

And I guess this is the reason why 0.99999994 rounds to 0 here. While in Python array API it states that:

Rounds the result of dividing each element x1_i of the input array x1 by the respective element x2_i of the input array x2 to the greatest (i.e., closest to +infinity) integer-value number that is not greater than the division result.

Thus I wonder if it is expected dpctl behavior or an issue.

ndgrigorian commented 6 months ago

The rounding mode is not exactly at fault here. Per array API

Rounds the result of dividing each element x1_i of the input array x1 by the respective element x2_i of the input array x2 to the greatest (i.e., closest to +infinity) integer-value number that is not greater than the division result

In this case, 1.0 > 0.99999994, so 0.0 is the appropriate result. So the behavior checks out per array API. The surprising result is caused by the division itself being inaccurate, possibly due to lower precision on GPU devices.

antonwolfy commented 6 months ago

@ndgrigorian, now I see, thank you for the clarification. Would it be worst then to have a special handling in the code? something like

if (sycl::fmod(in1, in2) == 0) {
    return resT(std::rint(in1/in2);
}

IntelPython / dpctl

`floor_divide` returns different result for arrays of floating dtype on GPU and CPU devices #1652