bh107 / bohrium

Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX
http://www.bh107.org
Apache License 2.0
219 stars 31 forks source link

Wrong results when passing views to user kernel #573

Closed dionhaefner closed 5 years ago

dionhaefner commented 5 years ago

Following the examples, my initial version of the TDMA extmethod as a user kernel was something like this:

from string import Template

KERNEL = Template('''
#include <stddef.h>
#include <stdint.h>

void execute(
    $DTYPE *a,
    $DTYPE *b,
    $DTYPE *c,
    $DTYPE *d,
    $DTYPE *solution
){
    const uint64_t m = $SYS_DEPTH;
    #pragma omp parallel for
    for(size_t idx = 0; idx < $SIZE; idx += m) {
        $DTYPE cp[m];
        cp[0] = c[idx] / b[idx];
        solution[idx] = d[idx] / b[idx];

        for (int j=1; j < m; ++j) {
            const $DTYPE denom = b[idx + j] - a[idx + j] * cp[j-1];
            cp[j] = c[idx + j] / denom;
            solution[idx + j] = (d[idx + j] - a[idx + j] * solution[idx + j-1]) / denom;
        }

        for (int j=m-2; j >= 0; --j) {
            solution[idx + j] -= cp[j] * solution[idx + j+1];
        }
    }
}
'''.strip())

def tdma(a, b, c, d):
    import bohrium as bh
    a, b, c, d = (bh.array(k, order='C') for k in (a, b, c, d))
    ctype = bh.user_kernel.dtype_to_c99(a.dtype)
    kernel = KERNEL.substitute(DTYPE=ctype, SYS_DEPTH=a.shape[-1], SIZE=a.size)
    res = bh.empty_like(a)
    bh.user_kernel.execute(kernel, [a, b, c, d, res])
    return res

if __name__ == '__main__':
    import numpy as np
    import bohrium as bh

    a, b, c, d = bh.random.rand(4, 360 * 180, 120)
    assert np.allclose(
        bh.linalg.solve_tridiagonal(a, b, c, d),
        tdma(a, b, c, d)
    )

(the important line being a, b, c, d = (bh.array(k, order='C') for k in (a, b, c, d)))

This formulation returns an array full of NaN. Looking at the flags of e.g. the array a I got

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Note how everything is as expected, except OWNDATA: False, because of a, b, c, d = bh.random.rand(4, 360 * 180, 120). The problem was fixed by replacing

- a, b, c, d = (bh.array(k, order='C') for k in (a, b, c, d))
+ if not a.flags.owndata or not a.flags.c_contiguous:
+     a, b, c, d = (k.copy() for k in (a, b, c, d))

I don't quite understand why it wouldn't work when passing a view of a contiguous array. So this is either a bug, or we should adapt the examples in the docs to include a similar check.

madsbk commented 5 years ago

It is because a, b, c, and d points to the same base array when using:

 a, b, c, d = bh.random.rand(4, 360 * 180, 120)

You have to add either an offset in the kernel or use something like this instead:

a, b, c, d = (bh.random.rand(360 * 180, 120) for k in range(4))

Having said that, it might be a good idea to have an optional argument only_contiguous_operands=True so that execute() raise an error when it is set?

dionhaefner commented 5 years ago

I should have been clearer, sorry. The surprising behavior here is that

a, b, c, d = (bh.array(k, order='C') for k in (a, b, c, d))

is not sufficient to ensure data integrity (contrary to what is shown in the examples). A fix would either be the flag you suggest, better documentation, or making Bohrium add the correct offset automatically when passing a contiguous view into a base array to bh.user_kernel.execute.

madsbk commented 5 years ago

My first though was also to let bh.user_kernel.execute() correct the offset, but that will not work in OpenCL and CUDA easily. I think I will update the doc and add the only_contiguous_operands for now. Thanks