fjarri / reikna

Pure Python GPGPU library
http://reikna.publicfields.net/
MIT License
164 stars 16 forks source link

Scan with offset arrays has inconsistent requirements #32

Closed robertmaxton42 closed 6 years ago

robertmaxton42 commented 6 years ago

Hopefully a more intelligent question this time!

When called in a plan on an array with a nonzero offset, Scan gives inconsistent requirements for its output array. If given an output array with the same offset, it fails at plan.computation_call(transpose_from, output, transposed_scanned), which apparently expects offset=0; if given an output with zero offset, then it fails at argnames = self._process_computation_arguments(signature, args, kwds), where it's expecting an offset equal to the original array's.

A minimal reproduction follows:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gp
import reikna.cluda as cluda
from reikna.cluda import Snippet
from reikna.algorithms import Scan, Predicate, predicate_sum
from reikna.helpers import *
from reikna.core import *

api = cluda.cuda_api()
thr = api.Thread(pycuda.autoinit.context)

class DummyComp(Computation):
    def __init__(self, arr):
        Computation.__init__(self, [
            Parameter('out', Annotation(Type(arr.dtype, (arr.shape[0] + 1, arr.shape[1]), offset=arr.strides[0]), 'o')),
            Parameter('arr', Annotation(arr, 'i'))
        ])
    def _build_plan(self, plan_factory, device_params, out, arr):
        plan = plan_factory()

        template = template_from(
            """
            <%def name='copyoffset(kernel_declaration, k_out, k_arr)'>
            ${kernel_declaration}
            {
                VIRTUAL_SKIP_THREADS;
                VSIZE_T i = virtual_global_id(0);
                for(VSIZE_T j = 0; j < ${k_arr.shape[1]}; j++){
                    ${k_arr.store_idx}(i, j, ${k_out.load_idx}(i, j));
                }
            }
            </%def>
            """)

        plan.kernel_call(
            template.get_def('copyoffset'),
            [out, arr],
            global_size=(arr.shape[0])
        )

        return plan

class TestComp(Computation):
    def __init__(self, arr):
        Computation.__init__(self, [
                Parameter('arr', Annotation(arr, 'i'))
            ])
    def _build_plan(self, plan_factory, device_params, arr):
        plan = plan_factory()

        dummy = DummyComp(arr)
        temp = plan.temp_array_like(dummy.parameter.out)

        scan = Scan(temp, predicate_sum(np.ubyte), axes=0)

        #Uncommenting this line throws:
        #TypeError: Got Annotation(Type(uint8, shape=(33, 32), strides=(32, 1), offset=32), role='o') for 
        #'output', expected Annotation(Type(uint8, shape=(33, 32), strides=(32, 1), offset=0), role='o')
        #temp2 = plan.temp_array_like(scan.parameter.output) 

        #Uncommenting this line throws:
        #TypeError: Got Annotation(Type(uint8, shape=(33, 32), strides=(32, 1), offset=0), role='io') for 
        #'output', expected Annotation(Type(uint8, shape=(33, 32), strides=(32, 1), offset=32), role='o')
        #temp2 = plan.temp_array(temp.shape, temp.dtype, offset = 0)

        plan.computation_call(dummy, temp, arr)
        plan.computation_call(scan, temp2, temp)

        return plan

arr = gp.to_gpu(np.ones((32, 32), np.ubyte))
out = gp.zeros_like(arr)
tester = TestComp(arr).compile(thr)
fjarri commented 6 years ago

Custom offsets/strides are kind of an experimental feature, and not all built-in computations support them (partially because I have not decided on a uniform way of specifying them).

In your code the correct line would be the first one, because it requests a temporary array with the offset equal to that of the output parameter of Scan, as it should. Since you're scanning over an outer axis, Scan calls three kernels - transpose, scan itself, and transpose again. The Transpose computation only takes the input array as a parameter, and its expected output array is generated based only on the input's shape and dtype. Thus the second transpose expects something with offset=0 as an output, but it gets passed an array with offset!=0 in the _build_plan() method of Scan (because Scan uses exactly the same array as the input and the output, including the strides and the offset), and that's where the error is thrown.

Now this part is easy to fix (I will push a commit shortly). The second problem is that GPUArray does not support custom offsets. There seems to be a way to still do it, by passing a modified memory pointer to it, but I am not 100% sure it won't cause problems somewhere else; I need to test it. If your code can be safely switched to OpenCL, you can test offsets there.

fjarri commented 6 years ago

BTW, in your Dummy computation, you, perhaps, meant to write

${k_out.store_idx}(i, j, ${k_arr.load_idx}(i, j));
fjarri commented 6 years ago

Actually it seems that supporting offsets in CUDA arrays is easier than I expected. Could you check if your code works now?

robertmaxton42 commented 6 years ago

Indeed I did. And... yup, it works now! Thanks!

fjarri commented 6 years ago

No problem, closing the issue then.