[Bounty $30] Host memory not copied to device causing slowdown

Copied this bug over from here: https://stackoverflow.com/questions/39980333/opencl-aparapi-simple-reduction-slow-on-radeon

I am trying to code a simple reduction (in this case a sum) over a large double array in OpenCL. I have looked at online tutorials and found that this is essentially the way to solve my problem:

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    typedef struct This_s{
       __global double *nums;
       int nums__javaArrayLength;
       __local double *buffer;
       __global double *res;
       int passid;
    }This;
    int get_pass_id(This *this){
       return this->passid;
    }
    __kernel void run(
       __global double *nums, 
       int nums__javaArrayLength, 
       __local double *buffer, 
       __global double *res, 
       int passid
    ){
       This thisStruct;
       This* this=&thisStruct;
       this->nums = nums;
       this->nums__javaArrayLength = nums__javaArrayLength;
       this->buffer = buffer;
       this->res = res;
       this->passid = passid;
       {
          int tid = get_local_id(0);
          int i = (get_group_id(0) * get_local_size(0)) + get_local_id(0);
          int gridSize = get_local_size(0) * get_num_groups(0);
          int n = this->nums__javaArrayLength;
          double cur = 0.0;
          for (; i<n; i = i + gridSize){
             cur = cur + this->nums[i];
          }
          this->buffer[tid]  = cur;
          barrier(CLK_LOCAL_MEM_FENCE);
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid<32){
             this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 32)];
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid<16){
             this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 16)];
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid<8){
             this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 8)];
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid<4){
             this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 4)];
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid<2){
             this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 2)];
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid<1){
             this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 1)];
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (tid==0){
             this->res[get_group_id(0)]  = this->buffer[0];
          }
          return;
       }
    }

If you are wondering about the strange this, that is an (unfortunately necessary) artifact of aparapi, which I use to translate Java to OpenCL.

My kernel produces the correct results and, on reasonably beefy Nvidia Hardware, it is about 10x faster than a sequential sum in Java. On a Radeon R9 280 however it is comparable in performance to the simple Java code.

I have profiled the kernel with CodeXL. It tells me that MemUnitBusy is at just 6%. Why is it so low?

Update: Turns out OpenCL is not (directly) at fault, but aparapis buffer management is.

I tried out the exact same kernel without aparapi, and the performance is good. It turns bad as soon as I use CL_MEM_USE_HOST_PTR, which is sadly the only option when using aparapi. It seems AMD is not copying host memory to the device with that option, even after several "warmup" runs.

Syncleus / aparapi

[Bounty $30] Host memory not copied to device causing slowdown #107