I am trying to code a simple reduction (in this case a sum) over a large double array in OpenCL. I have looked at online tutorials and found that this is essentially the way to solve my problem:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef struct This_s{
__global double *nums;
int nums__javaArrayLength;
__local double *buffer;
__global double *res;
int passid;
}This;
int get_pass_id(This *this){
return this->passid;
}
__kernel void run(
__global double *nums,
int nums__javaArrayLength,
__local double *buffer,
__global double *res,
int passid
){
This thisStruct;
This* this=&thisStruct;
this->nums = nums;
this->nums__javaArrayLength = nums__javaArrayLength;
this->buffer = buffer;
this->res = res;
this->passid = passid;
{
int tid = get_local_id(0);
int i = (get_group_id(0) * get_local_size(0)) + get_local_id(0);
int gridSize = get_local_size(0) * get_num_groups(0);
int n = this->nums__javaArrayLength;
double cur = 0.0;
for (; i<n; i = i + gridSize){
cur = cur + this->nums[i];
}
this->buffer[tid] = cur;
barrier(CLK_LOCAL_MEM_FENCE);
barrier(CLK_LOCAL_MEM_FENCE);
if (tid<32){
this->buffer[tid] = this->buffer[tid] + this->buffer[(tid + 32)];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (tid<16){
this->buffer[tid] = this->buffer[tid] + this->buffer[(tid + 16)];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (tid<8){
this->buffer[tid] = this->buffer[tid] + this->buffer[(tid + 8)];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (tid<4){
this->buffer[tid] = this->buffer[tid] + this->buffer[(tid + 4)];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (tid<2){
this->buffer[tid] = this->buffer[tid] + this->buffer[(tid + 2)];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (tid<1){
this->buffer[tid] = this->buffer[tid] + this->buffer[(tid + 1)];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (tid==0){
this->res[get_group_id(0)] = this->buffer[0];
}
return;
}
}
If you are wondering about the strange this, that is an (unfortunately necessary) artifact of aparapi, which I use to translate Java to OpenCL.
My kernel produces the correct results and, on reasonably beefy Nvidia Hardware, it is about 10x faster than a sequential sum in Java. On a Radeon R9 280 however it is comparable in performance to the simple Java code.
I have profiled the kernel with CodeXL. It tells me that MemUnitBusy is at just 6%. Why is it so low?
Update: Turns out OpenCL is not (directly) at fault, but aparapis buffer management is.
I tried out the exact same kernel without aparapi, and the performance is good. It turns bad as soon as I use CL_MEM_USE_HOST_PTR, which is sadly the only option when using aparapi. It seems AMD is not copying host memory to the device with that option, even after several "warmup" runs.
Copied this bug over from here: https://stackoverflow.com/questions/39980333/opencl-aparapi-simple-reduction-slow-on-radeon
I am trying to code a simple reduction (in this case a sum) over a large double array in OpenCL. I have looked at online tutorials and found that this is essentially the way to solve my problem:
If you are wondering about the strange
this
, that is an (unfortunately necessary) artifact of aparapi, which I use to translate Java to OpenCL.My kernel produces the correct results and, on reasonably beefy Nvidia Hardware, it is about 10x faster than a sequential sum in Java. On a Radeon R9 280 however it is comparable in performance to the simple Java code.
I have profiled the kernel with CodeXL. It tells me that MemUnitBusy is at just 6%. Why is it so low?
Update: Turns out OpenCL is not (directly) at fault, but aparapis buffer management is.
I tried out the exact same kernel without aparapi, and the performance is good. It turns bad as soon as I use CL_MEM_USE_HOST_PTR, which is sadly the only option when using aparapi. It seems AMD is not copying host memory to the device with that option, even after several "warmup" runs.