ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License

Errors using CL_DEVICE_TYPE_CPU on Mac OSX Yosemite #158

Open lajash opened 9 years ago

lajash commented 9 years ago

Running compiler_bug.cpp from the command line with:

g++ -o compiler_bug compiler_bug.cpp -std=c++0x -I OpenHeaders -framework OpenCL && ./compiler_bug

produces:

Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
OpenCL compilation error
CVMS_ERROR_COMPILER_FAILURE: CVMS compiler has crashed or hung building an element.
clBuildProgram

Running stencil operators on the CPU also causes crashes. The same code runs on the GPU.

I'm testing this on a MacBook Pro with an i7 CPU and an ATI Radeon HD 6750M GPU.

ddemidov commented 9 years ago

Can you share the source of compiler_bug.cpp here?

lajash commented 9 years ago

Hi Denis,

It's the one from your gist: https://gist.github.com/ddemidov/8681608

Thanks, Rajesh


ddemidov commented 9 years ago

In that case it's a known issue (see #92). That issue has never been resolved as far as I know. I would raise a ticket with Apple support, since the code in the gist is standard OpenCL and should work.

Edit: Also, I don't have access to a Mac OS X machine, so there is not much I can do here.

lajash commented 9 years ago

Ah, ok. I've looked at #92. I'll change the filter to CL_DEVICE_TYPE_GPU for the Mac platform for now. Not sure if this applies to the new Macs though.

Thanks for your help,

Rajesh.


lajash commented 9 years ago

Closing this issue as it seems to be a bug in the Apple OpenCL framework. FYI, the 10.10.2 update also does not fix this.

ddemidov commented 9 years ago

Thank you for the info! If you are able to find a workaround, I would be glad to either introduce it to vexcl or accept a pull request.

lajash commented 9 years ago

Looking into it… will keep you updated.


lajash commented 9 years ago

Denis, interestingly, the default programs provided in the Xcode samples all run fine on the CPU. I've just looked at the compiler and program options and don't see anything special there either.

Could it be workgroup/queue related? Never mind, I'll check it myself without bothering you.

ddemidov commented 9 years ago

I think Apple's OpenCL implementation does not support workgroups of more than one work-item on CPUs (VexCL accounts for this restriction for kernels on CPU devices), but in the gist the kernel is never launched, since it fails at the compilation step.

lajash commented 9 years ago

Here's what works on the CPU thus far ... (taken from your examples, of course)

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept>

#define __CL_ENABLE_EXCEPTIONS
#include <vexcl/vexcl.hpp>

//---------------------------------------------------------------------------
int main() 
{
    const size_t n = 1024 * 1024;
    vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_CPU) );

    std::vector<double> a(n, 1.0);
    std::vector<double> c(n, 0.5);

    std::vector<double> results(n);

    vex::vector<double> A(ctx.queue(), a);
    vex::vector<double> B(ctx.queue(), n); // note: allocated but left uninitialized
    vex::vector<double> C(ctx.queue(), c);

    A = (B + C) / 5;
    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;

    VEX_FUNCTION(double, squared_radius, (double, x)(double, y),
    return x * x + y * y;
    );

    A = sqrt(squared_radius(B, C));
    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;    

    VEX_STENCIL_OPERATOR(S, /*return type:*/double, /*window width:*/3, /*center:*/1,
    "return sin(X[0] - X[-1]) + sin(X[1] - X[0]);", ctx);
    A = S(A);

    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;    
}

I'm going to try writing a few custom kernels that use the stencil window to see if I can break anything. Let me know if this makes any sense.

ddemidov commented 9 years ago

I think it would make more sense to run the unit tests distributed with vexcl. You can do this with:

cd $VEXCL_ROOT
mkdir build
cd build
cmake ..
make
OCL_DEVICE=i7 VEXCL_SHOW_KERNELS=1 make test

After that, the test log can be found at Testing/Temporary/LastTest.log. If you upload it to e.g. gist.github.com, we can concentrate on the failing tests.

lajash commented 9 years ago

Hi Denis, here goes ... https://gist.github.com/lajash/59d9a2f489d2aa05f1e9

ddemidov commented 9 years ago

So the failing kernels I can see are:

In vector/multivector arithmetics:

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif

kernel void vexcl_vector_kernel
(
  ulong n,
  global double * prm_1
)
{
  ulong chunk_size  = (n + get_global_size(0) - 1) / get_global_size(0);
  ulong chunk_start = get_global_id(0) * chunk_size;
  ulong chunk_end   = chunk_start + chunk_size;
  if (n < chunk_end) chunk_end = n;
  for(ulong idx = chunk_start; idx < chunk_end; ++idx)
  {
    prm_1[idx] = 42;
  }
}

This is the kernel from the gist above. Btw, I've had another idea worth testing about this kernel, see below.

The Boost.Compute integration example (sort function call) fails due to a wrong workgroup size (Apple only supports workgroups with a single item on CPUs). I would run the unit tests from boost.compute and report any failures to @kylelutz.

The FFT test could also be failing due to a wrong workgroup size. I'll see if using a workgroup of a single item makes any sense there.

About the failing assignment test: it occurred to me this could be due to a type mismatch: prm_1[idx] is double, and 42 is int. Could you please try to compile and run the code from the gist again? I have updated it to use 42.0 instead of 42.

lajash commented 9 years ago

Will do … will get back to you shortly…


lajash commented 9 years ago

After using the updated compiler_bug.cpp from your gist:

Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
OpenCL compilation error
CVMS_ERROR_SERVICE_FAILURE: CVMS compiler has crashed or hung managing the service.
clBuildProgram

So basically, it makes no difference.

ddemidov commented 9 years ago

I don't see anything wrong with this kernel, and it works on every other OpenCL platform I have access to. I believe there is nothing left to do here but open an issue with Apple support.

Regarding the FFT issue: it does work correctly when the workgroup size is set to 1. So could you please check if the fft tests pass for you on the issue-158-fft branch?

Note however that (according to examples/fft_benchmark.cpp) VexCL's implementation of FFT is about two orders of magnitude slower than fftw on a CPU, so there is probably no reason to use it on a CPU anyway.

lajash commented 9 years ago

You're right ... there's an issue with the amount of data being transferred. See this gist: https://gist.github.com/lajash/1645b473676633b35d9e

clEnqueueNDRangeKernel fails with the larger dataset.

lajash commented 9 years ago

Checking fft now ... you're right, it makes no sense to use vex::fft if no GPU is involved. Will you be working on optimizing it in the future?

ddemidov commented 9 years ago

Re fft optimization: I don't think it makes sense when fftw is available. On a CPU one can just map the device memory to a host pointer and then use fftw (or any other host-side algorithm) on device vectors (see the example here). Also, the FFT implementation was provided by @neapel, so he could probably chime in here.

lajash commented 9 years ago

New test log added here.... https://gist.github.com/lajash/991c1bd6a1fc9d3ffa95

Doesn't look like it fixed anything though. :(

lajash commented 9 years ago

Quick update ... the code at https://gist.github.com/lajash/1645b473676633b35d9e runs on my Mac now ... I just tried running it multiple times and voila, it runs in 1 out of 4 tries ... but it's extremely slow! So it looks like it's Apple's ICD that may have an issue with the CPU. This works on all other platforms, I presume?

ddemidov commented 9 years ago

Your result vector is 100 times smaller than A, so you should get an out-of-bounds access and a segfault here.

Stencil operators use a slow path when run on a CPU, which may be significant for wide stencils like yours. The function-based variant runs 1.5x faster on my CPU than the stencil-based one, and only slightly slower (about 15%) on a GPU.

lajash commented 9 years ago

Thanks Denis ... will use the function variants.
