ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License

Cannot run examples on Mac OS X ; Exception logic_error is raised #33

Closed Quanteek closed 11 years ago

Quanteek commented 11 years ago

Hello

I am trying to run VexCL on a MacBook Pro with Snow Leopard.

I compiled it with g++-4.7 and boost 1.53.

I can run examples/devlist successfully, but examples/benchmark crashes immediately on the line a += b + c * d; of benchmark_vector() with this exception:

terminate called after throwing an instance of 'std::logic_error'
  what():  Trying to use an undefined type in a kernel.
Abort trap: 6

Do you have any idea why?

All other examples crash the same way.

Thanks :)

ddemidov commented 11 years ago

Hello,

I cannot reproduce the issue on Linux/gcc-4.7.2/Boost-1.53, but I think it comes from one of the recent commits. Could you please check if version 19acba25 works for you?

Quanteek commented 11 years ago

Not working either :/

I went back to 1f527894479beab45 (8th April) and it is still not working. Maybe a problem with my libraries?

On 1f527894479beab45, the backtrace points to a problem at this line: COMPOUND_ASSIGNMENT(+=, +);

On the master HEAD version (56e46f2c3687cb), here is the backtrace:

#0  0x00000001003864a8 in __cxa_throw ()
#1  0x0000000100013d29 in vex::type_name<unsigned long> () at util.hpp:67
#2  0x000000010001f054 in vex::vector<double>::assign_expression<vex::assign::ADD, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::plus, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector<double>&>, 2l> > >, 2l> > > (this=0x7fff5fbfecf0, expr=@0x7fff5fbff030) at vector.hpp:777
#3  0x0000000100014d3d in vex::vector<double>::operator+=<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::plus, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector<double>&>, 2l> > >, 2l> > > (this=0x7fff5fbfecf0, expr=@0x7fff5fbff030) at vector.hpp:552
#4  0x00000001000052b0 in benchmark_vector (queue=@0x7fff5fbff248, prof=@0x7fff5fbff260) at /Users/benoitbayol/Code/vexcl/examples/benchmark.cpp:51
#5  0x00000001000093b4 in main () at /Users/benoitbayol/Code/vexcl/examples/benchmark.cpp:502

Description:

g++-mp-4.7 (MacPorts gcc47 4.7.3_0) 4.7.3
boost 1.53.0
The OpenCL of Snow Leopard is in /System/Library/Framework/OpenCL.framework/

./examples/devlist gives:

OpenCL devices:

  Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
    CL_PLATFORM_NAME              = Apple
    CL_DEVICE_MAX_COMPUTE_UNITS   = 8
    CL_DEVICE_HOST_UNIFIED_MEMORY = 1
    CL_DEVICE_GLOBAL_MEM_SIZE     = 8589934592
    CL_DEVICE_LOCAL_MEM_SIZE      = 32768
    CL_DEVICE_MAX_MEM_ALLOC_SIZE  = 2147483648
    CL_DEVICE_MAX_CLOCK_FREQUENCY = 2000

  ATI Radeon HD 6490M
    CL_PLATFORM_NAME              = Apple
    CL_DEVICE_MAX_COMPUTE_UNITS   = 2
    CL_DEVICE_HOST_UNIFIED_MEMORY = 0
    CL_DEVICE_GLOBAL_MEM_SIZE     = 268435456
    CL_DEVICE_LOCAL_MEM_SIZE      = 32768
    CL_DEVICE_MAX_MEM_ALLOC_SIZE  = 134217728
    CL_DEVICE_MAX_CLOCK_FREQUENCY = 750

ddemidov commented 11 years ago

Could you wrap the output in triple "`" to get a verbatim block of text? Without it, some angle brackets are lost. I am especially interested in what type is shown here:

#1 0x0000000100013d29 in vex::type_name<???> () at util.hpp:67

Quanteek commented 11 years ago

Done !

See previous comment :)

ddemidov commented 11 years ago

This seems to be the same issue as in #14. Can you please check if 89e785d works for you?
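
The frame vex::type_name<unsigned long> in the backtrace above suggests a host type that has no OpenCL type name assigned. A minimal sketch of how that can happen, assuming (this is not confirmed in this thread) a mapping whose primary template throws and which is specialized only for particular types. The names below are hypothetical stand-ins, not VexCL's actual code. On 64-bit OS X, size_t is unsigned long while the 64-bit fixed-width unsigned integer is unsigned long long, so the two are distinct types even though they have the same size:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical host-type -> OpenCL-type-name mapping.
// The primary template throws, mirroring the message in the report.
template <class T>
std::string type_name() {
    throw std::logic_error("Trying to use an undefined type in a kernel.");
}

// Only explicitly specialized types get a name in this sketch.
template <> std::string type_name<double>() { return "double"; }
template <> std::string type_name<unsigned long long>() { return "ulong"; }

// Returns true if type_name<T>() has a mapping in this sketch.
template <class T>
bool is_mapped() {
    try {
        type_name<T>();
        return true;
    } catch (const std::logic_error&) {
        return false;
    }
}
```

In this sketch, is_mapped<unsigned long long>() is true and is_mapped<unsigned long>() is false, regardless of whether both types are 64 bits wide on the platform.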

Quanteek commented 11 years ago

It is working with 89e785d

Output on the CPU is:

1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Vector arithmetic
  OpenCL
    GFLOPS:    1.15543
    Bandwidth: 15.4058
  C++
    GFLOPS:    1.17702
    Bandwidth: 15.6935
  res = 0

Reduction
  OpenCL
    GFLOPS:    2.22721
    Bandwidth: 17.8177
  C++
    GFLOPS:    1.47788
    Bandwidth: 11.8231
  res = 4.76837e-07

Stencil convolution
  OpenCL
    GFLOPS:    2.43398
    Bandwidth: 19.4718
  C++
    GFLOPS:    1.26781
    Bandwidth: 10.1425
  res = 0

SpMV
  OpenCL
    GFLOPS:    0.753692
    Bandwidth: 10.0877
  C++
    GFLOPS:    1.08501
    Bandwidth: 14.5223
  res = 0

SpMV (CCSR)
  OpenCL
    GFLOPS:    1.03288
    Bandwidth: 11.9018
  C++
    GFLOPS:    1.09799
    Bandwidth: 12.652
  res = 0

[Profile:               192.302 sec.] (100.00%)
[  Vector arithmetic:     6.057 sec.] (  3.15%)
[   self:                 0.533 sec.] (  0.28%)
[    OpenCL:              2.788 sec.] (  1.45%)
[    C++:                 2.737 sec.] (  1.42%)
[  Reduction:             3.214 sec.] (  1.67%)
[   self:                 0.797 sec.] (  0.41%)
[    OpenCL:              0.964 sec.] (  0.50%)
[    C++:                 1.453 sec.] (  0.76%)
[  Stencil:              54.327 sec.] ( 28.25%)
[   self:                 0.228 sec.] (  0.12%)
[    OpenCL:             18.528 sec.] (  9.63%)
[    C++:                35.571 sec.] ( 18.50%)
[  SpMV:                 70.266 sec.] ( 36.54%)
[   self:                 0.512 sec.] (  0.27%)
[    OpenCL:             41.161 sec.] ( 21.40%)
[    C++:                28.592 sec.] ( 14.87%)
[  SpMV (CCSR):          58.438 sec.] ( 30.39%)
[    OpenCL:             30.035 sec.] ( 15.62%)
[    C++:                28.254 sec.] ( 14.69%)

but for the graphics card I get:

1. ATI Radeon HD 6490M

Vector arithmetic
  OpenCL
    GFLOPS:    2.39215
    Bandwidth: 31.8953
  C++
    GFLOPS:    1.21317
    Bandwidth: 16.1756
  res = nan

clEnqueueNDRangeKernel(Mem object allocation failure)

but it is not the same problem :)

ddemidov commented 11 years ago

I think your GPU does not have enough memory (is it 256 MB?). I've reduced the vector sizes in the benchmark in b9d53bc. Does it work now? Also, could you please run make test inside the build directory?
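
The limits reported by devlist earlier in the thread (CL_DEVICE_GLOBAL_MEM_SIZE = 268435456, CL_DEVICE_MAX_MEM_ALLOC_SIZE = 134217728 for the Radeon) make this easy to check by hand. A rough sketch, assuming 8-byte doubles and the four vectors (a, b, c, d) used in the vector arithmetic test; the function names are hypothetical and the element counts are illustrative, not the benchmark's actual sizes:

```cpp
#include <cassert>
#include <cstdint>

// Largest number of elements of elem_size bytes that fit in a single buffer.
constexpr std::uint64_t max_elements(std::uint64_t max_alloc_bytes,
                                     std::uint64_t elem_size) {
    return max_alloc_bytes / elem_size;
}

// True if `count` buffers of n elements each respect both the
// per-allocation limit and the total global memory size.
bool fits(std::uint64_t n, std::uint64_t count, std::uint64_t elem_size,
          std::uint64_t max_alloc_bytes, std::uint64_t global_bytes) {
    assert(elem_size > 0);
    const std::uint64_t per_buffer = n * elem_size;
    return per_buffer <= max_alloc_bytes && per_buffer * count <= global_bytes;
}
```

With these numbers a single buffer holds at most 16777216 doubles, and four vectors of that size would need 512 MB, twice the reported global memory. In practice the usable amount is likely lower still, since not all of the reported memory is necessarily available to one application.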

Quanteek commented 11 years ago

Output of make test (coming from LastTest.log):

Start testing: May 02 14:03 CEST
----------------------------------------------------------
1/11 Testing: vector_create
1/11 Test: vector_create
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/vector_create"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"vector_create" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496200
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 12 test cases...

*** No errors detected
<end of output>
Test time =   0.16 sec
----------------------------------------------------------
Test Passed.
"vector_create" end time: May 02 14:03 CEST
"vector_create" time elapsed: 00:00:00
----------------------------------------------------------

2/11 Testing: vector_copy
2/11 Test: vector_copy
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/vector_copy"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"vector_copy" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496200
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 7 test cases...

*** No errors detected
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Passed.
"vector_copy" end time: May 02 14:03 CEST
"vector_copy" time elapsed: 00:00:00
----------------------------------------------------------

3/11 Testing: vector_arithmetics
3/11 Test: vector_arithmetics
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/vector_arithmetics"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"vector_arithmetics" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496201
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 13 test cases...

*** No errors detected
<end of output>
Test time =   1.12 sec
----------------------------------------------------------
Test Passed.
"vector_arithmetics" end time: May 02 14:03 CEST
"vector_arithmetics" time elapsed: 00:00:01
----------------------------------------------------------

4/11 Testing: multivector_create
4/11 Test: multivector_create
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/multivector_create"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"multivector_create" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496202
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 4 test cases...

*** No errors detected
<end of output>
Test time =   0.08 sec
----------------------------------------------------------
Test Passed.
"multivector_create" end time: May 02 14:03 CEST
"multivector_create" time elapsed: 00:00:00
----------------------------------------------------------

5/11 Testing: multivector_arithmetics
5/11 Test: multivector_arithmetics
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/multivector_arithmetics"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"multivector_arithmetics" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496202
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 8 test cases...

*** No errors detected
<end of output>
Test time =   0.46 sec
----------------------------------------------------------
Test Passed.
"multivector_arithmetics" end time: May 02 14:03 CEST
"multivector_arithmetics" time elapsed: 00:00:00
----------------------------------------------------------

6/11 Testing: spmv
6/11 Test: spmv
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/spmv"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"spmv" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496202
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 8 test cases...

*** No errors detected
<end of output>
Test time =   0.46 sec
----------------------------------------------------------
Test Passed.
"spmv" end time: May 02 14:03 CEST
"spmv" time elapsed: 00:00:00
----------------------------------------------------------

7/11 Testing: stencil
7/11 Test: stencil
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/stencil"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"stencil" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496203
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 7 test cases...

*** No errors detected
<end of output>
Test time =   0.27 sec
----------------------------------------------------------
Test Passed.
"stencil" end time: May 02 14:03 CEST
"stencil" time elapsed: 00:00:00
----------------------------------------------------------

8/11 Testing: generator
8/11 Test: generator
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/generator"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"generator" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496203
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 3 test cases...

*** No errors detected
<end of output>
Test time =   0.19 sec
----------------------------------------------------------
Test Passed.
"generator" end time: May 02 14:03 CEST
"generator" time elapsed: 00:00:00
----------------------------------------------------------

9/11 Testing: random
9/11 Test: random
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/random"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"random" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496203
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 2 test cases...

*** No errors detected
<end of output>
Test time =   0.93 sec
----------------------------------------------------------
Test Passed.
"random" end time: May 02 14:03 CEST
"random" time elapsed: 00:00:00
----------------------------------------------------------

10/11 Testing: multiple_objects
10/11 Test: multiple_objects
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/multiple_objects"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"multiple_objects" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
Running 1 test case...
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
2. ATI Radeon HD 6490M

*** No errors detected
<end of output>
Test time =   0.05 sec
----------------------------------------------------------
Test Passed.
"multiple_objects" end time: May 02 14:03 CEST
"multiple_objects" time elapsed: 00:00:00
----------------------------------------------------------

11/11 Testing: fft
11/11 Test: fft
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/fft"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"fft" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496204
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Running 3 test cases...

*** No errors detected
<end of output>
Test time =   1.03 sec
----------------------------------------------------------
Test Passed.
"fft" end time: May 02 14:03 CEST
"fft" time elapsed: 00:00:01
----------------------------------------------------------

End testing: May 02 14:03 CEST

Output of ./examples/benchmark on b9d53bc:

1. ATI Radeon HD 6490M

Vector arithmetic
  OpenCL
    GFLOPS:    2.27259
    Bandwidth: 30.3011
  C++
    GFLOPS:    1.31794
    Bandwidth: 17.5725
  res = -5.78533e+307

Reduction
  OpenCL
    GFLOPS:    1.28116
    Bandwidth: 10.2493
  C++
    GFLOPS:    1.67311
    Bandwidth: 13.3849
  res = inf

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
typedef double real;
real read_x(
    long g_id,
    ulong n,
    char has_left, char has_right,
    int lhalo, int rhalo,
    global const real *xloc,
    global const real *xrem
    )
{
    if (g_id >= 0 && g_id < n) {
        return xloc[g_id];
    } else if (g_id < 0) {
        if (has_left)
            return (lhalo + g_id >= 0) ? xrem[lhalo + g_id] : 0;
        else
            return xloc[0];
    } else {
        if (has_right)
            return (g_id < n + rhalo) ? xrem[lhalo + g_id - n] : 0;
        else
            return xloc[n - 1];
    }
}
kernel void slow_conv(
    ulong n,
    char has_left,
    char has_right,
    int lhalo, int rhalo,
    global const real *s,
    global const real *xloc,
    global const real *xrem,
    global real *y,
    real alpha, real beta,
    local real *loc_s,
    local real *loc_x
    )
{
    size_t grid_size = get_global_size(0);
    for(long g_id = get_global_id(0); g_id < n; g_id += grid_size) {
        real sum = 0;
        for(int j = -lhalo; j <= rhalo; j++)
            sum += s[lhalo + j] * read_x(g_id + j, n, has_left, has_right, lhalo, rhalo, xloc, xrem);
        if (alpha)
            y[g_id] = alpha * y[g_id] + beta * sum;
        else
            y[g_id] = beta * sum;
    }
}
kernel void fast_conv(
    ulong n,
    char has_left,
    char has_right,
    int lhalo, int rhalo,
    global const real *s,
    global const real *xloc,
    global const real *xrem,
    global real *y,
    real alpha, real beta,
    local real *S,
    local real *X
    )
{
    size_t grid_size = get_global_size(0);
    int l_id       = get_local_id(0);
    int block_size = get_local_size(0);
    async_work_group_copy(S, s, lhalo + rhalo + 1, 0);
    for(long g_id = get_global_id(0), pos = 0; pos < n; g_id += grid_size, pos += grid_size) {
        for(int i = l_id, j = g_id - lhalo; i < block_size + lhalo + rhalo; i += block_size, j += block_size)
            X[i] = read_x(j, n, has_left, has_right, lhalo, rhalo, xloc, xrem);
        barrier(CLK_LOCAL_MEM_FENCE);
        if (g_id < n) {
            real sum = 0;
            for(int j = -lhalo; j <= rhalo; j++)
                sum += S[lhalo + j] * X[lhalo + l_id + j];
            if (alpha)
                y[g_id] = alpha * y[g_id] + beta * sum;
            else
                y[g_id] = beta * sum;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Undeclared function '_Z21async_work_group_copyPU3AS3dPKU3AS1djj' called by function 'fast_conv'

clBuildProgram(Build program failure)

I changed benchmark.cpp this way:

--- a/examples/benchmark.cpp
+++ b/examples/benchmark.cpp
@@ -482,7 +482,8 @@ std::pair<double,double> benchmark_spmv_ccsr(
 //---------------------------------------------------------------------------
 int main() {
     try {
-        vex::Context ctx(Filter::DoublePrecision && Filter::Env);
+        //vex::Context ctx(Filter::DoublePrecision && Filter::Env);
+        vex::Context ctx(Filter::Name("ATI Radeon HD 6490M"));

         if (!ctx.size()) {
             std::cerr << "No compute devices found" << std::endl;

ddemidov commented 11 years ago

Regarding your change of benchmark.cpp: you may use environment variables to control what device gets selected (vex::Filter::Env handles these):

$ OCL_DEVICE=Intel ./benchmark
...
$ OCL_DEVICE=6490M ./benchmark
...
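
The two invocations above suggest that the variable is matched against part of the device name. A minimal sketch of that idea, assuming plain substring matching and treating an unset variable as "no restriction"; name_matches and selected_by_env are hypothetical helpers for illustration, not VexCL's implementation:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// True if the device name contains the pattern.
// A null pattern (variable not set) accepts every device.
bool name_matches(const std::string& device_name, const char* pattern) {
    if (!pattern) return true;
    return device_name.find(pattern) != std::string::npos;
}

// Applies the OCL_DEVICE environment variable to a device name.
bool selected_by_env(const std::string& device_name) {
    return name_matches(device_name, std::getenv("OCL_DEVICE"));
}
```

Under this assumption, OCL_DEVICE=6490M would accept "ATI Radeon HD 6490M" and reject the Intel CPU, without editing benchmark.cpp.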

Regarding the failure: this looks like a hardware issue to me. async_work_group_copy is a standard OpenCL function that is defined in the OpenCL standard v1.0.

Also I don't know what happens in the first two tests of the benchmark, especially since all the tests in make test seem to have passed. Have you had any success running some nontrivial OpenCL code on this GPU before?

ddemidov commented 11 years ago

Oh, you've run the tests for the Intel device. What does

$ OCL_DEVICE=Radeon make test

give you?

ddemidov commented 11 years ago

Does the GPU support double precision? You have commented out the DoublePrecision filter. And all of the tests pick only Intel except for the multiple_objects test, which does not filter devices by double precision support.

Quanteek commented 11 years ago

With this patch:

diff --git a/examples/devlist.cpp b/examples/devlist.cpp
index b0c6690..739ac89 100644
--- a/examples/devlist.cpp
+++ b/examples/devlist.cpp
@@ -14,6 +14,7 @@ int main() {
                   << "    CL_DEVICE_LOCAL_MEM_SIZE      = " << d->getInfo<CL_DEVICE_LOCAL_MEM_SIZE>() << std::endl
                   << "    CL_DEVICE_MAX_MEM_ALLOC_SIZE  = " << d->getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>() << std::endl
                   << "    CL_DEVICE_MAX_CLOCK_FREQUENCY = " << d->getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>() << std::endl
+                  << "    CL_DEVICE_EXTENSIONS = " << d->getInfo<CL_DEVICE_EXTENSIONS>() << std::endl
                   << std::endl;
     }
 }

I have:

OpenCL devices:

  Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
    CL_PLATFORM_NAME              = Apple
    CL_DEVICE_MAX_COMPUTE_UNITS   = 8
    CL_DEVICE_HOST_UNIFIED_MEMORY = 1
    CL_DEVICE_GLOBAL_MEM_SIZE     = 8589934592
    CL_DEVICE_LOCAL_MEM_SIZE      = 32768
    CL_DEVICE_MAX_MEM_ALLOC_SIZE  = 2147483648
    CL_DEVICE_MAX_CLOCK_FREQUENCY = 2000
    CL_DEVICE_EXTENSIONS = cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_APPLE_fp64_basic_ops cl_APPLE_fixed_alpha_channel_orders 

  ATI Radeon HD 6490M
    CL_PLATFORM_NAME              = Apple
    CL_DEVICE_MAX_COMPUTE_UNITS   = 2
    CL_DEVICE_HOST_UNIFIED_MEMORY = 0
    CL_DEVICE_GLOBAL_MEM_SIZE     = 268435456
    CL_DEVICE_LOCAL_MEM_SIZE      = 32768
    CL_DEVICE_MAX_MEM_ALLOC_SIZE  = 134217728
    CL_DEVICE_MAX_CLOCK_FREQUENCY = 750
    CL_DEVICE_EXTENSIONS = cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store

So my device does not support fp64 :/ Problem solved, since it is a hardware limitation. I am going to check on another card.
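
The same check can be done in code on the CL_DEVICE_EXTENSIONS string printed above. A small sketch, assuming the extension list is whitespace-separated as in the output and looking for the two extension names that the generated kernel source tests for (cl_khr_fp64 and cl_amd_fp64); has_extension and supports_fp64 are hypothetical helper names:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if `name` appears as a whole token in the extension list.
bool has_extension(const std::string& extensions, const std::string& name) {
    std::istringstream in(extensions);
    std::string token;
    while (in >> token) {
        if (token == name) return true;
    }
    return false;
}

// True if the device advertises one of the double precision extensions.
bool supports_fp64(const std::string& extensions) {
    return has_extension(extensions, "cl_khr_fp64") ||
           has_extension(extensions, "cl_amd_fp64");
}
```

Whole tokens are compared on purpose: a plain substring search for "fp64" would also accept a device that lists only cl_APPLE_fp64_basic_ops.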

ddemidov commented 11 years ago

Ok, at least the MacOS/gcc combination works now! Thanks for reporting this :)