Closed Quanteek closed 11 years ago
Hello,
I cannot reproduce the issue on Linux/gcc-4.7.2/Boost-1.53, but I think it comes from one of the recent commits. Could you please check if version 19acba25 works for you?
Not working either :/
I went back to 1f527894479beab45 (8th April) and it is still not working. Maybe a problem with my libraries?
On 1f527894479beab45, according to the backtrace, there is a problem at this line: COMPOUND_ASSIGNMENT(+=, +);
On the master HEAD version (56e46f2c3687cb), here is the backtrace:
#0 0x00000001003864a8 in __cxa_throw ()
#1 0x0000000100013d29 in vex::type_name<unsigned long> () at util.hpp:67
#2 0x000000010001f054 in vex::vector<double>::assign_expression<vex::assign::ADD, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::plus, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector<double>&>, 2l> > >, 2l> > > (this=0x7fff5fbfecf0, expr=@0x7fff5fbff030) at vector.hpp:777
#3 0x0000000100014d3d in vex::vector<double>::operator+=<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::plus, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector<double>&, vex::vector<double>&>, 2l> > >, 2l> > > (this=0x7fff5fbfecf0, expr=@0x7fff5fbff030) at vector.hpp:552
#4 0x00000001000052b0 in benchmark_vector (queue=@0x7fff5fbff248, prof=@0x7fff5fbff260) at /Users/benoitbayol/Code/vexcl/examples/benchmark.cpp:51
#5 0x00000001000093b4 in main () at /Users/benoitbayol/Code/vexcl/examples/benchmark.cpp:502
Description:
g++-mp-4.7 (MacPorts gcc47 4.7.3_0) 4.7.3
boost 1.53.0
OpenCL of Snow Leopard is in /System/Library/Framework/OpenCL.framework/
./examples/devlist gives:
OpenCL devices:
Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
CL_PLATFORM_NAME = Apple
CL_DEVICE_MAX_COMPUTE_UNITS = 8
CL_DEVICE_HOST_UNIFIED_MEMORY = 1
CL_DEVICE_GLOBAL_MEM_SIZE = 8589934592
CL_DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 2147483648
CL_DEVICE_MAX_CLOCK_FREQUENCY = 2000
ATI Radeon HD 6490M
CL_PLATFORM_NAME = Apple
CL_DEVICE_MAX_COMPUTE_UNITS = 2
CL_DEVICE_HOST_UNIFIED_MEMORY = 0
CL_DEVICE_GLOBAL_MEM_SIZE = 268435456
CL_DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 134217728
CL_DEVICE_MAX_CLOCK_FREQUENCY = 750
Could you wrap the output in triple "`" to get a verbatim block of text? Without it, some angle brackets are lost. I am especially interested in what type is shown here:
#1 0x0000000100013d29 in vex::type_name<???> () at util.hpp:67
Done !
See previous comment :)
This seems to be the same issue as in #14. Can you please check if 89e785d works for you?
It is working with 89e785d
Output on CPU is :
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Vector arithmetic
OpenCL
GFLOPS: 1.15543
Bandwidth: 15.4058
C++
GFLOPS: 1.17702
Bandwidth: 15.6935
res = 0
Reduction
OpenCL
GFLOPS: 2.22721
Bandwidth: 17.8177
C++
GFLOPS: 1.47788
Bandwidth: 11.8231
res = 4.76837e-07
Stencil convolution
OpenCL
GFLOPS: 2.43398
Bandwidth: 19.4718
C++
GFLOPS: 1.26781
Bandwidth: 10.1425
res = 0
SpMV
OpenCL
GFLOPS: 0.753692
Bandwidth: 10.0877
C++
GFLOPS: 1.08501
Bandwidth: 14.5223
res = 0
SpMV (CCSR)
OpenCL
GFLOPS: 1.03288
Bandwidth: 11.9018
C++
GFLOPS: 1.09799
Bandwidth: 12.652
res = 0
[Profile: 192.302 sec.] (100.00%)
[ Vector arithmetic: 6.057 sec.] ( 3.15%)
[ self: 0.533 sec.] ( 0.28%)
[ OpenCL: 2.788 sec.] ( 1.45%)
[ C++: 2.737 sec.] ( 1.42%)
[ Reduction: 3.214 sec.] ( 1.67%)
[ self: 0.797 sec.] ( 0.41%)
[ OpenCL: 0.964 sec.] ( 0.50%)
[ C++: 1.453 sec.] ( 0.76%)
[ Stencil: 54.327 sec.] ( 28.25%)
[ self: 0.228 sec.] ( 0.12%)
[ OpenCL: 18.528 sec.] ( 9.63%)
[ C++: 35.571 sec.] ( 18.50%)
[ SpMV: 70.266 sec.] ( 36.54%)
[ self: 0.512 sec.] ( 0.27%)
[ OpenCL: 41.161 sec.] ( 21.40%)
[ C++: 28.592 sec.] ( 14.87%)
[ SpMV (CCSR): 58.438 sec.] ( 30.39%)
[ OpenCL: 30.035 sec.] ( 15.62%)
[ C++: 28.254 sec.] ( 14.69%)
But for the graphics card I get:
1. ATI Radeon HD 6490M
Vector arithmetic
OpenCL
GFLOPS: 2.39215
Bandwidth: 31.8953
C++
GFLOPS: 1.21317
Bandwidth: 16.1756
res = nan
clEnqueueNDRangeKernel(Mem object allocation failure)
but it is not the same problem :)
I think your GPU does not have enough memory (is it 256 MB?). I've reduced the vector sizes in the benchmark in b9d53bc. Does it work now? Also, could you please run make test inside the build directory?
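As a rough check of the memory estimate above, here is a minimal sketch that bounds how many elements each vector can hold from the two limits printed by devlist. The helper name max_elements_per_vector is hypothetical and not part of VexCL, and the assumption of four resident vectors is taken from the benchmark expression a += b + c * d. It ignores driver overhead and any other allocations, so the real limit is lower.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of VexCL): upper bound on the element count
// per vector when `n_vectors` equally sized buffers of `elem_size`-byte
// elements must reside on the device at the same time.
inline std::uint64_t max_elements_per_vector(
        std::uint64_t global_mem_size,     // CL_DEVICE_GLOBAL_MEM_SIZE, bytes
        std::uint64_t max_mem_alloc_size,  // CL_DEVICE_MAX_MEM_ALLOC_SIZE, bytes
        std::uint64_t n_vectors,
        std::uint64_t elem_size)
{
    assert(n_vectors > 0 && elem_size > 0);

    // No single buffer may exceed the per-allocation limit.
    std::uint64_t by_alloc = max_mem_alloc_size / elem_size;

    // All buffers together must fit into global memory.
    std::uint64_t by_total = global_mem_size / (n_vectors * elem_size);

    return std::min(by_alloc, by_total);
}
```

With the values reported for the Radeon HD 6490M (268435456 bytes of global memory, 134217728 bytes per allocation), four vectors of 8-byte elements are limited by total memory rather than by the per-allocation cap.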
Output of make test (coming from LastTest.log) :
Start testing: May 02 14:03 CEST
----------------------------------------------------------
1/11 Testing: vector_create
1/11 Test: vector_create
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/vector_create"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"vector_create" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496200
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 12 test cases...
*** No errors detected
<end of output>
Test time = 0.16 sec
----------------------------------------------------------
Test Passed.
"vector_create" end time: May 02 14:03 CEST
"vector_create" time elapsed: 00:00:00
----------------------------------------------------------
2/11 Testing: vector_copy
2/11 Test: vector_copy
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/vector_copy"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"vector_copy" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496200
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 7 test cases...
*** No errors detected
<end of output>
Test time = 0.14 sec
----------------------------------------------------------
Test Passed.
"vector_copy" end time: May 02 14:03 CEST
"vector_copy" time elapsed: 00:00:00
----------------------------------------------------------
3/11 Testing: vector_arithmetics
3/11 Test: vector_arithmetics
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/vector_arithmetics"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"vector_arithmetics" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496201
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 13 test cases...
*** No errors detected
<end of output>
Test time = 1.12 sec
----------------------------------------------------------
Test Passed.
"vector_arithmetics" end time: May 02 14:03 CEST
"vector_arithmetics" time elapsed: 00:00:01
----------------------------------------------------------
4/11 Testing: multivector_create
4/11 Test: multivector_create
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/multivector_create"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"multivector_create" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496202
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 4 test cases...
*** No errors detected
<end of output>
Test time = 0.08 sec
----------------------------------------------------------
Test Passed.
"multivector_create" end time: May 02 14:03 CEST
"multivector_create" time elapsed: 00:00:00
----------------------------------------------------------
5/11 Testing: multivector_arithmetics
5/11 Test: multivector_arithmetics
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/multivector_arithmetics"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"multivector_arithmetics" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496202
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 8 test cases...
*** No errors detected
<end of output>
Test time = 0.46 sec
----------------------------------------------------------
Test Passed.
"multivector_arithmetics" end time: May 02 14:03 CEST
"multivector_arithmetics" time elapsed: 00:00:00
----------------------------------------------------------
6/11 Testing: spmv
6/11 Test: spmv
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/spmv"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"spmv" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496202
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 8 test cases...
*** No errors detected
<end of output>
Test time = 0.46 sec
----------------------------------------------------------
Test Passed.
"spmv" end time: May 02 14:03 CEST
"spmv" time elapsed: 00:00:00
----------------------------------------------------------
7/11 Testing: stencil
7/11 Test: stencil
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/stencil"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"stencil" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496203
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 7 test cases...
*** No errors detected
<end of output>
Test time = 0.27 sec
----------------------------------------------------------
Test Passed.
"stencil" end time: May 02 14:03 CEST
"stencil" time elapsed: 00:00:00
----------------------------------------------------------
8/11 Testing: generator
8/11 Test: generator
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/generator"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"generator" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496203
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 3 test cases...
*** No errors detected
<end of output>
Test time = 0.19 sec
----------------------------------------------------------
Test Passed.
"generator" end time: May 02 14:03 CEST
"generator" time elapsed: 00:00:00
----------------------------------------------------------
9/11 Testing: random
9/11 Test: random
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/random"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"random" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496203
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 2 test cases...
*** No errors detected
<end of output>
Test time = 0.93 sec
----------------------------------------------------------
Test Passed.
"random" end time: May 02 14:03 CEST
"random" time elapsed: 00:00:00
----------------------------------------------------------
10/11 Testing: multiple_objects
10/11 Test: multiple_objects
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/multiple_objects"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"multiple_objects" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
Running 1 test case...
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
2. ATI Radeon HD 6490M
*** No errors detected
<end of output>
Test time = 0.05 sec
----------------------------------------------------------
Test Passed.
"multiple_objects" end time: May 02 14:03 CEST
"multiple_objects" time elapsed: 00:00:00
----------------------------------------------------------
11/11 Testing: fft
11/11 Test: fft
Command: "/Users/benoitbayol/Code/vexcl/pr/tests/fft"
Directory: /Users/benoitbayol/Code/vexcl/pr/tests
"fft" start time: May 02 14:03 CEST
Output:
----------------------------------------------------------
seed: 1367496204
1. Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Running 3 test cases...
*** No errors detected
<end of output>
Test time = 1.03 sec
----------------------------------------------------------
Test Passed.
"fft" end time: May 02 14:03 CEST
"fft" time elapsed: 00:00:01
----------------------------------------------------------
End testing: May 02 14:03 CEST
Output of ./examples/benchmark on b9d53bc :
1. ATI Radeon HD 6490M
Vector arithmetic
OpenCL
GFLOPS: 2.27259
Bandwidth: 30.3011
C++
GFLOPS: 1.31794
Bandwidth: 17.5725
res = -5.78533e+307
Reduction
OpenCL
GFLOPS: 1.28116
Bandwidth: 10.2493
C++
GFLOPS: 1.67311
Bandwidth: 13.3849
res = inf
#if defined(cl_khr_fp64)
# pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
# pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
typedef double real;
real read_x(
long g_id,
ulong n,
char has_left, char has_right,
int lhalo, int rhalo,
global const real *xloc,
global const real *xrem
)
{
if (g_id >= 0 && g_id < n) {
return xloc[g_id];
} else if (g_id < 0) {
if (has_left)
return (lhalo + g_id >= 0) ? xrem[lhalo + g_id] : 0;
else
return xloc[0];
} else {
if (has_right)
return (g_id < n + rhalo) ? xrem[lhalo + g_id - n] : 0;
else
return xloc[n - 1];
}
}
kernel void slow_conv(
ulong n,
char has_left,
char has_right,
int lhalo, int rhalo,
global const real *s,
global const real *xloc,
global const real *xrem,
global real *y,
real alpha, real beta,
local real *loc_s,
local real *loc_x
)
{
size_t grid_size = get_global_size(0);
for(long g_id = get_global_id(0); g_id < n; g_id += grid_size) {
real sum = 0;
for(int j = -lhalo; j <= rhalo; j++)
sum += s[lhalo + j] * read_x(g_id + j, n, has_left, has_right, lhalo, rhalo, xloc, xrem);
if (alpha)
y[g_id] = alpha * y[g_id] + beta * sum;
else
y[g_id] = beta * sum;
}
}
kernel void fast_conv(
ulong n,
char has_left,
char has_right,
int lhalo, int rhalo,
global const real *s,
global const real *xloc,
global const real *xrem,
global real *y,
real alpha, real beta,
local real *S,
local real *X
)
{
size_t grid_size = get_global_size(0);
int l_id = get_local_id(0);
int block_size = get_local_size(0);
async_work_group_copy(S, s, lhalo + rhalo + 1, 0);
for(long g_id = get_global_id(0), pos = 0; pos < n; g_id += grid_size, pos += grid_size) {
for(int i = l_id, j = g_id - lhalo; i < block_size + lhalo + rhalo; i += block_size, j += block_size)
X[i] = read_x(j, n, has_left, has_right, lhalo, rhalo, xloc, xrem);
barrier(CLK_LOCAL_MEM_FENCE);
if (g_id < n) {
real sum = 0;
for(int j = -lhalo; j <= rhalo; j++)
sum += S[lhalo + j] * X[lhalo + l_id + j];
if (alpha)
y[g_id] = alpha * y[g_id] + beta * sum;
else
y[g_id] = beta * sum;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
}
Undeclared function '_Z21async_work_group_copyPU3AS3dPKU3AS1djj' called by function 'fast_conv'
clBuildProgram(Build program failure)
I changed benchmark.cpp this way :
--- a/examples/benchmark.cpp
+++ b/examples/benchmark.cpp
@@ -482,7 +482,8 @@ std::pair<double,double> benchmark_spmv_ccsr(
//---------------------------------------------------------------------------
int main() {
try {
- vex::Context ctx(Filter::DoublePrecision && Filter::Env);
+ //vex::Context ctx(Filter::DoublePrecision && Filter::Env);
+ vex::Context ctx(Filter::Name("ATI Radeon HD 6490M"));
if (!ctx.size()) {
std::cerr << "No compute devices found" << std::endl;
Regarding your change of benchmark.cpp: you may use environment variables to control what device gets selected (vex::Filter::Env handles these):
$ OCL_DEVICE=Intel ./benchmark
...
$ OCL_DEVICE=6490M ./benchmark
...
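The two commands above suggest that OCL_DEVICE is matched as a substring of the device name. A minimal sketch of that kind of matching, assuming this behavior, could look as follows. The helpers name_matches and env_accepts are hypothetical illustrations and are not VexCL's actual implementation of vex::Filter::Env.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: accept a device when no pattern is given, or when the
// pattern occurs anywhere in the device name.
inline bool name_matches(const std::string &device_name, const char *pattern) {
    return pattern == nullptr
        || device_name.find(pattern) != std::string::npos;
}

// Hypothetical helper: read the pattern from the OCL_DEVICE environment
// variable; an unset variable accepts every device.
inline bool env_accepts(const std::string &device_name) {
    return name_matches(device_name, std::getenv("OCL_DEVICE"));
}
```

Under this assumption, a short pattern such as 6490M is enough to pick the Radeon without editing benchmark.cpp.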
Regarding the failure: this looks like a hardware issue to me. async_work_group_copy is a standard OpenCL function that is defined in OpenCL standard v1.0.
Also, I don't know what happens in the first two tests of the benchmark, especially since all the tests in make test seem to have passed. Have you had any success running nontrivial OpenCL code on this GPU before?
Oh, you've run the tests for the Intel device. What does
$ OCL_DEVICE=Radeon make test
give you?
Does the GPU support double precision? You have commented out the DoublePrecision filter. And all of the tests pick only Intel except for the multiple_objects test, which does not filter devices by double precision support.
With this patch :
diff --git a/examples/devlist.cpp b/examples/devlist.cpp
index b0c6690..739ac89 100644
--- a/examples/devlist.cpp
+++ b/examples/devlist.cpp
@@ -14,6 +14,7 @@ int main() {
<< " CL_DEVICE_LOCAL_MEM_SIZE = " << d->getInfo<CL_DEVICE_LOCAL_MEM_SIZE>() << std::endl
<< " CL_DEVICE_MAX_MEM_ALLOC_SIZE = " << d->getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>() << std::endl
<< " CL_DEVICE_MAX_CLOCK_FREQUENCY = " << d->getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>() << std::endl
+ << " CL_DEVICE_EXTENSIONS = " << d->getInfo<CL_DEVICE_EXTENSIONS>() << std::endl
<< std::endl;
}
}
I have :
OpenCL devices:
Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
CL_PLATFORM_NAME = Apple
CL_DEVICE_MAX_COMPUTE_UNITS = 8
CL_DEVICE_HOST_UNIFIED_MEMORY = 1
CL_DEVICE_GLOBAL_MEM_SIZE = 8589934592
CL_DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 2147483648
CL_DEVICE_MAX_CLOCK_FREQUENCY = 2000
CL_DEVICE_EXTENSIONS = cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_APPLE_fp64_basic_ops cl_APPLE_fixed_alpha_channel_orders
ATI Radeon HD 6490M
CL_PLATFORM_NAME = Apple
CL_DEVICE_MAX_COMPUTE_UNITS = 2
CL_DEVICE_HOST_UNIFIED_MEMORY = 0
CL_DEVICE_GLOBAL_MEM_SIZE = 268435456
CL_DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 134217728
CL_DEVICE_MAX_CLOCK_FREQUENCY = 750
CL_DEVICE_EXTENSIONS = cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store
So my device does not support fp64 :/ Problem solved, since it is a hardware limitation. I am going to check on another card.
OK, at least the MacOS/gcc combination works now! Thanks for reporting this :)
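The comparison of the two extension lists can be sketched as a small check on the CL_DEVICE_EXTENSIONS string. This is only an illustration: has_extension and supports_double are hypothetical helpers, not VexCL functions, and the two extension names are the ones tested by the pragmas in the kernel source above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: true if `name` appears as a whole token in the
// space-separated extension list reported by CL_DEVICE_EXTENSIONS.
inline bool has_extension(const std::string &extensions,
                          const std::string &name)
{
    std::istringstream in(extensions);
    std::string token;
    while (in >> token)
        if (token == name) return true;
    return false;
}

// Hypothetical helper: a device is treated as double-capable when it
// advertises either fp64 extension named in the kernel source.
inline bool supports_double(const std::string &extensions) {
    return has_extension(extensions, "cl_khr_fp64")
        || has_extension(extensions, "cl_amd_fp64");
}
```

Matching whole tokens rather than substrings keeps the check from depending on how other extension names happen to be spelled.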
Hello
I am trying to run VexCL on a MacBook Pro with Snow Leopard.
I compiled it with g++-4.7 and boost 1.53.
I can run examples/devlist successfully, but examples/benchmark crashes immediately on the line a += b + c * d; of benchmark_vector() with an exception:
terminate called after throwing an instance of 'std::logic_error'
what(): Trying to use an undefined type in a kernel.
Abort trap: 6
Do you have any idea why?
All other examples crash the same way.
Thanks :)