Closed ds283 closed 11 years ago
This does seem implementation specific: I cannot reproduce the error with the AMD or Intel OpenCL SDKs. The size 5 looks a bit odd to me; there could possibly be some alignment problems with it. Could you try replacing it with something nice like 16 or 1024? If that helps, maybe I should do this automatically in vexcl.
Thanks for this. I'm pretty sure it is an implementation problem.
Changing the size of the multivector doesn't seem to make any difference. However, I have tried the same code with 10.8 and 10.9 under virtualization and the error disappears. I will try to take it up with Apple.
Just to be sure, does the following code produce the same error?
#define VEXCL_SHOW_KERNELS
#include "vexcl/vexcl.hpp"
int main() {
    vex::Context ctx(vex::Filter::Type(CL_DEVICE_TYPE_CPU) && vex::Filter::DoublePrecision);
    vex::vector<double> u2(ctx, 5);
    u2 = 0.0;
}
This is what happens in the background of the original example. It should give the same error if the OpenCL implementation is at fault. I can also provide a test case that launches the same kernel with the plain OpenCL API. That should be easier to reason about with Apple support.
Here is a plain API version of the same program: https://gist.github.com/ddemidov/7353476. It may be compiled with g++ -std=c++0x -o zero zero.cpp -lOpenCL.
Yes, I get the same error from the vex::vector<> case. Thank you very much for providing the plain API version: I will pass this on to Apple and see what happens. It crashes with the exception 'clEnqueueNDRangeKernel -54', which looks like CL_INVALID_WORK_GROUP_SIZE.
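For reading the numeric codes that appear in this thread, a small lookup can help. The sketch below is only an illustration: `cl_error_name` is a hypothetical helper (not part of VexCL or the OpenCL API), and it lists just a handful of status values as they are defined in `CL/cl.h`.

```cpp
#include <string>

// Hypothetical helper: maps a few OpenCL status codes to their names.
// The numeric values are the ones defined in CL/cl.h; the list is
// deliberately incomplete and covers only codes relevant to this thread.
inline std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        case -55: return "CL_INVALID_WORK_ITEM_SIZE";
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}
```

With this, `cl_error_name(-54)` yields `CL_INVALID_WORK_GROUP_SIZE`, and any code that is not listed is reported as unknown together with its number.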
I probably asked for too large a workgroup. Can you try with the updated gist?
I get the same result with 256
Thanks very much for all this.
This version gives the same error. It reports:
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
max workgroup size = 1024
clEnqueueNDRangeKernel -54
That's strange. The latest version tries to get workgroup size for the specific kernel. What does it output?
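As a rough sketch of the host-side arithmetic involved: in OpenCL 1.x, when a local size is passed explicitly, the global size has to be a multiple of it, and the local size must not exceed the kernel-specific limit reported by `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE`. The helper below is hypothetical (`LaunchSize` and `pick_launch_size` are names invented here, not taken from the gist or from VexCL) and takes that limit as a plain parameter, so it can be read without an OpenCL installation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical result type: sizes to hand to clEnqueueNDRangeKernel.
struct LaunchSize {
    std::size_t global;
    std::size_t local;
};

// Clamp the requested local size to the kernel-specific maximum, then
// round the global size up to the next multiple of the local size.
// The kernel itself must then skip indices >= n.
inline LaunchSize pick_launch_size(std::size_t n,
                                   std::size_t requested_local,
                                   std::size_t kernel_max_local) {
    assert(requested_local > 0 && kernel_max_local > 0);
    const std::size_t local  = std::min(requested_local, kernel_max_local);
    const std::size_t global = (n + local - 1) / local * local;
    return {global, local};
}
```

For a vector of 5 elements with a requested local size of 1024 and a kernel limit of 128, this gives a local size of 128 and a global size of 128.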
Curious. This time I get
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
max workgroup size = 128
Bus error: 10
Running it on the GPU gives
GeForce GT 650M
clBuildProgram -11
Does the latest commit compile and run? I've changed the loop inside the kernel.
I also output compile errors in the latest version. What do you get with GT 650M now?
I am sorry about bothering you; I don't have a Mac to test these on.
Another thought: I assumed that you are on a 64-bit system here. Was I correct in this assumption?
You are certainly not bothering me! This is immensely helpful - thank you for your time.
On the CPU I get
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
max workgroup size = 128
0
and on the GPU I get
GeForce GT 650M
max workgroup size = 1024
0
I have also tried this version on a different machine (different CPU and GPU), and it works there too.
It seems that Apple's OpenCL has problems with the way I organize loops inside compute kernels. 27c7d86 tests if this is the case. Does the original example work with this commit?
f556873 is another attempt to resolve the issue (http://wiki.tiker.net/OpenCLOddities). Does it work on a CPU with MacOSX?
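The thread does not spell out what the two commits changed, so the following is only a generic illustration of two common ways to organize the per-work-item loop in a compute kernel, emulated on the host in plain C++. In the strided layout, work item g touches g, g + G, g + 2G, and so on; in the chunked layout, each work item handles one contiguous block. The function names are hypothetical, and neither function is claimed to match the VexCL code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side emulation: count how often each index in [0, n) is visited
// when every work item strides through the range by the global size.
// Assumes global_size > 0.
inline std::vector<int> visit_strided(std::size_t n, std::size_t global_size) {
    std::vector<int> hits(n, 0);
    for (std::size_t gid = 0; gid < global_size; ++gid)
        for (std::size_t i = gid; i < n; i += global_size)
            ++hits[i];
    return hits;
}

// Host-side emulation: each work item processes one contiguous chunk.
// Work items whose chunk starts past the end do nothing.
// Assumes global_size > 0.
inline std::vector<int> visit_chunked(std::size_t n, std::size_t global_size) {
    std::vector<int> hits(n, 0);
    const std::size_t chunk = (n + global_size - 1) / global_size;
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        const std::size_t begin = std::min(n, gid * chunk);
        const std::size_t end   = std::min(n, begin + chunk);
        for (std::size_t i = begin; i < end; ++i)
            ++hits[i];
    }
    return hits;
}
```

Both layouts visit every index exactly once, including when there are more work items than elements (for example n = 5 with 8 work items). They differ in memory access pattern, which is one reason the same kernel can perform differently on CPUs and GPUs.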
Thanks for this. f556873 and 27c7d86 both seem to work on the CPU, but 27c7d86 seems to be marginally faster.
How did you measure it? What size did you test? Did you #define VEXCL_CACHE_KERNELS to factor out kernel compile time?
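Independently of kernel caching, one simple way to keep one-time costs out of a measurement is to run the workload once untimed before starting the clock. The sketch below assumes nothing about VexCL; `time_after_warmup` is a hypothetical helper built only on `std::chrono`.

```cpp
#include <chrono>

// Hypothetical helper: returns the average wall time in seconds of one
// call to f, measured over `repeats` calls. A single untimed warm-up
// call runs first, so one-time setup costs (such as compiling a kernel
// on first use) do not end up in the reported figure.
template <class F>
double time_after_warmup(F&& f, int repeats = 3) {
    f();  // warm-up, not timed
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

Usage would look like `double t = time_after_warmup([&] { run_ensemble(); });`, where `run_ensemble` stands for whatever code is being benchmarked. For asynchronous device work, the callable should also wait for the queue to finish, otherwise only the enqueue time is measured.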
f556873 is much faster on my CPU (Intel Q9400) than 27c7d86: it's 5 sec vs 28 sec. The test I used is the Lorenz attractor parameter study with an ensemble of size 1024.
Also, f556873 is the more correct thing to do.
My testing was not very systematic. I was using my own code, with boost auto_cpu_timer to obtain timings. For that, with VEXCL_CACHE_KERNELS defined, I find:
27c7d86
1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
1187.641217s wall, 1913.160000s user + 2796.100000s system = 4709.260000s CPU (396.5%)
f556873
1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
1197.468915s wall, 1974.720000s user + 2758.160000s system = 4732.880000s CPU (395.2%)
For the Lorenz ensemble, I find:
27c7d86
1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
0.984591s wall, 3.040000s user + 2.550000s system = 5.590000s CPU (567.7%)
f556873
1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
1.372676s wall, 3.470000s user + 3.240000s system = 6.710000s CPU (488.8%)
Out of interest, what was the problem size in your code (the size of the vectors/multivectors you used)?
It is a 20-equation system of ODEs, and uses a vex::multivector<double, 20> to generate an ensemble of these for different parameter choices. These runs were for 5 different parameters (that's where the weird length 5 came from).
Is this a typical ensemble size for your problems? I think you would get much higher performance for sizes this small if you just used plain CPU code. I am afraid that right now you spend most of your time in OpenCL API calls, and not doing anything useful. Switching from VexCL to e.g. Eigen should be easy enough, because Eigen uses expression templates as well, and it should give you a noticeable performance boost.
Eventually it will need to run for an ensemble of at least 1E5 or 1E6 parameters. At the moment I am using smaller sizes while I debug the implementation.
At the moment it is much faster on the CPU - using OpenMP directly, it takes just O(second). We're developing multiple versions to help decide what the best computational strategy will be with more realistic ensemble sizes.
Ok, that makes sense.
Well, the master branch should work with MacOSX now. Thank you very much for reporting the issue, and for your help with testing.
While trying to debug some VexCL code on the CPU, I notice that writing to a vex::multivector<> generates an exception, EXC_BAD_ACCESS (code=2).
For example, the snippet
generates a suitable-looking kernel, but crashes with an exception when the write to u2 is performed. But running it on a GPU device works as expected. This is on OS X 10.9.
Because it works on the GPU this has the appearance of an implementation bug with the OpenCL shipped with 10.9. Unfortunately I don't have access to a different one which would enable me to check. Is there any way to debug such issues, or is the only realistic approach to report it to Apple and hope for the best?