ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License

writing to vex::multivector<> on CPU raises exception EXC_BAD_ACCESS on OS X 10.9 #68

Closed ds283 closed 11 years ago

ds283 commented 11 years ago

While trying to debug some VexCL code on the CPU, I notice that writing to a vex::multivector<> generates an exception, EXC_BAD_ACCESS (code=2).

For example, the snippet

#define VEXCL_SHOW_KERNELS
#include "vexcl/vexcl.hpp"

int main() {
   vex::Context ctx(vex::Filter::Type(CL_DEVICE_TYPE_CPU) && vex::Filter::DoublePrecision);

   vex::multivector<double, 16> u2(ctx.queue(), 5);   
   u2(0) = 0.0;
}

generates a suitable-looking kernel, but crashes with an exception when the write to u2 is performed. But running it on a GPU device works as expected. This is on OS X 10.9.

Because it works on the GPU this has the appearance of an implementation bug with the OpenCL shipped with 10.9. Unfortunately I don't have access to a different one which would enable me to check. Is there any way to debug such issues, or is the only realistic approach to report it to Apple and hope for the best?

ddemidov commented 11 years ago

This does seem implementation-specific: I cannot reproduce the error with the AMD or Intel OpenCL SDKs. The size 5 looks a bit odd to me; there could be alignment problems with it. Could you try replacing it with something nice like 16 or 1024? If that helps, maybe I should do this automatically in VexCL.

ds283 commented 11 years ago

Thanks for this. I'm pretty sure it is an implementation problem.

Changing the size of the multivector doesn't seem to make any difference. However, I have tried the same code with 10.8 and 10.9 under virtualization and the error disappears. I will try to take it up with Apple.

ddemidov commented 11 years ago

Just to be sure, does the following code produce the same error?

#define VEXCL_SHOW_KERNELS
#include "vexcl/vexcl.hpp"

int main() {
   vex::Context ctx(vex::Filter::Type(CL_DEVICE_TYPE_CPU) && vex::Filter::DoublePrecision);

   vex::vector<double> u2(ctx, 5);   
   u2 = 0.0;
}

This is what happens in the background of the original example. It should give the same error if the OpenCL implementation is at fault. I can also provide a test case that launches the same kernel with the plain OpenCL API. That should be easier to discuss with Apple support.

ddemidov commented 11 years ago

Here is a plain API version of the same program: https://gist.github.com/ddemidov/7353476. It can be compiled with g++ -std=c++0x -o zero zero.cpp -lOpenCL.

ds283 commented 11 years ago

Yes, I get the same error from the vex::vector<> case. Thank you very much for providing the plain API version: I will pass this on to Apple and see what happens. It crashes with the exception 'clEnqueueNDRangeKernel -54', which looks like CL_INVALID_WORK_GROUP_SIZE.
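
For reference while reading the outputs in this thread, here is a minimal sketch of a lookup for the numeric status codes that appear in it. The numeric values are the ones defined in the OpenCL header `CL/cl.h`; the helper names `cl_error_name` and `describe` are hypothetical and not part of VexCL or the gist.

```cpp
#include <cassert>
#include <string>

// Subset of OpenCL status codes (values as defined in CL/cl.h).
// Only the codes that show up in this thread are listed; everything else
// is reported as "unknown".
inline std::string cl_error_name(int code) {
    switch (code) {
        case   0: return "CL_SUCCESS";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}

// Hypothetical convenience wrapper: turns a "call, code" pair such as
// ("clEnqueueNDRangeKernel", -54) into a readable message.
inline std::string describe(const char *call, int code) {
    assert(call != nullptr);
    return std::string(call) + " failed with " + cl_error_name(code);
}
```

With this, `describe("clEnqueueNDRangeKernel", -54)` yields a message naming CL_INVALID_WORK_GROUP_SIZE instead of a bare number.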

ddemidov commented 11 years ago

I probably asked for too large a workgroup. Can you try with the updated gist?

ds283 commented 11 years ago

I get the same result with 256

ddemidov commented 11 years ago

Final try: I am now asking the API for the workgroup size.

ds283 commented 11 years ago

Thanks very much for all this.

This version gives the same error. It reports:

Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
max workgroup size = 1024
clEnqueueNDRangeKernel -54

ddemidov commented 11 years ago

That's strange. The latest version tries to get workgroup size for the specific kernel. What does it output?
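
In the OpenCL API, the device-wide limit is queried with `clGetDeviceInfo` (`CL_DEVICE_MAX_WORK_GROUP_SIZE`) and the limit for one specific kernel with `clGetKernelWorkGroupInfo` (`CL_KERNEL_WORK_GROUP_SIZE`); the per-kernel value can be smaller. The gist's code is not reproduced in this thread, so the following is only a sketch of the host-side arithmetic such a launch might use, assuming the per-kernel limit has already been obtained. `LaunchSize` and `pick_launch_size` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Local (work-group) and global sizes for a one-dimensional launch.
struct LaunchSize {
    std::size_t local;
    std::size_t global;
};

// Clamp the wanted local size to the per-kernel limit, then round the
// global size up to a whole number of work-groups. The kernel itself must
// still guard against indices >= n, because global may exceed n.
inline LaunchSize pick_launch_size(std::size_t n,
                                   std::size_t wanted_local,
                                   std::size_t kernel_max_local) {
    assert(wanted_local > 0 && kernel_max_local > 0);
    const std::size_t local = std::min(wanted_local, kernel_max_local);
    std::size_t groups = (n + local - 1) / local;  // ceiling division
    if (groups == 0) groups = 1;                   // launch at least one group
    return {local, groups * local};
}
```

For a vector of length 5 with a wanted local size of 1024 and a per-kernel limit of 128, this gives a local size of 128 and a global size of 128, so the global size is a multiple of the local size and the local size stays within the kernel's limit.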

ds283 commented 11 years ago

Curious. This time I get

Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
max workgroup size = 128
Bus error: 10

Running it on the GPU gives

GeForce GT 650M
clBuildProgram -11

ddemidov commented 11 years ago

Does the latest commit compile and run? I've changed the loop inside the kernel.

ddemidov commented 11 years ago

I also output compile errors in the latest version. What do you get with GT 650M now?

I am sorry about bothering you; I don't have a Mac to test these on.

ddemidov commented 11 years ago

Another thought: I assumed that you are on a 64-bit system here. Was I correct in this assumption?

ds283 commented 11 years ago

You are certainly not bothering me! This is immensely helpful - thank you for your time.

On the CPU I get

Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
max workgroup size = 128
0

and on the GPU I get

GeForce GT 650M
max workgroup size = 1024
0

I have also tried this version on a different machine (different CPU and GPU), and it works there too.

ddemidov commented 11 years ago

It seems that Apple's OpenCL has problems with the way I organize loops inside compute kernels. 27c7d86 tests if this is the case. Does the original example work with this commit?

ddemidov commented 11 years ago

f556873 is another attempt to resolve the issue (http://wiki.tiker.net/OpenCLOddities). Does it work on a CPU with MacOSX?
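
The thread does not show the kernel source before or after these commits. Purely as an illustration, and not as a reconstruction of what 27c7d86 or f556873 contain, two common ways to organize the element loop inside a compute kernel are a strided loop (each work-item starts at its global id and steps by the global size) and a chunked loop (each work-item handles one contiguous block). The sketch below simulates both on the host in plain C++ and counts how often each index is visited; `visit_strided` and `visit_chunked` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Strided loop: work-item gid visits gid, gid + global, gid + 2*global, ...
inline std::vector<int> visit_strided(std::size_t n, std::size_t global) {
    assert(global > 0);
    std::vector<int> hits(n, 0);
    for (std::size_t gid = 0; gid < global; ++gid)
        for (std::size_t idx = gid; idx < n; idx += global)
            ++hits[idx];
    return hits;
}

// Chunked loop: work-item gid visits one contiguous range of at most
// ceil(n / global) elements; work-items past the end get an empty range.
inline std::vector<int> visit_chunked(std::size_t n, std::size_t global) {
    assert(global > 0);
    std::vector<int> hits(n, 0);
    const std::size_t chunk = (n + global - 1) / global;
    for (std::size_t gid = 0; gid < global; ++gid) {
        const std::size_t begin = std::min(n, gid * chunk);
        const std::size_t end   = std::min(n, begin + chunk);
        for (std::size_t idx = begin; idx < end; ++idx)
            ++hits[idx];
    }
    return hits;
}
```

Both schemes visit every index exactly once, including when the global size is larger than the vector length. They differ only in memory access pattern, which is one reason the same kernel can perform differently on CPU and GPU devices.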

ds283 commented 11 years ago

Thanks for this. f556873 and 27c7d86 both seem to work on the CPU, but 27c7d86 seems to be marginally faster.

ddemidov commented 11 years ago

How did you measure it? What size did you test? Did you #define VEXCL_CACHE_KERNELS to factor out kernel compile time?

f556873 is much faster on my CPU (Intel Q9400) than 27c7d86: it's 5 sec vs 28 sec. The test I used is a Lorenz attractor parameter study with an ensemble of size 1024.

Also, f556873 is the more correct thing to do.
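
Independently of VEXCL_CACHE_KERNELS, one-off startup costs can be kept out of a measurement by running the workload once untimed before starting the clock. A minimal sketch using only the standard library follows; `time_after_warmup` is a hypothetical helper, and it assumes that `work` blocks until the device has actually finished, since otherwise only the enqueue time is measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` once without timing it (warm-up, e.g. first-use kernel
// compilation), then times `repeats` further calls and returns the
// average wall-clock time per call in seconds.
inline double time_after_warmup(const std::function<void()> &work, int repeats) {
    assert(repeats > 0);
    work();  // warm-up call, deliberately excluded from the measurement
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}
```

Comparing two commits this way on the same machine and problem size makes it easier to tell a real kernel difference from a difference in compile or setup time.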

ds283 commented 11 years ago

My testing was not very systematic. I was using my own code, with boost auto_cpu_timer to obtain timings. For that, with VEXCL_CACHE_KERNELS defined, I find:

27c7d86

1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
 1187.641217s wall, 1913.160000s user + 2796.100000s system = 4709.260000s CPU (396.5%)

f556873

1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
 1197.468915s wall, 1974.720000s user + 2758.160000s system = 4732.880000s CPU (395.2%)

For the Lorenz ensemble, I find:

27c7d86

1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
 0.984591s wall, 3.040000s user + 2.550000s system = 5.590000s CPU (567.7%)

f556873

1. Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Apple)
 1.372676s wall, 3.470000s user + 3.240000s system = 6.710000s CPU (488.8%)

ddemidov commented 11 years ago

Out of interest, what was the problem size in your code (the size of the vectors/multivectors you used)?

ds283 commented 11 years ago

It is a 20-equation system of ODEs, and uses a vex::multivector<double, 20> to generate an ensemble of these for different parameter choices. These runs were for 5 different parameters (that's where the weird length 5 came from).

ddemidov commented 11 years ago

Is this a typical ensemble size for your problems? I think you would get much higher performance for sizes this small if you just used plain CPU code. I am afraid that right now you spend most of your time in OpenCL API calls rather than doing anything useful. Switching from VexCL to e.g. Eigen should be easy enough, because Eigen uses expression templates as well, and it should give you a noticeable performance boost.

ds283 commented 11 years ago

Eventually it will need to run for an ensemble of at least 1E5 or 1E6 parameters. At the moment I am using smaller sizes while I debug the implementation.

At the moment it is much faster on the CPU - using OpenMP directly, it takes just O(second). We're developing multiple versions to help decide what the best computational strategy will be with more realistic ensemble sizes.

ddemidov commented 11 years ago

Ok, that makes sense.

Well, the master branch should work with MacOSX now. Thank you very much for reporting the issue, and for your help with testing.