OpenCL and Random Projection

louisdeb commented 6 years ago

I'm having trouble with my implementation of kernel code to speed up the random projection puzzle.

Having put the contents of MakeProjection in the kernel (including for loops), I managed to obtain the correct output, with n=200.

However this implementation is slow. I'm not sure how to address this issue here, so I'll address another one.

When running the puzzle with n>500 (just a test value), the program aborts. I believe this is due to the buffer size, 4*n*n, exceeding the maximum buffer size. Our CL library doesn't contain a call to GetDeviceInfo, so I can't confirm this. I haven't seen others getting this issue, so I assume my implementation is wrong. My reason for using the buffer is to pass a reference to m, aka proj, to the kernel.

Can someone help me out?

nicholzk commented 6 years ago

Have you tried your tests on your local machine and on AWS? I face some issues with my local machine that do not occur on the AWS machine, which I suspect might be due to the memory.

m8pple commented 6 years ago

For a buffer size of 4*n*n I would have thought there would be no problem with n of about 500, as that is only a 1 MB buffer (4*500*500 = 1,000,000 bytes). Most GPUs should easily be able to handle that, though as @nicholzk says, local machines may have flakier OpenCL implementations than the AWS servers.
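To make the size arithmetic concrete, here is a minimal sketch of the byte count for the buffer. The function name projection_buffer_bytes is hypothetical, and the 32-bit element type is an assumption inferred from the factor of 4 in 4*n*n. Doing the multiplication in std::size_t also avoids the wrap-around that plain int arithmetic could hit for much larger n.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed for an n-by-n matrix of 4-byte elements (the "4*n*n" above).
// Assumption: elements are 32-bit words; the function name is illustrative only.
// The arithmetic is done in std::size_t rather than int, so it does not
// overflow a 32-bit signed integer when n grows large.
constexpr std::size_t projection_buffer_bytes(std::size_t n)
{
    return sizeof(std::uint32_t) * n * n;
}
```

For n=500 this gives 1,000,000 bytes, and for n=200 it gives 160,000 bytes, both far below what a typical GPU can allocate.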

It may be some sort of buffer overrun that is only becoming apparent as the program gets bigger, so the kernel is accessing memory off the end of the buffer. You could try over-allocating the buffer sizes (e.g. 4*(n+50)*(n+50)) and see if that helps. If so, then you've probably got a problem there.
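One way to reason about an overrun without touching the GPU is to compute the largest index the kernel could touch. The sketch below assumes a hypothetical kernel that accesses m[y*n + x] for every work-item (x, y) with no bounds guard; it is not the actual kernel from this thread, and the function names are illustrative. If the global range has been rounded up past n (for example to a multiple of the work-group size), the last work-items fall off the end of the buffer.

```cpp
#include <cstddef>

// Largest flat index touched by a hypothetical kernel that accesses m[y*n + x]
// for every work-item (x, y) in a global_x by global_y range, with no bounds guard.
// Precondition: global_x >= 1 and global_y >= 1.
constexpr std::size_t max_index_touched(std::size_t n, std::size_t global_x, std::size_t global_y)
{
    return (global_y - 1) * n + (global_x - 1);
}

// True when that kernel would access an element past the end of an
// n*n element buffer (valid indices are 0 .. n*n - 1).
constexpr bool overruns_buffer(std::size_t n, std::size_t global_x, std::size_t global_y)
{
    return max_index_touched(n, global_x, global_y) >= n * n;
}
```

With n=500, a 500 by 500 range stays in bounds (largest index 249,999), while a range rounded up to 512 by 512 does not (largest index 256,011). In that situation the usual fix is an `if (x < n && y < n)` guard at the top of the kernel, rather than a bigger buffer.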

You could also try using printf from within the kernel to see if it is actually executing, and if so where it stops - that might help diagnose how far it is getting. Similarly, adding logging to identify which OpenCL function call is failing can help (e.g. is it the mem copies or the kernel). You might want to add clFinish before each logging call, to ensure the queued action has actually happened (though remember to take it out once the problem is fixed).

It's possible it is a software bug too, so don't forget to try running under a normal software debugger to see if it can identify the crash point. For example, under AWS you could run it under gdb. So turn on debug symbols by adding -g to the compilation flags, then do:

gdb --args bin/run_puzzle random_projection 500 5

and use the command run to run it. If it crashes with an abort, then you can use bt to look at the stack.

louisdeb commented 6 years ago

Thanks @nicholzk, @m8pple. I have yet to try running it on AWS. I heard that there are issues with GPUs on AWS at the moment? I will try tonight.

turn on debug symbols by adding -g to the compilation flags

I believe it already is turned on, line 3 of the makefile is CPPFLAGS += -std=c++11 -W -Wall -g.

I've noticed that the abort occurs when running on my GPU, but not my CPU. gdb is running the code on the CPU, I assume, since it's not aborting. This means I'm not able to inspect the abort using gdb. How can I set the select device flag within gdb?

Furthermore, allocating a buffer of size 4*(n+50)*(n+50) results in an exception in clEnqueueWriteBuffer (which is using the same buffer size). Printing within the kernel does not seem to reach y=1 before the abort occurs. Hm.

Any tips would be great.

m8pple commented 6 years ago

I'm not aware of any particular problems with running on GPUs right now. The only issue is that about 6% of implementations seem to freeze the AWS instance somehow.

I'm afraid I don't have too many suggestions available at this stage, apart from very carefully checking each buffer size and transfer. If it is not executing the kernel, try commenting out the call to clEnqueueNDRangeKernel. Then try commenting out the buffer transfers (clEnqueueWriteBuffer and clEnqueueReadBuffer). Keep trying to simplify and comment out things until it stops crashing. Then look very carefully at the last thing you took out.

louisdeb commented 6 years ago

Thanks for your help. I think it was just down to poor implementation, and misunderstanding of kernel code.
