Closed: jkn93 closed this issue 8 years ago
So you have 4 contexts -> 4 queues -> 4 devices.
Why don't you use 1 context -> 4 queues -> 4 devices? (See this example: http://dhruba.name/2012/10/14/opencl-cookbook-how-to-leverage-multiple-devices-in-opencl/ )
You can then address your data (matrices A, B, C) by an offset for each device.
On Thu, Feb 11, 2016 at 8:38 AM, jkn93 notifications@github.com wrote:

> I've tried to run a single matrix multiplication C = A x B on multiple GPUs by splitting the columns of matrix B into multiple batches.
> Therefore, I've set up different contexts/command queues for each GPU device and executed the dgemm batches within an OpenMP-parallelized loop (see the code snippet below).
> Using a test system with 250x250 matrices and 4 GPUs (B-column batches: 62, 62, 62, 64), the following error pops up:
>
> OpenCL error -34 on line 281 of /clBLAS-master/src/library/blas/xgemm.cc:
> /clBLAS-master/src/library/blas/xgemm.cc:281: void enqueueGemmKernel(cl_command_queue, cl_kernel, void**, size_t*, unsigned int, const size_t*, const size_t*, cl_uint, _cl_event* const*, _cl_event**): Assertion `false' failed.
>
> However, no errors pop up if the call to clblasDgemm is within an omp critical section.
>
> // skipped error-checks... (the template arguments below were eaten by the HTML quoting)
> std::vector theAs(ocl_devices.size());
> std::vector theBs(ocl_devices.size());
> std::vector theCs(ocl_devices.size());
> std::vector cxGPUContext(ocl_devices.size());
> std::vector commandQueue(ocl_devices.size());
> // setup ctx/com and allocated mem on devs
> #pragma omp parallel for default(shared) schedule(dynamic)
> for(size_t ii=0;ii<ocl_devices.size();ii++){ ... (rest of the snippet truncated in the quote)

Tingxing Dong
Thanks for the interesting link, although it also suggests my approach is a valid one. Anyway, in my target application I have to deal with far larger matrices, i.e., several GB per matrix in the more extreme cases, so I need control over memory allocation; the prospect of 'lazy memory allocation' isn't very reassuring.
@jkn93 clBLAS does not advertise itself as thread-safe. From looking at the source code I am pretty sure it is NOT thread-safe. You can file a feature request, but this issue will not be trivial to fix.
@pavanky Thank you for this crucial info. It's a pity, though.
BTW, according to the 'Library and API documentation':

> This library is entirely thread-safe with the exception of the following API : clblasSetup and clblasTeardown. Developers using the library can safely using any blas routine from different thread.
I kind of relied on that at first. Even without it being mentioned, I guess most people simply assume thread safety nowadays. To help other developers, and if thread safety is not a major concern in the development of clBLAS, I'd suggest stressing that point more clearly, perhaps in README.md.
@jkn93 I did not realize this is how it is documented. There has been a lot of internal overhaul that is not thread-safe.
@kknox @tingxingdong @TimmyLiu @guacamoleo
Do you guys want to change the documentation to reflect this ?
for(size_t isub=0;isub<nsub;isub++){
  for(size_t ii=0;ii<ocl_devices.size();ii++){

Are these two loops independent? If so, can you switch their order and put the ocl_devices.size() loop outside, so that

cxGPUContext[ii] = clCreateContext(0, 1, &theID, NULL, NULL, &ciErrNum2);
clEnqueueWriteBuffer
clblasDgemm()
clEnqueueReadBuffer

are aggregated together per device?
@tingxingdong
Don't I need the context for the allocation of the device memory? Furthermore, this is only a test routine; within the production code there are other places where I definitely need thread safety.
Hey jkn93. I may have been the one who broke thread safety in the gemm code. I'd like to ask you a few questions to help me root-cause the problem. 0) Are you building clBLAS from source, or did you download the already-compiled library binary?
If you built it from source:
1) I see you're getting error -34. In your copy of cl.h, does -34 correspond to CL_INVALID_CONTEXT?
2) What command is on line 281 of /clBLAS-master/src/library/blas/xgemm.cc?
If you just downloaded the library binary: 1) What version did you download?
These questions will help me determine whether the problem is that clBLAS does not support multiple different contexts. I think that is the problem; however, the fact that a critical section fixes it refutes that assumption, because inside the critical section you are still using multiple different contexts.
@jkn93 On a related note, clBLAS calls are non-blocking, so putting all clBLAS calls inside a mutex lock should not degrade performance.
@guacamoleo Hope the info helps... @pavanky That's certainly what I'd hope. However, some parts of my code where I would consider using clblasDgemm are rather complex, and it would be tedious to pass the same mutex around.
0) Are you building clBLAS from source or did you just download the library binary already compiled?
Build from source
If you built it from source: 1) I see you're getting error -34. In your copy of cl.h, does -34 correspond to CL_INVALID_CONTEXT?
yes
2) What command is on line 281 of /clBLAS-master/src/library/blas/xgemm.cc; is it clSetKernelArg or clEnqueueNDRangeKernel?
279-281: clEnqueueNDRangeKernel. However, I think I had added some printfs, so the line numbers may be shifted. The lines are part of the following routine:
void enqueueGemmKernel(
    cl_command_queue clQueue,
    cl_kernel clKernel,
    void **kernelArgs,
    size_t *kernelArgSizes,
    unsigned int numKernelArgs,
    const size_t *globalWorkSize,
    const size_t *localWorkSize,
    cl_uint numEventsInWaitList,
    const cl_event *eventWaitList,
    cl_event *clEvent)
{
  for (unsigned int i = 0; i < numKernelArgs; i++) {
    CL_CHECK( clSetKernelArg( clKernel, i, kernelArgSizes[i], kernelArgs[i]) )
  }
  /*printf("global={%llu, %llu} local={%llu, %llu}\n",
      globalWorkSize[0], globalWorkSize[1],
      localWorkSize[0], localWorkSize[1] );*/
  CL_CHECK( clEnqueueNDRangeKernel( clQueue, clKernel,
      2, NULL, globalWorkSize, localWorkSize,
      numEventsInWaitList, eventWaitList, clEvent ) )
}
We have just merged in a fix to the develop branch which should fix all GEMM thread safety issues; please test and re-issue bug if not resolved.
While I am trying to build from source, would you mind updating the release as well? https://github.com/clMathLibraries/clBLAS/releases I had a similar problem (-35) using the release code for Fiji.
Thanks.
I could generate a package from master (hurray). When I run gemm on a single GPU it is OK, but when I run gemm using two threads and two GPUs it seems to run in software.
I am testing clBLAS on a Radeon Pro Duo; it would be nice to use both GPUs concurrently. Would you mind sharing a release as well, so there are no concerns about how I built it on my Ubuntu box?
Thank you @guacamoleo
An example using clBLAS-2.11.0-Linux-x64 (package from master):
./sgemm2 2 10000 10000 10000 10 0
----------> get time 4.755824e+01 sec <------ 2 GFLOPS 420.537027

Using the Fiji release clBLAS-2.10.0-Fiji-Linux-x64-CL2.0 directly:
OpenCL error -45 on line 281 of /home/fpadmin/Timmy/clBLAS2-10/clBLAS/src/library/blas/xgemm.cc
Segmentation fault (core dumped)

With the old /opt/clAmdBlas-1.10.321:
----------> get time 8.035512e+00 sec <------ 2 GFLOPS 2,488.951544
how can I open an issue ?
Are you using master or development branch? We put a fix into our code for multi-threaded applications, but it’s probably only in the development branch, not in the master branch.
David E. Tanner
Sr Software Engineer | Radeon Technologies Group – Open Compute
I've tried to run a single matrix multiplication C = A x B on multiple GPUs by splitting the columns of matrix B into multiple batches.
Therefore, I've set up different contexts/command queues for each GPU device and executed the dgemm batches within an OpenMP-parallelized loop (see the code snippet quoted earlier in this thread).
Using a test system with 250x250 matrices and 4 GPUs (B-column batches: 62, 62, 62, 64), the following error pops up:

OpenCL error -34 on line 281 of /clBLAS-master/src/library/blas/xgemm.cc (an Assertion `false' failure in enqueueGemmKernel).

However, no errors pop up if the call to clblasDgemm is within an omp critical section.