clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

thread-safety fail: clblasDgemm assertion fail within openmp-environment using multiple GPUs #226

Closed jkn93 closed 8 years ago

jkn93 commented 8 years ago

I've tried to run a single matrix multiplication C = A x B on multiple GPUs by splitting the columns of matrix B into multiple batches.

To that end, I've set up a separate context/command queue for each GPU device and executed the dgemm batches within an OpenMP-parallelized loop (see code snippet below).

Using a test system with 250x250 matrices and 4 GPUs (B-column batches: 62, 62, 62, 64), the following error pops up:

OpenCL error -34 on line 281 of <path>/clBLAS-master/src/library/blas/xgemm.cc
<binary>: <path>/clBLAS-master/src/library/blas/xgemm.cc:281: void enqueueGemmKernel(cl_command_queue, cl_kernel, void**, size_t*, unsigned int, const size_t*, const size_t*, cl_uint, _cl_event* const*, _cl_event**): Assertion `false' failed.

However, no errors pop up if the call to clblasDgemm is inside an omp critical section.

  // skipped error-checks...
  std::vector<cl_mem> theAs(ocl_devices.size());
  std::vector<cl_mem> theBs(ocl_devices.size());
  std::vector<cl_mem> theCs(ocl_devices.size());
  std::vector<cl_context> cxGPUContext(ocl_devices.size());
  std::vector<cl_command_queue> commandQueue(ocl_devices.size());

  // set up context/queue and allocate memory on each device
  #pragma omp parallel for default(shared) schedule(dynamic)
  for(size_t ii=0;ii<ocl_devices.size();ii++){
    cl_device_id theID = ocl_devices[ii];
    cl_int ciErrNum2 = CL_SUCCESS;
    cxGPUContext[ii] = clCreateContext(0, 1, &theID, NULL, NULL, &ciErrNum2);
    #ifdef CL_VERSION_2_0
      commandQueue[ii] = clCreateCommandQueueWithProperties(cxGPUContext[ii], theID, NULL, &ciErrNum2);
    #else
      commandQueue[ii] = clCreateCommandQueue(cxGPUContext[ii], theID, 0, &ciErrNum2);
    #endif

    theAs[ii] = clCreateBuffer(cxGPUContext[ii], CL_MEM_READ_ONLY, ( nra*nca*sizeof(double)), NULL, &ciErrNum2);
    theBs[ii] = clCreateBuffer(cxGPUContext[ii], CL_MEM_READ_ONLY, ( nrb*(ncol_per_gpu+nrest)*sizeof(double)), NULL, &ciErrNum2);
    theCs[ii] = clCreateBuffer(cxGPUContext[ii], CL_MEM_READ_WRITE, ( nrc*(ncol_per_gpu+nrest)*sizeof(double)), NULL, &ciErrNum2);

    clEnqueueWriteBuffer(commandQueue[ii], theAs[ii], CL_TRUE, 0, ( nra*nca*sizeof(double)), Amat, 0, NULL, NULL);
  }

  // execute batches
  for(size_t isub=0;isub<nsub;isub++){
    #pragma omp parallel for default(shared) schedule(dynamic)
    for(size_t ii=0;ii<ocl_devices.size();ii++){
      cl_int ciErrNum2 = CL_SUCCESS;
      size_t N = ncols[isub*ocl_devices.size()+ii];
      clEnqueueWriteBuffer(commandQueue[ii], theBs[ii], CL_TRUE, 0, ( nrb*N*sizeof(double)),
                           Bmat+nrb*ncol_per_gpu*(isub*ocl_devices.size()+ii), 0, NULL, NULL);
      cl_event event = NULL;
      // works if the following call is inside an omp critical section
      clblasDgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K, one, theAs[ii],
                  0, lda, theBs[ii], 0, ldb, zero, theCs[ii], 0, ldc, 1, &commandQueue[ii], 0, NULL, &event);

      ciErrNum2 = clWaitForEvents(1, &event);
      ciErrNum2 = clEnqueueReadBuffer(commandQueue[ii], theCs[ii], CL_TRUE, 0, nrc*N*sizeof(double),
                                      Cmat+nrc*ncol_per_gpu*(isub*ocl_devices.size()+ii), 0, NULL, NULL);
    }
  }
tingxingdong commented 8 years ago

So you have 4 contexts -> 4 queues -> 4 devices.

Why don't you use 1 context -> 4 queues -> 4 devices? (see this example: http://dhruba.name/2012/10/14/opencl-cookbook-how-to-leverage-multiple-devices-in-opencl/ )

You can then address your matrices A, B, C within shared buffers via an offset.


jkn93 commented 8 years ago

Thanks for the interesting link, though it also suggests my approach is a valid one. Anyway, in my target application I have to deal with far larger matrices, i.e., several GB per matrix in the more extreme cases, so I need control over memory allocation; the prospect of 'lazy memory allocation' isn't very reassuring.

pavanky commented 8 years ago

@jkn93 clBLAS does not advertise itself as thread safe. From looking at the source code I am pretty sure it is NOT thread safe. You can file a feature request, but this issue will not be trivial to fix.

jkn93 commented 8 years ago

@pavanky Thank you for this crucial info. It's a pity, though.

BTW, according to the 'Library and API documentation':

This library is entirely thread-safe with the exception of the following API : clblasSetup and clblasTeardown. Developers using the library can safely using any blas routine from different thread.

I kind of relied on that at first. Even without it being mentioned, I guess most people will simply assume thread-safety nowadays. In order to help other developers, and if thread-safety is not a major concern in the development of clBLAS, I'd suggest stressing that point more clearly, e.g. in README.md.

pavanky commented 8 years ago

@jkn93 I did not realize this is how it is documented. There has been a lot of internal overhaul that is not thread safe.

@kknox @tingxingdong @TimmyLiu @guacamoleo

Do you guys want to change the documentation to reflect this ?

tingxingdong commented 8 years ago

for(size_t isub=0;isub<nsub;isub++){ for(size_t ii=0;ii<ocl_devices.size();ii++){

Are these two loops independent?

If so, can you switch their order and put the ocl_devices.size() loop outside, so that

cxGPUContext[ii] = clCreateContext(0, 1, &theID, NULL, NULL, &ciErrNum2);

clEnqueueWriteBuffer

clblasDgemm()

clEnqueueReadBuffer

aggregate together per device?


jkn93 commented 8 years ago

@tingxingdong

Don't I need a context for the allocation of the device memory? Furthermore, this is only a test routine; in the production code there are other instances where I definitely need thread-safety.

guacamoleo commented 8 years ago

Hey jkn93. I may have been the one who broke thread safety in the gemm code. I'd like to ask you a few questions to help me root-cause the problem. 0) Are you building clBLAS from source or did you just download the already compiled library binary?

If you built it from source: 1) I see you're getting error "-34". In your copy of cl.h, does "-34" correspond to CL_INVALID_CONTEXT? 2) What command is on line 281 of /clBLAS-master/src/library/blas/xgemm.cc; is it clSetKernelArg or clEnqueueNDRangeKernel?

If you just downloaded the library binary: 1) What version did you download?

These questions will help me determine whether the problem is that clBLAS does not support multiple different contexts. I think it is; however, the fact that a critical section fixes the problem undermines that assumption, because inside the critical section you are still using multiple different contexts.

pavanky commented 8 years ago

@jkn93 On a related note, clBLAS calls are non-blocking. I think putting all clBLAS calls behind a mutex lock should not degrade performance.

jkn93 commented 8 years ago

@guacamoleo Hope the info below helps... @pavanky That's what I'd certainly hope. However, some parts of my code where I would consider using clblasDgemm are rather complex, and it would be rather tedious to pass the same mutex around.

0) Are you building clBLAS from source or did you just download the library binary already compiled?

Build from source

If you built it from source: 1) I see you're getting error "-34". In your copy of cl.h, does "-34" correspond to CL_INVALID_CONTEXT?

yes

2) What command is on line 281 of /clBLAS-master/src/library/blas/xgemm.cc; is it clSetKernelArg or clEnqueueNDRangeKernel?

Lines 279-281: clEnqueueNDRangeKernel. However, I think I've added some printf statements, so the line numbers may be slightly off. The lines are part of the following routine:

 void enqueueGemmKernel(
   cl_command_queue clQueue,
   cl_kernel clKernel,
   void **kernelArgs,
   size_t *kernelArgSizes,
   unsigned int numKernelArgs,
   const size_t *globalWorkSize,
   const size_t *localWorkSize,
   cl_uint numEventsInWaitList,
   const cl_event *eventWaitList,
   cl_event *clEvent)
 {
   for (unsigned int i = 0; i < numKernelArgs; i++) {
     CL_CHECK( clSetKernelArg( clKernel, i, kernelArgSizes[i], kernelArgs[i]) )
   }
   /*printf("global={%llu, %llu} local={%llu, %llu}\n",
     globalWorkSize[0], globalWorkSize[1],
     localWorkSize[0], localWorkSize[1] );*/
   CL_CHECK( clEnqueueNDRangeKernel( clQueue, clKernel,
      2, NULL, globalWorkSize, localWorkSize,
      numEventsInWaitList, eventWaitList, clEvent ) )
 }
guacamoleo commented 8 years ago

We have just merged in a fix to the develop branch which should fix all GEMM thread safety issues; please test and re-issue bug if not resolved.

paolodalberto commented 8 years ago

While I am trying to build from source, would you mind updating the release as well? https://github.com/clMathLibraries/clBLAS/releases I had a similar problem (-35) using the release code on Fiji.

thanks

paolodalberto commented 8 years ago

I could generate a package from master (hurray). When I run gemm on a single GPU it is OK, but when I run gemm using two threads and two GPUs, it runs in software.

I am testing clBLAS on a Radeon Pro Duo; it would be nice to use both GPUs concurrently. Would you mind sharing a release as well, so there are no concerns about how I built it on my Ubuntu box?

Thank you @guacamoleo

paolodalberto commented 8 years ago

An example using clBLAS-2.11.0-Linux-x64 (package from master):

./sgemm2 2 10000 10000 10000 10 0
get time 4.755824e+01 sec, 2 GPUs, GFLOPS 420.537027

Using the Fiji release clBLAS-2.10.0-Fiji-Linux-x64-CL2.0 directly:

OpenCL error -45 on line 281 of /home/fpadmin/Timmy/clBLAS2-10/clBLAS/src/library/blas/xgemm.cc
Segmentation fault (core dumped)

With the old /opt/clAmdBlas-1.10.321:

get time 8.035512e+00 sec, 2 GPUs, GFLOPS 2,488.951544

How can I open an issue?

guacamoleo commented 8 years ago

Are you using the master branch or the develop branch? We put a fix into our code for multi-threaded applications, but it's probably only in the develop branch, not in master.

David E. Tanner


Sr Software Engineer | Radeon Technologies Group – Open Compute
