clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0
839 stars 240 forks source link

Memory leak on setup/teardown in 2.11 #247

Open hughperkins opened 8 years ago

hughperkins commented 8 years ago

To reproduce, run:

#include <iostream>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <clBLAS.h>
#include <stdlib.h>
using namespace std;

  cl_int err;
  cl_platform_id platform = 0;
  cl_device_id device = 0;
  cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
  cl_context ctx = 0;
  cl_command_queue queue = 0;
  cl_mem bufA, bufB, bufC;
  cl_event event = NULL;
  int ret = 0;

void clgemm(int colmaj, char transAchar, char transBchar, int M, int N, int K, float alpha, float *A, int lda,
     float *B, int ldb, float beta, float *C, int ldc, float *result) {
clblasTranspose transA = transAchar == 'n' ? clblasNoTrans : clblasTrans;
clblasTranspose transB = transBchar == 'n' ? clblasNoTrans : clblasTrans;

size_t off = 0;
size_t offA = 0;
size_t offB = 0;
size_t offC = 0;

clblasOrder order;
if(colmaj == 1 ) {
  order = clblasColumnMajor;
} else {
  order = clblasRowMajor;
}

  bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                        NULL, &err);
  bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                        NULL, &err);
  bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                        NULL, &err);

  err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
      M * K * sizeof(*A), A, 0, NULL, NULL);
  err = clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0,
      K * N * sizeof(*B), B, 0, NULL, NULL);
  err = clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0,
      M * N * sizeof(*C), C, 0, NULL, NULL);

  err = clblasSgemm(order, transA, transB, M - off, N - off, K - off,
                       alpha, bufA, offA, lda,
                       bufB, offB, ldb, beta,
                       bufC, offC, ldc,
                       1, &queue, 0, NULL, &event);
  if (err != CL_SUCCESS) {
      printf("clblasSgemmEx() failed with %d\n", err);
      ret = 1;
      exit(1);
  }
  else {
      err = clWaitForEvents(1, &event);
      err = clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0,
                                M * N * sizeof(*result),
                                result, 0, NULL, NULL);
      clReleaseEvent(event);
  }

  clReleaseMemObject(bufC);
  clReleaseMemObject(bufB);
  clReleaseMemObject(bufA);

}

bool test1(int colmaj, int M, int N, int K, int transAint, int transBint) {
  char transa = transAint == 1 ? 't' : 'n';
  char transb = transBint == 1 ? 't' : 'n';

  float alpha = 1; int lda = 1; int ldb = 1; int ldc = 1; float beta = 0;
  float *A = new float[M * K];
  float *B = new float[K * N];
  float *C = new float[M * N];
  float *clout = new float[M * N];
  clgemm(colmaj, transa, transb, M, N, K, alpha, A, lda,
     B, ldb, beta, C, ldc, clout);
  delete[] A;
  delete[] B;
  delete[] C;
  delete[] clout;
}

int main(int argc, char *argv[]) {
  clewInit();

  err = clGetPlatformIDs(1, &platform, NULL);
  if (err != CL_SUCCESS) {
      printf( "clGetPlatformIDs() failed with %d\n", err );
      return 1;
  }
  cout << "got platforms" << endl;

  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  if (err != CL_SUCCESS) {
      printf( "clGetDeviceIDs() failed with %d\n", err );
      return 1;
  }

  props[1] = (cl_context_properties)platform;
  for(int i = 0; i < 100000; i++ ) {
    cout << "i " << i << endl;
    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    queue = clCreateCommandQueue(ctx, device, 0, &err);
    clblasSetup();
    test1(1, 1, 1, 1, 0, 0);
    clblasTeardown();
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
  }

  return 0;
}

Monitor memory usage whilst running, and ctrl-c out within ~5-10seconds, to prevent computer freezing.

hughperkins commented 8 years ago

After commenting out a bunch of stuff around the place, fairly sure that there is a leak which is not in any of the following places (which doesnt mean these places dont contain an additional leak):

Edit1:

hughperkins commented 8 years ago

Seems like it's something to do with the map, in xgemm.cc, which increases in size, without limit:

Add to xgemm.cc, after line (*kernel_map)[key] = *clKernel;

printf("map size after put %i\n", kernel_map->size());

Result:

map size after put 1
map size after put 2
map size after put 3
map size after put 4
map size after put 5
map size after put 6
...
hughperkins commented 8 years ago

It's a static map, embedded in a function, so it will be challenging to figure out a workable solution I think?

hughperkins commented 8 years ago

I think one way to handle this, without having to think about threading, and contexts and stuff is, we have like a 'setup version number', which is incremented each time we call setup (or maybe teardown, but comes to the same thing). In makeGemmKernel, it has a static int with the last seen setup version number. If it doesnt match the current setup version number, then makeGemmKernel .... oh... hmmm. its too late to release those kernels actually... :-(

hughperkins commented 8 years ago

Per http://stackoverflow.com/questions/15067160/stdmap-thread-safety/15067564#15067564 , you can read/write a map from different threads, as long as you dont access the same items, or iterate. Sooo... pondering:

hughperkins commented 8 years ago

(Not sure if this would work on Windows easily though :-( Seems like this is non-obvious to do http://stackoverflow.com/questions/1679243/getting-the-thread-id-from-a-thread )

pavanky commented 8 years ago

@hughperkins the map needs to be thread local static variable. Otherwise you'll end up with a bug or compile the kernels repeatedly. I think the better option would be to clear the map from teardown.

OK I see that's exactly what you suggested. I don't know how thread Id works with non C++11 threads (for example if someone is using openmp instead of c++11 threads).

But I have to ask, in what situation would you really need to call setup and teardown repeatedly.. You would only ever need to call them once in any program.

hughperkins commented 8 years ago

Well, in my unit tests for example. Since I need to test kernel creation, caching etc, I need to tear down the opencl context at the end of each test, and create a new one for the new one. Not doing this would workaround the issue, but mean my tests miss a whole bunch of issues in my own code.

hughperkins commented 8 years ago

How to clear the map from teardown? Since the map is thread-local, seems that the teardown wont have access to other thread's maps? Or, you mean, each thread should be calling teardown?

pavanky commented 8 years ago

@hughperkins I updated my earlier comment. This is an interesting problem. An alternative to using thread id would be to register the map during construction in a mutex locked variable (like an std::vector of pointers to the map). When teardown is called, you can just go through the registry and delete all of them.

hughperkins commented 8 years ago

Yes, mutex-looked variable sounds correct. Since it sounds like a bunch of work, and socialization and so on, and since my own use-cases is single-threaded for now, I think I'm just going to create my own fork, move the map to being non-threadlocal, and clear it in teardown. If I ever have to think about threading, I'll put a mutex around it, and create a PR into develop branch.

pavanky commented 8 years ago

@hughperkins no reason to fork. I can send in a patch that'll fix this in < 1 hour. Try to be as close to upstream as possible :)

hughperkins commented 8 years ago

Oh, nice! :-)

pavanky commented 8 years ago

@hughperkins me sending a patch doesnt mean it will be merged immediately. You can test the patch though.

hughperkins commented 8 years ago

Yes, understood :-)

hughperkins commented 8 years ago

So, I actually created a non-mutexed, single-threaded teardown for xgemm map https://github.com/clMathLibraries/clBLAS/compare/develop...hughperkins:xgemm_teardown?expand=1 However, that doesnt stop the leak. I havent measured the exact leak rate, and I assume it's gone down a bit, but it's certainly not zero yet. Perhaps I need to measure the leak rate...

hughperkins commented 8 years ago

So, I took some measurements of memory usage. Turns out that the memory leak is about 70MB per setup/sgemm/teardown cycle. The sucky thing is that this is true with or without xgemm map teardown :-(

hughperkins commented 8 years ago

(Note: seems I had the order ot releasequeue and releasecontext in the wrong order earlier; corrected; but issue persists) Inspecting memory usage step by step. Without gemm, memory usage changes approixmatley like:

So, the memory change is all associated with setting up / tearing down context. The strange thing is, after calling gemm, in this version of clblas, releasing the context no longer releases the memory:

(Edit: seems like clGetContextInfo with REFERNCE_COUNT will be useful https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetContextInfo.html )

(Edit2: reference count seems to be always reported as 1. I wonder if it's a NOP on certain GPUs?)

Edit3: so I guess it's like, per https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clReleaseContext.html, the context is either entirely deleted, or not deleted at al, all or nothing. So, if there are any references left over, it wont be deleted. releasing the kernel is essential, and there's probably some other object(s) somewhere not being deleted either...

Edit4: oops, missing a clReleaseEvent in my code above :-P Added in.

hughperkins commented 8 years ago

Yay, leak gone :-) I had to do the following things to remove it: