clMathLibraries / clBLAS

A software library containing BLAS functions written in OpenCL (Apache License 2.0).

Fast parallel GEMV #129

Open hunse opened 9 years ago

hunse commented 9 years ago

I'm interested in doing a lot of GEMV operations in parallel (all with different matrices and vectors). clBLAS can do this, but I find it's faster if I write my own kernel that does all the GEMV operations at once. Is there any way I can send these operations to clBLAS to make them execute more in parallel and get better throughput on my device? For a single GEMV operation, clBLAS is much faster than my custom kernel.
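
To illustrate, here is a rough sketch of the kind of single-kernel batched GEMV I have in mind (not my actual code; it assumes all matrices and vectors are packed row-major into single buffers, with per-problem offsets and sizes supplied in auxiliary arrays):

    __kernel void batched_sgemv(
        __global const float *A,    /* all matrices, packed row-major     */
        __global const float *x,    /* all input vectors, packed          */
        __global float       *y,    /* all output vectors, packed         */
        __global const int   *offA, /* per-problem offsets into A         */
        __global const int   *offX, /* per-problem offsets into x         */
        __global const int   *offY, /* per-problem offsets into y         */
        __global const int   *Ms,   /* per-problem row counts             */
        __global const int   *Ns)   /* per-problem column counts          */
    {
        const int batch = get_global_id(1); /* which GEMV problem         */
        const int row   = get_global_id(0); /* which row of that problem  */
        const int M = Ms[batch], N = Ns[batch];
        if (row >= M) return;

        __global const float *Ab = A + offA[batch] + row * N;
        __global const float *xb = x + offX[batch];

        float acc = 0.0f;
        for (int j = 0; j < N; ++j)
            acc += Ab[j] * xb[j];
        y[offY[batch] + row] = acc;
    }

One 2D NDRange (rows by batch entries) then covers every GEMV in a single launch, at the cost of idle work-items when the problems have very different sizes.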

pavanky commented 9 years ago

How big are the GEMV operations?

hunse commented 9 years ago

They can vary quite a bit, but they are generally on the smaller side (as far as these things go), with matrices roughly 100 x 100 to 1000 x 1000.

pavanky commented 9 years ago

I could have sworn that clBLAS had a batched mode for sgemm / sgemv, but I cannot seem to find it. Maybe I am misremembering things from cuBLAS.

You could try creating a unique queue for each operation.

hunse commented 9 years ago

I think cuBLAS has a batched mode; maybe you're thinking of that. I tried using multiple queues, but it doesn't help :(

pavanky commented 9 years ago

Can you update the original post with sample code, hardware info, and the times you are seeing? The best person to answer this would be @TimmyLiu, but they are a busy lot at AMD :)

I will try to experiment a little on my end as well in the meantime.

TimmyLiu commented 9 years ago

There is no batched API as of now. You can use multiple queues, but there is a catch (someone in my group experimented with this recently): you have to call clFlush() on each queue before any blocking call such as clWaitForEvents() to get your kernel calls to overlap. Otherwise, I think the multiple queues will be serialized.

hunse commented 9 years ago

I think my problem is that I have so many operations (kernels), and they're all so small (in terms of data processed), that even with multiple kernels running in parallel it still takes a long time. Only 16 kernels can run in parallel (I'm not sure whether the device or OpenCL imposes this limit), which means that with many kernels (>> 16), a lot of work still happens serially, whereas one big kernel can do it all at once.

It would be really great if I could use the clBLAS GEMV code as part of my one big (batch) kernel, rather than writing my own batched GEMV kernel (since my kernel is nowhere near as fast as the clBLAS one for single operations). I guess doing this would be equivalent to writing a batched API of sorts. How feasible would this be? Is it just a matter of wrapping the existing code in some way, or would everything pretty much have to be rewritten?

pavanky commented 9 years ago

GEMV should ideally be limited by memory bandwidth. Can you calculate the performance (in GB/s) you are getting from your kernel as opposed to GEMV in clBLAS?
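
(For reference, one rough way to compute that number, assuming single precision and that A and x are each read once while y is read and written once; the helper below is illustrative, not part of clBLAS:)

    /* Rough effective bandwidth of one SGEMV (y = alpha*A*x + beta*y).
     * elapsed_s is the measured kernel time in seconds, e.g. from profiling events. */
    double gemv_gbps(size_t M, size_t N, double elapsed_s)
    {
        double bytes = sizeof(float) * ((double)M * N + N + 2.0 * M);
        return bytes / elapsed_s / 1.0e9;
    }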

kknox commented 9 years ago

You should be able to use the xGEMM interface as a batched xGEMV interface. This implies a memory preprocessing/postprocessing step on the host side, to swizzle the vectors in a strided fashion into a single matrix buffer. The question is whether the computational gain you get by batching your vectors outweighs the preprocessing/postprocessing cost. Possibly, you could optimize your host app to operate on the strided vectors directly and minimize the stride-conversion steps.
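
For the simplest case, where the same A is shared by every GEMV, the idea might look roughly like the sketch below (buffer and scalar names follow the other snippets in this thread and are assumed to be set up already): pack the x vectors as the columns of an N x batchSize matrix X, and the y results come out as the columns of an M x batchSize matrix Y.

    /* One GEMM computes the whole batch: Y = alpha * A * X + beta * Y,
     * with the batched x vectors as columns of X and results as columns of Y. */
    err = clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans,
        M,              /* rows of A and Y            */
        batchSize,      /* columns of X and Y         */
        N,              /* columns of A, rows of X    */
        alpha,
        bufA, 0, M,     /* lda = M (column-major A)   */
        bufX, 0, N,     /* ldb = N                    */
        beta,
        bufY, 0, M,     /* ldc = M                    */
        1, &queue, 0, NULL, &event);

When every GEMV has a different A, the packed matrix becomes block diagonal, which is the concern raised further down.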

bhack commented 9 years ago

/cc @naibaf7

hunse commented 9 years ago

I'm having trouble even just getting simple kernels to run simultaneously on my NVIDIA card (GTX 780). There was a lot of discussion a number of years ago about whether NVIDIA supported concurrent kernels at all, and it seems like the newer cards (Fermi and later) do. But there was still a lot of debate about whether their OpenCL drivers support it, and I couldn't find any conclusive evidence that they do.

@kknox: I'm not exactly sure what you mean. If my matrix is different for every GEMV operation, then wouldn't my A matrix for GEMM have to be block diagonal? That would result in a very sparse matrix, and I'm not convinced it would be much faster. Is there some other way to set it up?

naibaf7 commented 9 years ago

@hunse Did you try multiple queues? I see you mentioned it, but it did not work correctly? I am currently working on porting Caffe to OpenCL (it's basically done, but not fully optimized): https://github.com/naibaf7/caffe and https://github.com/BVLC/caffe/pull/2610

I use up to 8 OpenCL queues and launch the GEMM calls in a round-robin fashion. With 10 GEMM calls, the observed speedup was 30% on a GTX 980 compared to launching all the calls on a single OpenCL queue. A similar effect was visible on a W9100, although not as pronounced as on the GTX 980.
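
The dispatch pattern is roughly the following (a sketch, not the actual code from the branch; variable names are illustrative):

    /* Round-robin the GEMM calls over a small pool of command queues,
     * flush every queue, then wait on all events so the kernels can overlap. */
    enum { NUM_QUEUES = 8 };
    for (int i = 0; i < numCalls; ++i) {
        err = clblasSgemm(order, transA, transB, M, N, K, alpha,
            bufA[i], 0, lda, bufB[i], 0, ldb, beta,
            bufC[i], 0, ldc,
            1, &queue[i % NUM_QUEUES], 0, NULL, &event[i]);
    }
    for (int q = 0; q < NUM_QUEUES; ++q)
        clFlush(queue[q]);
    err = clWaitForEvents(numCalls, event);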

hunse commented 9 years ago

I've tried multiple queues, flushing them and everything, and it doesn't seem to make any difference.

TimmyLiu commented 9 years ago

I am not sure whether NVIDIA's driver forces serialization of the queues. Do you have an AMD device on which you can try the multiple-queue approach? In any case, it might be beneficial to have a batched API for certain important routines.

TimmyLiu commented 9 years ago

Hi hunse, I did some experiments around this. After creating one queue per batch entry, I had a for loop calling sgemv. By calling clFlush and then clWaitForEvents, I was able to see kernels running concurrently:

    /* Enqueue one SGEMV per batch entry, each on its own command queue. */
    for (int i = 0; i < batchSize; i++)
    {
        err = clblasSgemv(order, transA, M, N, alpha,
            bufA[i], offA, lda, bufX[i], offx, incx, beta,
            bufY[i], offy, incy, 1, &queue[i], 0, NULL, &event[i]);
    }
    /* Flush every queue before blocking so the kernels can overlap. */
    for (int i = 0; i < batchSize; i++)
    {
        clFlush(queue[i]);
    }
    /* event is an array of cl_event, so pass it directly (not &event). */
    err = clWaitForEvents(batchSize, event);

You can see in this CodeXL screenshot that some of the kernels overlap. [screenshot: batched4096]

In the above example M and N are fairly large (4096) for each batch entry. I noticed that if I reduce the matrix size, the kernel overlap is less obvious; if I choose M=N=512, for example, I don't see the kernels overlap in CodeXL. This is probably because the kernel-launch overhead is more significant for smaller matrix sizes.

Changing the above code to something like:

    /* A user event gates every SGEMV so none can start until all are enqueued. */
    cl_event user_event = clCreateUserEvent(ctx, &err);

    for (int i = 0; i < batchSize; i++)
    {
        err = clblasSgemv(order, transA, M, N, alpha,
            bufA[i], offA, lda, bufX[i], offx, incx, beta,
            bufY[i], offy, incy, 1, &queue[i], 1, &user_event, &event[i]);
    }

    /* Release the gate, flush every queue, then wait for all the results. */
    clSetUserEventStatus(user_event, CL_COMPLETE);

    for (int i = 0; i < batchSize; i++)
    {
        clFlush(queue[i]);
    }
    err = clWaitForEvents(batchSize, event);

I can sometimes capture kernel overlap in CodeXL. [screenshot: batched512]

Hope this helps.

hunse commented 9 years ago

Thanks @TimmyLiu, that does help. I think for what I'm doing, I need some way to put all my GEMVs into one kernel call. Since I have so many small ones, even if I were able to get a number of kernels running concurrently, I still don't think I'd get very good occupancy. As you've shown, the overhead of launching kernels is significant, and even if it weren't, I'd still be able to have at most 16 simultaneous kernels.

How difficult would it be to create a batched kernel for GEMV?

naibaf7 commented 9 years ago

@TimmyLiu While we're on it, batched row-major NN-SGEMM would also be required to match the performance of cuDNN on https://github.com/naibaf7/caffe (Caffe OpenCL).

So @hunse is definitely not the only one who would profit from batched kernels; so would machine learning / neural network development as a whole.

But I am not sure what the calling parameters of a batched GEMM/GEMV would look like.

TimmyLiu commented 9 years ago

Thanks, guys, for the feedback. I definitely agree a batched interface would be nice to have. @hunse @naibaf7, what would be a typical usage (number of batches, matrix size per batch) of a batched gemv and gemm?

naibaf7 commented 9 years ago

A typical value is hard to come up with, but the batch sizes can be around 256 on certain networks. Due to memory limits, though, 8 to 64 at once are reasonable for the GEMM calls. Taking AlexNet, a typical layer that runs really fast on the cuDNN library has GEMM calls with:

hunse commented 9 years ago

For me, typical usage would be many batches (from 100 to upwards of 100,000) and generally smaller matrices (as small as 1x3, often around 100x2, 500x3, possibly as large as 10000x2000, typically quite rectangular, but sometimes square as well). Since I'm doing a lot of dynamic neural network stuff, GEMM is actually much less useful for me.

naibaf7 commented 9 years ago

@hunse @TimmyLiu It seems like the interface would have to be very general... probably not very easy.

Would another option be to expose the GEMM / GEMV building blocks / kernels so they can be called from within custom kernel code, in order to achieve block-GEMM/GEMV?

hunse commented 9 years ago

For the API, I was thinking of making multiple calls to a GEMV-like function, to populate a list of all the calls for the batch, and then another function that takes the list and actually enqueues the batch. That way, the API for the individual GEMV calls can remain mostly the same.
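
Purely as a strawman (none of these functions exist in clBLAS; the names are invented), it could look something like:

    /* Hypothetical batch-building interface: append GEMV descriptions,
     * then enqueue the whole batch as a single kernel launch. */
    clblasBatch batch = clblasCreateBatch();
    for (int i = 0; i < batchSize; i++) {
        clblasBatchAppendSgemv(batch, order, transA, M[i], N[i], alpha,
            bufA[i], offA[i], lda[i],
            bufX[i], offx[i], incx[i], beta,
            bufY[i], offy[i], incy[i]);
    }
    err = clblasEnqueueBatch(batch, 1, &queue, 0, NULL, &event);
    clblasReleaseBatch(batch);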

naibaf7 commented 9 years ago

@hunse I agree, that would also work for me. But how would the underlying OpenCL kernel calls work with that? If it just enqueued the individual GEMV calls as before, it would not help. How would the vectors and matrices reside in memory? If it is only one cl_mem object, then it will be easy, but I can't imagine an easy way to handle multiple or arbitrary cl_mem objects. Then a list/array of vector/matrix offsets would also have to be provided as a cl_mem object, right?

hunse commented 9 years ago

Yes, you'd need to provide a list of offsets or pointers to the start of each memory block, as well as lists of all the shapes and strides. I see what you're getting at now. In my case, we do each operation many times, so we can allocate memory on the device to store these lists and then just refer to them each time we want to do the operation. But if we're building and executing the batched GEMV on the fly, then we'd have to allocate this memory each time, which could be costly.

Maybe we'd have to leave this up to the user, then? They'd have to allocate memory on the device, load up all the lists, and then pass this to a batched GEMV call.

naibaf7 commented 9 years ago

@hunse Exactly. Since this is a special high-performance feature anyway, it should be left to the user to create and manage the offset, stride, etc. data in cl_mem objects. The batched GEMV/GEMM calls would then take the arguments as they do now, but instead of host-memory integer arguments next to the cl_mem objects holding the actual matrix and vector storage, all arguments would be cl_mem, except for one integer that defines the batch size and tells the implementation how many entries to read from the offset, stride, etc. cl_mem objects.

Maybe @TimmyLiu has better ideas on this. Also, should the additional cl_mem objects have special properties so that they can be read quickly on the GPU? The entries (offset, stride, ...) are constant during the kernel call, if that helps.
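
To make the first paragraph concrete, a hypothetical signature along those lines (not an existing clBLAS function; the name and argument list are invented here) might be:

    /* Hypothetical batched interface: per-problem sizes, offsets and strides live in
     * caller-managed cl_mem buffers; batchSize says how many entries each one holds. */
    clblasStatus clblasSgemvBatched(
        clblasOrder order, clblasTranspose transA,
        cl_mem M, cl_mem N,                 /* per-problem dimensions           */
        cl_float alpha,
        cl_mem A, cl_mem offA, cl_mem lda,  /* matrix data + per-problem layout */
        cl_mem x, cl_mem offx, cl_mem incx,
        cl_float beta,
        cl_mem y, cl_mem offy, cl_mem incy,
        cl_uint batchSize,                  /* entries in each metadata buffer  */
        cl_uint numCommandQueues, cl_command_queue *commandQueues,
        cl_uint numEventsInWaitList, const cl_event *eventWaitList, cl_event *events);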

naibaf7 commented 9 years ago

Update: Intel MKL has added batch-GEMM support; see https://software.intel.com/en-us/articles/introducing-batch-gemm-operations for an idea of what is needed. For OpenCL it could be a bit less flexible, though, and keep the A, B, and C matrices in the same memory objects.