@abhi1212 I have not followed the very latest developments but I don't think that the OpenCL branch of Caffe supports the batched GEMM API of CLBlast. However, you can still request to process your data using batches. If you are interested, I can tell you how to do this via CK-Caffe.
/cc @cnugteren @naibaf7
@psyhtest, I would love to know how we do that. But are we referring to the number of images processed in a batch?
Also, @psyhtest, we do have a batched GEMM API for Caffe2.
@abhi1212 Please take a look at greentea_math_functions.cpp to see which CLBlast functions are called from Caffe.
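For reference, the single-image GEMM path there boils down to a CLBlast call roughly like the one below. This is only a minimal sketch of the CLBlast C++ API, not the exact code in greentea_math_functions.cpp; the function wrapper, buffer names and dimensions are illustrative.

```cpp
#include <clblast.h>

// Minimal sketch (not the exact greentea code): one SGEMM C = A*B through
// CLBlast, roughly what the im2col+GEMM convolution path issues per image.
void gemm_once(const size_t M, const size_t N, const size_t K,
               cl_mem weights_buf,   // A: filter matrix,    M x K
               cl_mem im2col_buf,    // B: im2col'ed input,  K x N
               cl_mem output_buf,    // C: output maps,      M x N
               cl_command_queue queue) {
  const auto status = clblast::Gemm<float>(
      clblast::Layout::kRowMajor,
      clblast::Transpose::kNo, clblast::Transpose::kNo,
      M, N, K,
      1.0f,
      weights_buf, 0, K,   // leading dimension of A is K
      im2col_buf, 0, N,    // leading dimension of B is N
      0.0f,
      output_buf, 0, N,    // leading dimension of C is N
      &queue);
  (void)status;            // real code should check for StatusCode::kSuccess
}
```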
You can run something like this to process several images per batch (e.g. 4):
$ ck run program:caffe --env.CK_CAFFE_BATCH_SIZE=4
So, if I understand it correctly, there is no batched version of CLBlast used there right now. Maybe we should see how we can integrate that to get better performance?
Yes, @CNugteren, we could combine the batched GEMM of CLBlast with Caffe and observe the performance.
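For what it's worth, CLBlast's batched interface takes per-batch offsets (and per-batch alphas/betas) into shared buffers, so an integration could look roughly like the sketch below. This is a hedged illustration with made-up names and offsets, not existing Caffe code.

```cpp
#include <vector>
#include <clblast.h>

// Sketch of CLBlast's GemmBatched: one call performing `batch` GEMMs of
// identical shape, addressed via per-batch offsets into shared buffers.
void gemm_batched(const size_t M, const size_t N, const size_t K,
                  const size_t batch,
                  cl_mem weights_buf, cl_mem im2col_buf, cl_mem output_buf,
                  cl_command_queue queue) {
  std::vector<float> alphas(batch, 1.0f), betas(batch, 0.0f);
  std::vector<size_t> a_offsets(batch), b_offsets(batch), c_offsets(batch);
  for (size_t i = 0; i < batch; ++i) {
    a_offsets[i] = 0;          // same filters for every image
    b_offsets[i] = i * K * N;  // each image's im2col result
    c_offsets[i] = i * M * N;  // each image's output
  }
  clblast::GemmBatched<float>(
      clblast::Layout::kRowMajor,
      clblast::Transpose::kNo, clblast::Transpose::kNo,
      M, N, K,
      alphas.data(),
      weights_buf, a_offsets.data(), K,
      im2col_buf, b_offsets.data(), N,
      betas.data(),
      output_buf, c_offsets.data(), N,
      batch, &queue);
}
```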
Currently there's no batched GEMM for OpenCL Caffe. The three options with OpenCL are: subgroup convolutions (by Intel); fused im2col+GEMM implementations (LibDNN), which do im2col of a whole batch plus GEMM in one single kernel (the data rearrangement happens in GPU local memory); and the classical im2col+GEMM, which repeats (im2col+GEMM) N times to do the whole batch.
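To illustrate the classical option, the per-image loop looks conceptually like this. The helper names im2col() and gemm() are placeholders, not Caffe's actual internals.

```cpp
// Conceptual sketch of the classical im2col+GEMM path for one conv layer.
// im2col() and gemm() stand in for Caffe's greentea/CLBlast calls.
for (int n = 0; n < N; ++n) {               // N = batch size
  // 1) Rearrange image n's input patches into a
  //    [C_in*kH*kW x H_out*W_out] column matrix.
  im2col(input + n * input_stride, col_buffer);
  // 2) Multiply by the [C_out x C_in*kH*kW] filter matrix.
  gemm(weights, col_buffer, output + n * output_stride);
}
// LibDNN instead fuses both steps for the whole batch into a single kernel,
// keeping the rearranged data in GPU local memory rather than a global buffer.
```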
Wow, that seems interesting. In the classical im2col plus GEMM, are you referring to the groups that we have in Caffe? Also, are Intel subgroup convolutions limited to CPUs? @naibaf7
@abhi1212 No, Intel subgroups are a specific technology that their GPUs have. Hence, Intel subgroup convolution is limited to Intel iGPUs, not CPUs. See here: https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt
Thanks a lot @naibaf7. Unfortunately I don't have an Intel GPU, so I cannot run Intel subgroup convolutions. Regarding the fused im2col+GEMM implementation (LibDNN): does it do the whole convolution for a single layer at once? Also, when we use CK-Caffe with LibDNN, is the fused convolution the default or do we need to specify it? Finally, about the classical convolution that repeats N times to complete a batch: where do we set this parameter N? I thought it must be the group parameter in Caffe, but it seems that is not what it is used for.
@abhi1212 For AMD GPUs you can run LibDNN convolutions, for NVIDIA GPUs you can run cuDNN convolutions, for Intel GPUs you can use Intel spatial/subgroup convolutions. Yes, LibDNN does the whole convolution at once: one kernel call does all N elements of a batch. The only GPUs that don't have a good convolution engine yet are the ones you usually see coming with ARM-based CPUs (Mali, Adreno, PowerVR)... this needs fixing.
N is not the group parameter, it's the batch count/dimension in a blob/tensor. The group parameter does something else: it divides the input and output feature maps into X groups so that the convolution operation becomes cheaper by not convolving over all feature maps at once.
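To make the distinction concrete, here is an illustrative sketch (my own names, NCHW layout assumed) of how N and the group parameter enter the GEMM dimensions of one layer:

```cpp
#include <cstdio>

// Illustrative only (NCHW layout): how the batch size N and Caffe's group
// parameter enter the im2col+GEMM dimensions of one convolution layer.
void print_gemm_shape(int N, int C_in, int C_out, int kH, int kW,
                      int H_out, int W_out, int group) {
  const int M = C_out / group;             // output feature maps per group
  const int K = (C_in / group) * kH * kW;  // reduction size, shrunk by group
  const int P = H_out * W_out;             // one column per output pixel
  // Classical path: one GEMM per image and per group.
  std::printf("runs %d GEMMs of size %d x %d x %d\n", N * group, M, P, K);
}
```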
Thanks @naibaf7. Since NVIDIA GPUs also support OpenCL, can't I run Caffe with LibDNN on NVIDIA GPUs? I have worked with cuDNN in the past and wanted to use LibDNN or CLBlast with Caffe on an NVIDIA GPU.
Also, can you please help me with the N parameter: how can we divide an entire batch into smaller chunks and then run im2col and GEMM on each? Where can I find this parameter?
@abhi1212 You can't. If you want to do im2col, it will be executed N times. It would also not make sense to do im2col for more than one image at once, since the buffer memory needed to store the im2col result gets very big, very fast. Batched GEMM therefore also has very strict limits, since the im2col result needs to be less than 2 GB in size (otherwise it will fail in 32-bit-pointer Caffe and waste a lot of memory in 64-bit-pointer Caffe, and fail beyond 4 GB on AMD GPUs, which have no support for larger OpenCL buffers). That's why cuDNN only uses a little buffer memory and LibDNN uses no buffer memory at all. You can also use LibDNN on your NVIDIA GPU, but it will be quite a bit slower than cuDNN in most cases (your mileage may vary). LibDNN can be used in both CUDA and OpenCL mode though, and LibDNN in CUDA mode is a bit faster than in OpenCL mode.
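A rough back-of-the-envelope calculation (illustrative numbers of my own, not from this thread) shows how quickly a whole-batch im2col buffer hits those limits:

```cpp
#include <cstdio>

// Rough estimate of im2col buffer size for an illustrative layer:
// a 3x3 convolution over 256 input channels with a 56x56 output needs
// 256*3*3 * 56*56 floats (~28 MB) per image; a batch of 128 would already
// need ~3.4 GB, beyond the 2 GB / 4 GB buffer limits mentioned above.
int main() {
  const long C_in = 256, kH = 3, kW = 3, H_out = 56, W_out = 56, batch = 128;
  const long per_image = C_in * kH * kW * H_out * W_out * sizeof(float);
  std::printf("per image: %.1f MiB, batch of %ld: %.2f GiB\n",
              per_image / 1048576.0, batch,
              per_image * batch / 1073741824.0);
  return 0;
}
```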
As for CK, it just provides high-level wrappers with a JSON API around Caffe to unify building and running across Linux, Windows, macOS and Android. We expose only a few parameters and leave the rest to the default build system. The current JSON params and the associated CMake flags are here:
Note that you can override CK environment variables from the command line when installing this package, e.g.:
$ ck install package:lib-caffe-bvlc-opencl-libdnn-clblast-universal --env.USE_CLBLAST=0 --env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON
We can similarly expose other knobs to control building and optimization of Caffe ...
Thanks a lot @gfursin and @naibaf7
@psyhtest @gfursin