@Fabian : how to plug your convolutional layer into other frameworks?

hughperkins commented 8 years ago

Hi Fabian @naibaf7 how to plug your convolutional layer into other frameworks? What format does it expect the tensor data to be in? Can I just send in a cl_mem object in a certain format? Presumably I'd need to provide a bunch of metadata too, like dimension sizes, strides, offset? Do you have any benchmarks comparing performance on AMD hardware versus eg clBLAS + im2col ? And for any other hardware?

If your convolutional implementations are consistently better, seems no point in duplicating effort, I might just switch my libraries to use yours :-) Should be easy to build though, cross-platform etc, ideally...

naibaf7 commented 8 years ago

@hughperkins Thanks for your interest.

I do not provide benchmarks yet as the autotuning part is not completely done. Without Autotuning, it performs a little better than OpenCL im2col on AMD hardware, and about the same as CUDA im2col on nVidia hardware (with both OpenCL and CUDA backend). But since the kernels are built similarly to CLBLast, which claim up to 2x performance increase with Autotuning on general Hardware, I expect the same from Greentea-libDNN after I am done autotuning. My current experiments confirm or surpass that expectation though, given random convolution configurations to autotune.

As for using it in other frameworks:

It currently relies on ViennaCL, but this can be removed.
It requires the device-wrapper from OpenCL Caffe, which can be stripped down though to fit into other frameworks
For the parameters, look at the following struct for the initialization:

struct LibDNNConfig {
  LibDNNConfig() :
    in_shape(3, 1),
    out_shape(3, 1),
    kernel(1, 1),
    pad(1, 0),
    stride(1, 1),
    dilation(1, 0)
  {}
  device* dev_ptr = nullptr;
  std::vector<int_tp> in_shape;
  std::vector<int_tp> out_shape;
  std::vector<int_tp> kernel;
  std::vector<int_tp> pad;
  std::vector<int_tp> stride;
  std::vector<int_tp> dilation;
  int_tp group = 1;
  bool bias_term = false;
  bool fast_unsafe_math = false;
  bool weights_backward = true;
  bool bias_backward = true;
  libdnnConvolutionWeightAlgo_t wgalgo = LIBDNN_CONVOLUTION_WG_ALGO_DIRECT;
};

The framework is expected to fill out this struct and pass it to libDNN for initialization.

For forwarding/backwarding and tuning, the following parameters:

void Forward(const Dtype* bottom_data, const Dtype* weight,
               const Dtype* bias,
               Dtype* top_data, int_tp batch_size);

  void Backward(bool prop_down_data, bool prop_down_weights,
                const Dtype* top_data, const Dtype* top_diff,
                const Dtype* weight, Dtype* weight_diff,
                const Dtype* bias, Dtype* bias_diff,
                const Dtype* bottom_data, Dtype* bottom_diff,
                int_tp batch_size);

  void Tune(Dtype* top_data, Dtype* top_diff,
            Dtype* weight, Dtype* weight_diff,
            Dtype* bias, Dtype* bias_diff,
            Dtype* bottom_data, Dtype* bottom_diff,
            int_tp batch_size);

Here Dtype* is expected to be cl_mem (aka clmem*) on OpenCL platforms and device memory pointers on CUDA hardware.

hughperkins commented 8 years ago

Ok. Some questions:

It currently relies on ViennaCL, but this can be removed.

What does it rely on ViennaCL for? If it is removed, what are the implications?

It requires the device-wrapper from OpenCL Caffe, which can be stripped down though to fit into other frameworks

Presumably your framework works only on cl_mem's right? So you could factorize this interface I guess, to strip the device-independence wrapper, in the outer layer, then pass just the cl_mem and offset to the lower-level layer?

On a related topic, how to compile it? Does it need any fancy c++11 compilers and stuff? or can it be built on eg msvc10, gcc-4 ,and so on?

For the parameters, look at the following struct for the initialization

Ok. Couple of questions:

What is 'dilation'?
For shape, I'd be expecting a 4-dimensional shape, ie NCHW, or similar. Am I right in thinking the shape is just HW? Where are N and C sizes specified?
seems to be missing an 'offset'? But perhaps this is implicit in the DType object I guess?

Whilst I hate to give up my own conv optimization efforts, I seem to be not really doing much work on optimization lately, not having an AMD card :-P , and anyway, seems no point in having a zillion redundant amd/opencl conv implementations, so I'm reasonably up for just switching to yours, and then I can contribute to optimizing yours plausibly.

Oh, one othre question: cos my current frameworks all actually run on nvidia too, despite its opencl1.1-ness. Will I be able to use your conv layer on nvidia? Or should I have like if amd { use greentea; } else { use something-else; }? In the case that it works on nvidia et al now, to what extent is this just random chance, and to what extent do you guarantee to continue to support nvidia, Intel, etc in the future?

hughperkins commented 8 years ago

(Note: I'm not saying you should support anything other than AMD, I'm just posing the question. I reckon it's probably better to have one really super insanely fast conv library for AMD, than a kind of jack-of-all-trades slow-everywhere conv library :-P If you only support AMD, I'm fine with if...else..ing around it, as long as it's not totally pain to build against, on a non-AMD platform :-) )

hughperkins commented 8 years ago

(Thinking about how to make it easy for me to use your library on amd platforms, and build my library anyway on non-amd platforms, I'm kind of ... well, I'm jumping ahead, because I dont know your answers yet :-D , but ... :

imagine you provide your library as pre-built binaries, for whatever platforms you want to support, as dlls, sos, or however plugins work well on whichever platforms you want to support
users can download an installer, which will install your library to some well-defined location, somehow (I dont want to have to hunt around for it in a zillion locations :-) )
if the user selects an AMD device (for example), I can look for your library, in the well-defined location, and load it dynamically, and use it, if available )

naibaf7 commented 8 years ago

@hughperkins OK, let me try to address all your questions:

Memory offset for Dtype* is working on nVidia CUDA with my framework, but not yet OpenCL. I might add the offset parameter if a framework needs it.
The device* wrapper is mainly there for the following purposes:
- Allocating work memory for libDNN if needed
- Abstraction to get device parameters (max work group, preferred vector width) from both CUDA and OpenCL.
The format for the shapes is C,D,H,W or C,H,W... or C,D2,D1,H,W (there is no limit to spatial dimensions). The batch size N is dynamic at forward/backward-execution time.
C++11 is required, however if you really need and want, it could be changed to use boost instead. It won't work without those.
The kernels generated work from OpenCL 1.1 and upwards and CUDA 7.5 and upwards.
I am optimizing on Titan X, GTX 980, GT650M, Intel HD4000, AMD W9100, and the benchmarks I will release after the autotuner is finished will be on these platforms.
Dilation and grouping are additional convolution parameters that I think are at the moment specific to Caffe. While grouping is mainly for saving memory and reducing the parameter count while maintaining the same number of feature maps, dilation and its usages are described by the following papers (strided kernel covolutions == dilated convolutions):
ViennaCL and NVRTC are used to launch kernels. The OpenCL kernels could alternatively be launched with native OpenCL.
If you want to have libDNN as a shared library, then ViennaCL dependency should be of no concern (header-only) anyways.

The thing with dilated convolutions in all these applications is that they are not very efficient (in im2col GEMM) at the moment, using large amounts of convolution buffer memory. My kernels also attempt to solve that, as there is no cuDNN alternative to it yet.

hughperkins commented 8 years ago

C++11 is required, however if you really need and want, it could be changed to use boost instead. It won't work without those.

Ok. For linux this is probably ok. Actually I have two libraries:

DeepCL
cltorch

cltorch is only really targeted at linux and Mac, for 99.9% of users, I think. DeepCL is heavily targeted at Windows users, using python 2.7 and python 3.4. Therefore DeepCL needs to be buildable using msvc2008 and msvc2010. Or actually, I build it with msvc2010, and the python bit in msvc2008 or msvc2010, depending on python version, and link those together.

The windows bit, ie deepcl, sounds complicated, so we could put that to one side for now. However, you could perhaps think about providing Windows 32-bit and 64-bit binaries, in an automated sustainable way? Then, it would be easy for me to migrate deepcl to use greentea convolutions too.

And on the subject of 'greentea convolutions', I only really want the convolutional layer. I dont really want to bring the entirety of opencl caffe into my projects :-) To what extent could it be possible to factorize the convolutional implementation into eg a separate repo, so it is fairly lightweight to clone it and build it?

I am optimizing on Titan X, GTX 980, GT650M, Intel HD4000, AMD W9100, and the benchmarks I will release after the autotuner is finished will be on these platforms.

Ok sounds good. Sounds quite nvidia heavy actually... I think it'd be good to have an R9 Fury and an R9 390x in those benchmarks. I'm trying to persuade my AMD contact(s) to provide some for OpenCL/hcc library development, but no success so far :-)

Memory offset for Dtype* is working on nVidia CUDA with my framework, but not yet OpenCL. I might add the offset parameter if a framework needs it.

Hmmm. Are you saying your library handles both cuda and OpenCL? That sounds ... heavy.... I guess that if I feed it a cl_mem object, there's no way to get that to run convolutions in cudnn right? So, since my libraries are 100% opencl, i'd prefer there to be no hint of cuda stuff during build, ideally. I guess you have a simple flag to entirely disable cuda build is that right?

Wait... when you say you test on nvidia, you mean, you test the opencl version on nvidia right? I'm not interested in anything that can't handle receiving the data as a cl_mem. I think :-) ?

ViennaCL and NVRTC are used to launch kernels. The OpenCL kernels could alternatively be launched with native OpenCL

Ok.

The kernels generated work from OpenCL 1.1 and upwards and CUDA 6.5 and upwards

sounds good

The format for the shapes is C,D,H,W or C,H,W... or C,D2,D1,H,W (there is no limit to spatial dimensions). The batch size N is dynamic at forward/backward-execution time.

sounds good

Dilation and grouping are additional convolution parameters that I think are at the moment specific to Caffe

Ok. Actually, groups are generally available. At lesat, torch supports them. It's possible dilation is supported by torch too, and I just didnt notice yet :-)

hughperkins commented 8 years ago

PS Why arent you testing/optimizing on R9 Fury and R9-390x? If I can somehow persuade my AMD contact(s) to provide a box with an R9 Fury or four in, could that be useful to you? Not saying I can do that, but the more people who could use such a box to do useful work, the stronger the case I can make :-D

hughperkins commented 8 years ago

(By the way, I think it'd be good for you to factorize the opencl convolutional layer into a separate repo, out of caffe. so issues and discussions on it are all together. Otherwise we'll have to cram everything into enormous single threads, as happened earlier for your and Robert's pull requests, which became both a little crazy :-P )

Edit: oh, I already said this :-D

naibaf7 commented 8 years ago

@hughperkins Actually AMD provided my W9100s (similar to R9 390X but fast in "useless" 64bit (double) DNNs) but if you need details about that contact me privately on LinkedIn please, no reason to discuss this publicly. But yes, if they can provide us with (it can also be an IOMMU virtualized remote system with SSH logins) Fury/Polaris/Vega cards this would be in their interest as it means we tune the optimizations for upcoming hardware on their side.

The CUDA part can of course be disabled at compile time. Same for the OpenCL part. nVidia chips can both be tuned on OpenCL and CUDA kernels, however my tests so far show they are almost equal (but NVRTC/NVCC takes longer to compile than ViennaCL/OpenCL). Cross compiling is not very "heavy" as all it needs is some "#define" declarations to preprocess-replace OpenCL functions with CUDA ones and then compile and execute with NVRTC instead of ViennaCL (you can look this up in the existing source code, it's already there).

There will be stand-alone binaries and sources from Caffe, stripped of the unnecessary Caffe elements. But Caffe is a very neat framework to tune the kernels in initially.

The plan for development is the following:

Write general convolution forward kernel (done)
Write general convolution backward kernel (done)
Write general convoliution weight/bias-error kernel (done)
Provide public access to Greente libDNN via Caffe (done)
Write basic autotuner structure (done)
Specify tuning parameters for forward kernel (work in progress)
Specify tuning parameters for backward and weight/bias kernels (todo)
Autotune AlexNet kernels on mentioned GPUs (due to availability) and compare to baselines within Caffe (todo)
Make library standalone and provide binaries (todo), include pre-tuned configurations
Explore possibility of enabling CK (CollectiveKnowledge) interface (crowd tuning)
Explore distribution possibilities for Win32/64

Feel free to comment on if that seems reasonable :)

fsword73 commented 8 years ago

@naibaf7. Let us setup an email group https://github.com/hughperkins/DeepCL/issues/69

bhack commented 8 years ago

Yes a google group could be interesting on this thread also for OpenCV. /cc @mtamburrano @vpisarev

bhack commented 8 years ago

@naibaf7 @hughperkins I also add @edgarriba because probably we want to experiment libdnn engine in tinycnn during the gsoc

hughperkins commented 8 years ago

Hi Fabian,

Ok, sounds good. Please let me know when you reach each of the following two milestones:

Make library standalone and provide binaries (todo), include pre-tuned configurations
Explore distribution possibilities for Win32/64

The first one (plausibly without binaries actually) is a pre-requisite for switching cltorch to use your convolutional implementation. The second one is a pre-requisite for switching DeepCL to use your convolutional implementation.

Actually AMD provided my W9100s

Yes, I need something hosted though. A card without anywehre to put it is just extra weight to carry in my backpack :-D

But yes, if they can provide us with (it can also be an IOMMU virtualized remote system with SSH logins) Fury/Polaris/Vega cards this would be in their interest as it means we tune the optimizations for upcoming hardware on their side.

Yes, that's kind of what I would like :-)

Cross compiling is not very "heavy" as all it needs is some "#define" declarations to preprocess-replace OpenCL functions with CUDA ones and then compile and execute with NVRTC instead of ViennaCL

Ok

But Caffe is a very neat framework to tune the kernels in initially.

Sure. And I think Caffe is a good framework to tune the kernels in, both now and in the future. Having said that, for example, in Torch project, the code is factorized into multiple repos like:

https://github.com/torch/torch7 underlying tensor implementation for cpus/blas
https://github.com/torch/nn underlying neural network implementation, for cpus
https://github.com/torch/cutorch tensor implementation on CUDA
https://github.com/torch/cunn neural network implementation on CUDA

Having the code factorized like this makes mixing and matching very easy. It's one reason why porting torch to get alexnet working on OpenCL was only like somewhere around 6-12 weeks work :-)

fsword73 commented 8 years ago

I have completed the 1st optimized version of backPropWeights. Which is 500x faster than deepCL's original one. I am not CNN expert while I have 10 years' experience on GPU performance analysis and tuning. Please find it @ https://github.com/fsword73/CNNKernelPerfTest/blob/master/CL/test_kernel_backpropweights_fast.cl

hughperkins commented 8 years ago

Hi @fsword73

That sounds humblingly impressive :-) I'll try try to try this out this weekend, and report my results. By the way, is this optimization general to eg NVIDIA, or specific to AMD? I actually strongly prefer some optimizations that give really excellent performance on AMD, but not that useful elsewhere, but .... I only have NVIDIA cards to run on :-P So if it is general, I can try it easily, otherwise I will have to ... I'm not sure.... I've been poking AMD for like a year and a half to get like one single cloud AMD gpu, but, not getting very far so far :-P

hughperkins commented 8 years ago

Hi Fabian,

I'm writing a paper about cltorch, which I plan to throw onto arxiv. I have a section about how cltorch compares with other OpenCL implementations, which I'm citing your OpenCL caffe implementation in. Question, what algorithm(s) are you using? I believe originally you were using im2col + viennacl blas? And currently?

naibaf7 commented 8 years ago

The current version supports:

im2col + ViennaCL BLAS
im2col + clBLAS
im2col + CLBlast
im2col + cBLAS (open, atlas, mkl for OpenCL CPU)
libDNN direct convolution (implicit GEMM)
Intel spatial direct convolution
Intel FFT convolutions + clFFT

However I do not have up-to-date benchmarks of all variants at the moment... that'd be a huge effort to go through.

hughperkins commented 8 years ago

Whoa, cool :-O

Can you clarify which of these will be used when the data is stored in OpenCL buffers, ie referenced via cl_mem structs? For example, I guess that cBLAS is x86 CPU implementation, rather than running on a GPU, is that right?

libDNN direct convolution (implicit GEMM)

When you say 'direct convolution', do you mean that it is using im2col + gemm, but that the im2col bit is not carried out in serial with the gemm, but somehow at the same time as the gemm, so that the memory usage is much smaller than im2col + gemm. In any case, I havent explained it very well (probably because I dont fully understand it fully, if I'm fair), but the algorithm detailed in "cuDNN: Efficient Primitives for Deep Learning", is that right?

Intel spatial direct convolution Intel FFT convolutions + clFFT

What hardware do these target? It sounds like it is Intel-specific? is it only for Intel CPUs? Or also for Intel HD Integrated GPUs?

How does selection between these algorithms take place?

bhack commented 8 years ago

/cc @nyanp

naibaf7 commented 8 years ago

@hughperkins Currently, the algorithm selection sucks: If you are on Intel hardware, the Intel spatial kernel is enforced, however if you select to compile with clFFT, this one will be used. On other hardware, libDNN is used if you compile with it, otherwise it falls back to im2col + the BLAS library you choose to compile with. The convolution method ("engine") can alternatively be selected/overwritten in the network prototxt file.

Intel prepared the algorithms to also work on other hardware, however they are mainly targeted at Intel GPUs, such as Iris Pro.

im2col + cBLAS is targeted at CPUs, however it uses the network kernels (inclusive im2col) in OpenCL mode for more parallelism than standard Caffe. The memory (cl_mem structs) is mapped to host pointers in order to call the cBLAS functions.

"direct convolution (implicit GEMM)" is, as you point out, similar to the cuDNN paper, however I don't know how similar since I don't know their exact code. My version uses no additional memory in the current versions (atomic and direct weight update). I might do another version with reduction weight update, that will use a little extra memory.

bhack commented 8 years ago

Yes with all this backends an engine selection it is becoming quite strategic.

hughperkins commented 8 years ago

Fabian,

Cool. Thanks! :-) Question: for AMD hardware, and running say VGG A, which implementation(s) will tend to be used and/or produce the fastest timings?

By the way, in the current convnet-benchmarks timings for greentea, am I right in thinking that was using im2col + viennacl blas?

Hugh

naibaf7 commented 8 years ago

@hughperkins Yeah, those were using im2col + ViennaCL, the slowest possible combination by now.

@hughperkins I just released a new version with the finished atomic weight update kernel, GEMM vectorization and autotuner, if you want to benchmark it. I'll still have to look a bit further into nVidias VisualProfiler and AMDs CodeXL to see for further optimization possibilities, but it looks quite good so far.

I could run the convnet-benchmarks on a, sadly thermally limited, W9100 if you want and post it here? I can't do Titan X benchmarks right now and the GTX 980 is sadly not enough memory for the official benchmark.

naibaf7 commented 8 years ago

@hughperkins New kernel algorithms are available (for gradients, both data and weights):

typedef enum {
  // Stack the batch update into one GEMM block
  // (deterministic, 1 kernel call)
  // Serializes the batch and may therefore under use
  // the GPUs compute units.
  LIBDNN_CONVOLUTION_WG_ALGO_DIRECT        = 0,
  // Use multiple GEMM blocks in parallel and update weights atomically
  // (non deterministic, 1 kernel call, not supported on all devices)
  // Parallelizes the batch and has therefore higher GPU usage.
  LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC        = 1
} libdnnConvolutionWeightAlgo_t;

typedef enum {
  // Transform data before GEMM (load, im2col, gemm, store)
  // This method is suitable for convolutions with similar
  // spatial input == output sizes, but can become inefficient
  // if input >> output (with large strides and kernels).
  LIBDNN_CONVOLUTION_BW_ALGO_IM2COL        = 0,
  // Transform data after GEMM (load, gemm, col2im, store)
  // Sometimes faster than im2col method, but uses
  // atomic operations and is not deterministic.
  LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC = 1
} libdnnConvolutionBackwardAlgo_t;

I also asked @soumith to do the convnet benchmarks on it. The configuration is not autotuned for the Titan X yet but should give a general hint on if I'm on the right track or not.

hughperkins commented 8 years ago

Fabian,

Nice! Actually, I believe your convolutions are fast. The only thing I need really is to ensure that using your convolutional implementation wont make building my own frameworks really painful :-)

Actually, if you reckon it's possible for me to add your convolutions now, without much pain, one option could be to fork https://github.com/hughperkins/clnn , and show how it would look like with your convolutions slotted in.

Otherwise, I guess what I'd need is something like that the convolution implementation is:

in a standalone repo
can build without needing cuda etc, just opencl library as a dependency ideally (or at least: as few dependnecies as possible and/or the dependencies are easy to build/install, on as many platforms as possible)
accepts cl_mem objects
buildable easily on ubuntu 14.04 and Mac, with default compilers

I'm tempted to say it should accept tensors in the exact same format that torch provides, ie same layout, and with similar metadata (sizes, strides, offset). but, actually, the time to rearrange the tensor is probably small compared to convolution time? eg I think actually eg Scott Gray dim-shuffles the tensors around anyway for Winograd eg https://github.com/soumith/convnet-benchmarks/issues/93#issuecomment-192747077 . So I dont think the actual tensor layout and metadata of your library are too critical perhaps? But it should accept cl_mems, otherwise I have no (efficient) way of providing data to it :-)

In other news, for your implicit_gemm algorithm, as far as citing it in my cltorch paper, question: how does this work in terms of gemm? Do you have a custom gemm implementation? Or you use a blas underneath that? Or something else?

naibaf7 commented 8 years ago

@hughperkins We'll see about how fast (or not) they are when soumith benchmarks it (I made a PR there).

Yeah I can provide a standalone soon, maybe also with more tensor format input options.

The implicit GEMM is loosely based on GEMM samples from here:

and is also autotuneable together with additional convolution-specific parameters.

hughperkins commented 8 years ago

The implicit GEMM is loosely based on GEMM samples from [...] https://github.com/CNugteren/CLBlast

Ah, interesting. Thanks! :-)

Yeah I can provide a standalone soon, maybe also with more tensor format input options.

Ok, sounds good :-) I think the standalone bit is more important than the tensor input options. Copying stuff should be really fast compared to convolution.

and is also autotuneable together with additional convolution-specific parameters.

Sounds good :-)

bhack commented 8 years ago

Nice plan!

fmassa commented 8 years ago

@hughperkins FYI, Caffe and Torch have the same data layout, no need to dimshuffle.

hughperkins commented 8 years ago

@fmassa

True, but per the above, storageOffset is not currently supported. Which is ok, and I'm just clarifying that it's ok, and generalizing that even the layout being different would be ok, if this provides a performance benefit in the convolutional implementation.

naibaf7 commented 8 years ago

@hughperkins Different layouts can give a performance benefit, but they need different kernel parameters & autotuning than the default configuration I am working with now. So this would definitely be a feature to be added later. storageOffset can also influence memory alignment, and therefore reduce performance. But I will probably include it as a feature anyways.

bhack commented 8 years ago

BVLC / caffe

@Fabian : how to plug your convolutional layer into other frameworks? #4155