CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0
1.06k stars 204 forks source link

Add col2im function #270

Closed vbkaisetsu closed 5 years ago

vbkaisetsu commented 6 years ago

This is a feature request.

I am writing a neural net toolkit. Now we can write a convolution function using im2col and GEMM functions, but col2im is also necessary to calculate the back propagation.

Could you support col2im?

Ulfgard commented 6 years ago

I don't see why this is necessary. the backpropagation wrt the input of the convolution is another convolution. the backpropagation wrt weights is naively just transposing the im2col matrix of the input image and perform gemm with the backpropagated derivatives. So this is already implemented.

//edit this follows directly from the convolution being a linear operation. Let us define p_i = patch(i,x) as the i-th row of the im2col matrix P(x) of image x. The convolution of the ith pixel for a single filter with weights w can thus be written as

f_i(x) = w^T patch(i,x) = w^T p_i

and the convolution wrt all pixels is f(x) = P(x) w

and the gradient wrt to w is nabla_w f_i(x) = p_i

Now, we can write backpropagation using a weight vector c as sum_i c_i d/dw f_i(x) = sum_i c_i p_i = c^T P(x)

so, no other matrix is needed.

vbkaisetsu commented 6 years ago

Of course I know both dx/dy and dw/dy can be written via im2col, because the convolution operation is swappable. However, dx/dy takes a huge amount of memory.

Without col2im, dw/dy takes O(size(w) size(x)) and dx/dy takes O(size(x) size(x)). Normally, size(x) is much larger than size(w). If we use col2im, dx/dy only takes O(size(w) size(x*)).

We can find many conv2d implementations using im2col+col2im, e.g.: https://wiseodd.github.io/techblog/2016/07/16/convnet-conv-layer/

Ulfgard commented 6 years ago

After a lot of thinking, I see. col2im is required for strides that are not 1. Otherwise this should be equivalent to carefully chosen convolutions.

Ulfgard commented 6 years ago

is there any good material on efficient im2col implementations? The only one i found is https://github.com/strin/caffe-opencl/blob/master/src/caffe/greentea/cl_kernels/im2col.cl which should have the performance of naive convolution(i.e. bad).

I am still pretty sure that the way outlined in the links above is not an optimal way. and rephrasing the derivative as some convolution with simple post-processing should give some better results.

Assuming strides divide filter sizes and image sizes, for simplicity, I think strides can be handled by creating blocks of results, i.e. we transform the forward propagation filters into (channels stride_w stride_h) filters of sizes kernel_h/stride_h x kernel_w/stride_w x num_filters.With carefully chosen convolution parameters, we should take the derivatives we want to backpropagate and convolve them with this filters. In the end a post-processing step is needed to reorder the output of size size_h/stride_h x size_w / stride_w x (channels stride_w stride_h) into the desired size_h x size_w x channels. the w-coordinate is a trivial reshape, the size_h operation is a bit of work.

I have not a 100% bulletproof idea about dilation, but i think it should also map on dilation, but dilation+strides is a nightmare.

sivagnanamn commented 6 years ago

OpenCV col2im Just for reference. Haven't checked the performance yet.

Edit 1: @Ulfgard Updated col2im link. Please edit your previous comment (im2col => col2im)

Ulfgard commented 6 years ago

we need col2im here. that function is at line 46 of the file i linked. col2im is much more complicated, because it is essentially a cross-correlation (sorry i wrote im2col yesterday...)

vbkaisetsu commented 6 years ago

Here is my implementation of col2im. This implementation uses GCD and LCM to reduce if in the loop.

This code has a bug. See below.

inline int grid_ceil(const int x, const int step) {
  return x > 0 ? ((x - 1) / step + 1) * step : x / step * step;
}

/*
  NOTE: stride_h * stride_bez_h + dilation_h * dilation_bez_h = gcd_h = GCD(stride_h, dilation_h)
        stride_w * stride_bez_w + dilation_w * dilation_bez_w = gcd_w = GCD(stride_w, dilation_w)
  We can solve stride_bez_{h,w} and dilation_bez_{h,w} using the Euclidean algorithm. 
 */
kernel void col2im_kernel(
    const int input_h, const int input_w, const int channels,
    const int output_h, const int output_w,
    const int kernel_h, const int kernel_w,
    const int pad_h, const int pad_w,
    const int stride_h, const int stride_w,
    const int dilation_h, const int dilation_w,
    const int stride_bez_h, const int stride_bez_w,
    const int dilation_bez_h, const int dilation_bez_w,
    const int gcd_h, const int gcd_w,
    const global float *col_buffer, const int col_offset,
    __global float *im_buffer, const int im_offset) {

  const int x_x = get_global_id(0);
  const int x_y = ((int) get_global_id(1)) % input_h;
  const int channel = ((int) get_global_id(1)) / input_h;
  const int col_channel_shift = channel * kernel_w * kernel_h * output_h * output_w + col_offset;
  const int x_channel_shift = channel * input_h * input_w + im_offset;

  const int gcd_scale_h = (x_y + pad_h) / gcd_h;
  const int t_y_step = stride_h * dilation_h / gcd_h;
  const int t_y_begin = grid_ceil(max(-stride_bez_h * gcd_scale_h * stride_h,
                                      (dilation_bez_h * gcd_scale_h - kernel_h + 1) * dilation_h),
                                  t_y_step);
  const int t_y_end = min((output_h - stride_bez_h * gcd_scale_h) * stride_h,
                          (dilation_bez_h * gcd_scale_h + 1) * dilation_h);

  const int gcd_scale_w = (x_x + pad_w) / gcd_w;
  const int t_x_step = stride_w * dilation_w / gcd_w;
  const int t_x_begin = grid_ceil(max(-stride_bez_w * gcd_scale_w * stride_w,
                                      (dilation_bez_w * gcd_scale_w - kernel_w + 1) * dilation_w),
                                  t_x_step);
  const int t_x_end = min((output_w - stride_bez_w * gcd_scale_w) * stride_w,
                          (dilation_bez_w * gcd_scale_w + 1) * dilation_w);

  if (x_x < input_w && channel < channels) {
    float val = 0;
    for (int t_y = t_y_begin; t_y < t_y_end; t_y += t_y_step) {
      for (int t_x = t_x_begin; t_x < t_x_end; t_x += t_x_step) {
        const int w_y = -t_y / dilation_h + dilation_bez_h * gcd_scale_h;
        const int y_y = t_y / stride_h + stride_bez_h * gcd_scale_h;
        const int w_x = -t_x / dilation_w + dilation_bez_w * gcd_scale_w;
        const int y_x = t_x / stride_w + stride_bez_w * gcd_scale_w;
        val += col_buffer[col_channel_shift
                            + (w_x + w_y * kernel_w) * output_h * output_w
                            + y_y * output_w
                            + y_x];
      }
    }
    im_buffer[x_channel_shift + x_y * input_w + x_x] += val;
  }
}

Performance test results:

Intel® HD Graphics 620 Kaby Lake GT2 (beignet) image 512x512, kernel 16x16, 500 iters padding 8x8

stride_w dilation_w w/o GCD w/ GCD
1 1 18.8 sec 12.7 sec
2 1 15.5 sec 6.7 sec
2 2 15.2 sec 11.3 sec
2 3 15.1 sec 6.5 sec
2 4 14.8 sec 10.1 sec
llhe commented 6 years ago

@vbkaisetsu To run neural nets on mobile phones with OpenCL support, you can checkout MACE, which supports Adreno, Mali and PowerVR GPUs. Here are some benchmark results.

vbkaisetsu commented 6 years ago

@llhe Thank you very much. MACE contains many kernel codes. I am not looking for other tools to run NNs on mobile platforms. We are just writing a new toolkit called primitiv. We can use MACE as a reference now.

TaihuLight commented 6 years ago

@vbkaisetsu I think your parameters are too many, such as stride_bez_h, const int stride_bez_w, const int dilation_bez_h, const int dilation_bez_w, const int gcd_h, const int gcd_w and the following implementation can be ref.. https://github.com/AlexeyAB/darknet/blob/master/src/col2im_kernels.cu https://github.com/BVLC/caffe/blob/master/src/caffe/util/im2col.cu

vbkaisetsu commented 6 years ago

@TaihuLight darknet does not consider dilations, so I ported caffe's implementation to OpenCL, and compared it with mine:

Intel® HD Graphics 620 Kaby Lake GT2 (Intel(R) OpenCL Platform) image: 512x512, kernel: 16x16, padding: 8x8, 500 iters.

stride_{w,h} dilation_{w,h} caffe My implementation
1 1 15576 ms 12510 ms
1 2 20419 ms 11317 ms
1 3 26323 ms 10819 ms
1 4 33018 ms 10114 ms
2 1 4032 ms 3290 ms
2 2 6369 ms 3010 ms
2 3 7835 ms 3072 ms
2 4 9532 ms 2564 ms
3 1 2033 ms 1670 ms
3 2 3106 ms 1644 ms
3 3 4268 ms 1298 ms
3 4 4898 ms 1455 ms

The previous code has a bug, so I fixed it. The full source code is here: https://gist.github.com/vbkaisetsu/a98299df827f9a5245635f646c1d94be

My implementation has a lot of arguments related to GCD/LCM, but it is faster.

CNugteren commented 6 years ago

^^ Is it an idea to move this discussion elsewhere? I don't see too much CLBlast related stuff going on for now, right? Of course we can still keep this issue open of course regarding adding col2im to the library.

vbkaisetsu commented 6 years ago

Okay. Should I send a pull request for col2im?

TaihuLight commented 6 years ago

@CNugteren CLblast has implemented im2col, but col2im does not. I think this issue is useful for implementing col2im API for CLblast. @vbkaisetsu Could you implement col2im with the same API with xIM2COL in CLblast, then pull request it?

template <typename T>
StatusCode Im2col(const size_t channels, const size_t height, const size_t width, const size_t kernel_h, const size_t kernel_w, const size_t pad_h, const size_t pad_w, const size_t stride_h, const size_t stride_w, const size_t dilation_h, const size_t dilation_w,
                  const cl_mem im_buffer, const size_t im_offset,
                  cl_mem col_buffer, const size_t col_offset,
                  cl_command_queue* queue, cl_event* event)

@vbkaisetsu You also can provide batched col2im kernel for CLblast API by adding const size_t batch_count parameter to col2im as follow:

template <typename T>
StatusCode Im2col(const size_t channels, const size_t height, const size_t width, const size_t kernel_h, const size_t kernel_w, const size_t pad_h, const size_t pad_w, const size_t stride_h, const size_t stride_w, const size_t dilation_h, const size_t dilation_w,
                  const cl_mem im_buffer, const size_t im_offset,
                  cl_mem col_buffer, const size_t col_offset,
                  const size_t batch_count,
                  cl_command_queue* queue, cl_event* event)
CNugteren commented 6 years ago

Ah ok, makes sense. What I can do is lay the groundwork for col2im (API, test, etc) and then you can afterwards make a PR for the kernel, ok?

vbkaisetsu commented 6 years ago

@TaihuLight, @CNugteren ok.

CNugteren commented 6 years ago

I know it has been a while, my apologies. However, I have managed to get the groundwork in place for col2im in the CLBlast-270-col2im branch. I actually even put in @vbkaisetsu's kernel code, but it is not 100% ready, I still need your input. It seems I'm not 100% sure of the algorithm, since I didn't expect any additions being done, I would expect it to be the inverse of im2col, but it isn't apparently.

@vbkaisetsu: Could you have a look at the code and make a PR to fix/improve it? Most notably in these two places:

Do not hesitate to ask me if you have questions. I'm also happy to continue working on it myself if you don't have the time, let me know. But I hope there is not much work, just some small fixes required, ideally from someone who understands the algorithm :smiley:

Goal is to get the clblast_test_xcol2im test to run without errors.

CNugteren commented 5 years ago

Thanks a lot for your contribution. Now that the PR is merged in, I have fixed a last issue in the tests related to the half-precision use-case. That works now on my machine, maybe you can also test with the latest CLBlast-270-col2im branch?

I'll do some final checks myself as well in the coming days on some devices, and then it is ready to be merged into master.

vbkaisetsu commented 5 years ago

I checked the latest tests in my environment. Looks good to me. Thanks!

vbkaisetsu commented 5 years ago

In the current implementation, I used the assignment operator to overwrite im_buffer as follows: im_buffer[input_index + im_offset] = val;

However, in my considered opinion, the addition assignment operator is better: im_buffer[input_index + im_offset] += val;

This function may be used for the back-propagation step of convolutional neural networks. In that case, the latter is useful for accumulating errors.

How do you think about it?

CNugteren commented 5 years ago

I'm not an expert nor potential user of this function, so if you think it is more useful that way? In theory both use-cases could be handled with either a simple add-kernel afterwards or a zero-initialisation of a buffer beforehand. Probably the second is easier/cheaper, thus favoring the += val option.

That would make im_buffer both an input as well as an output. A bit strange given the name perhaps, but if it is the most common use-case and if we document it right.... feel free to make another PR to the branch.

Ulfgard commented 5 years ago

I am in favour of += as this can potentially save memory.

I will try to incorporate this API as soon as it is included in a release :)

CNugteren commented 5 years ago

OK! I just made the changes as you requested in https://github.com/CNugteren/CLBlast/commit/6f67525ea693d0761c479b060c04ce93d408beb5

I've done some tests on various devices, it seems all good. I'll make a PR to merge the branch into master, and will do so after you approve of the above change.

CNugteren commented 5 years ago

Since this is merged into master now (thanks to @vbkaisetsu), I'm closing this issue.