Closed vbkaisetsu closed 5 years ago
I don't see why this is necessary. the backpropagation wrt the input of the convolution is another convolution. the backpropagation wrt weights is naively just transposing the im2col matrix of the input image and perform gemm with the backpropagated derivatives. So this is already implemented.
//edit this follows directly from the convolution being a linear operation. Let us define p_i = patch(i,x) as the i-th row of the im2col matrix P(x) of image x. The convolution of the ith pixel for a single filter with weights w can thus be written as
f_i(x) = w^T patch(i,x) = w^T p_i
and the convolution wrt all pixels is f(x) = P(x) w
and the gradient wrt to w is nabla_w f_i(x) = p_i
Now, we can write backpropagation using a weight vector c as sum_i c_i d/dw f_i(x) = sum_i c_i p_i = c^T P(x)
so, no other matrix is needed.
Of course I know both dx/dy and dw/dy can be written via im2col, because the convolution operation is swappable. However, dx/dy takes a huge amount of memory.
Without col2im, dw/dy takes O(size(w) size(x)) and dx/dy takes O(size(x) size(x)). Normally, size(x) is much larger than size(w). If we use col2im, dx/dy only takes O(size(w) size(x*)).
We can find many conv2d implementations using im2col+col2im, e.g.: https://wiseodd.github.io/techblog/2016/07/16/convnet-conv-layer/
After a lot of thinking, I see. col2im is required for strides that are not 1. Otherwise this should be equivalent to carefully chosen convolutions.
is there any good material on efficient im2col implementations? The only one i found is https://github.com/strin/caffe-opencl/blob/master/src/caffe/greentea/cl_kernels/im2col.cl which should have the performance of naive convolution(i.e. bad).
I am still pretty sure that the way outlined in the links above is not an optimal way. and rephrasing the derivative as some convolution with simple post-processing should give some better results.
Assuming strides divide filter sizes and image sizes, for simplicity, I think strides can be handled by creating blocks of results, i.e. we transform the forward propagation filters into (channels stride_w stride_h) filters of sizes kernel_h/stride_h x kernel_w/stride_w x num_filters.With carefully chosen convolution parameters, we should take the derivatives we want to backpropagate and convolve them with this filters. In the end a post-processing step is needed to reorder the output of size size_h/stride_h x size_w / stride_w x (channels stride_w stride_h) into the desired size_h x size_w x channels. the w-coordinate is a trivial reshape, the size_h operation is a bit of work.
I have not a 100% bulletproof idea about dilation, but i think it should also map on dilation, but dilation+strides is a nightmare.
OpenCV col2im Just for reference. Haven't checked the performance yet.
Edit 1: @Ulfgard Updated col2im link. Please edit your previous comment (im2col => col2im)
we need col2im here. that function is at line 46 of the file i linked. col2im is much more complicated, because it is essentially a cross-correlation (sorry i wrote im2col yesterday...)
Here is my implementation of col2im.
This implementation uses GCD and LCM to reduce if
in the loop.
This code has a bug. See below.
inline int grid_ceil(const int x, const int step) {
return x > 0 ? ((x - 1) / step + 1) * step : x / step * step;
}
/*
NOTE: stride_h * stride_bez_h + dilation_h * dilation_bez_h = gcd_h = GCD(stride_h, dilation_h)
stride_w * stride_bez_w + dilation_w * dilation_bez_w = gcd_w = GCD(stride_w, dilation_w)
We can solve stride_bez_{h,w} and dilation_bez_{h,w} using the Euclidean algorithm.
*/
kernel void col2im_kernel(
const int input_h, const int input_w, const int channels,
const int output_h, const int output_w,
const int kernel_h, const int kernel_w,
const int pad_h, const int pad_w,
const int stride_h, const int stride_w,
const int dilation_h, const int dilation_w,
const int stride_bez_h, const int stride_bez_w,
const int dilation_bez_h, const int dilation_bez_w,
const int gcd_h, const int gcd_w,
const global float *col_buffer, const int col_offset,
__global float *im_buffer, const int im_offset) {
const int x_x = get_global_id(0);
const int x_y = ((int) get_global_id(1)) % input_h;
const int channel = ((int) get_global_id(1)) / input_h;
const int col_channel_shift = channel * kernel_w * kernel_h * output_h * output_w + col_offset;
const int x_channel_shift = channel * input_h * input_w + im_offset;
const int gcd_scale_h = (x_y + pad_h) / gcd_h;
const int t_y_step = stride_h * dilation_h / gcd_h;
const int t_y_begin = grid_ceil(max(-stride_bez_h * gcd_scale_h * stride_h,
(dilation_bez_h * gcd_scale_h - kernel_h + 1) * dilation_h),
t_y_step);
const int t_y_end = min((output_h - stride_bez_h * gcd_scale_h) * stride_h,
(dilation_bez_h * gcd_scale_h + 1) * dilation_h);
const int gcd_scale_w = (x_x + pad_w) / gcd_w;
const int t_x_step = stride_w * dilation_w / gcd_w;
const int t_x_begin = grid_ceil(max(-stride_bez_w * gcd_scale_w * stride_w,
(dilation_bez_w * gcd_scale_w - kernel_w + 1) * dilation_w),
t_x_step);
const int t_x_end = min((output_w - stride_bez_w * gcd_scale_w) * stride_w,
(dilation_bez_w * gcd_scale_w + 1) * dilation_w);
if (x_x < input_w && channel < channels) {
float val = 0;
for (int t_y = t_y_begin; t_y < t_y_end; t_y += t_y_step) {
for (int t_x = t_x_begin; t_x < t_x_end; t_x += t_x_step) {
const int w_y = -t_y / dilation_h + dilation_bez_h * gcd_scale_h;
const int y_y = t_y / stride_h + stride_bez_h * gcd_scale_h;
const int w_x = -t_x / dilation_w + dilation_bez_w * gcd_scale_w;
const int y_x = t_x / stride_w + stride_bez_w * gcd_scale_w;
val += col_buffer[col_channel_shift
+ (w_x + w_y * kernel_w) * output_h * output_w
+ y_y * output_w
+ y_x];
}
}
im_buffer[x_channel_shift + x_y * input_w + x_x] += val;
}
}
Performance test results:
Intel® HD Graphics 620 Kaby Lake GT2 (beignet) image 512x512, kernel 16x16, 500 iters padding 8x8
stride_w |
dilation_w |
w/o GCD | w/ GCD |
---|---|---|---|
1 | 1 | 18.8 sec | 12.7 sec |
2 | 1 | 15.5 sec | 6.7 sec |
2 | 2 | 15.2 sec | 11.3 sec |
2 | 3 | 15.1 sec | 6.5 sec |
2 | 4 | 14.8 sec | 10.1 sec |
@vbkaisetsu To run neural nets on mobile phones with OpenCL support, you can checkout MACE, which supports Adreno, Mali and PowerVR GPUs. Here are some benchmark results.
@llhe Thank you very much. MACE contains many kernel codes. I am not looking for other tools to run NNs on mobile platforms. We are just writing a new toolkit called primitiv. We can use MACE as a reference now.
@vbkaisetsu I think your parameters are too many, such as stride_bez_h, const int stride_bez_w, const int dilation_bez_h, const int dilation_bez_w, const int gcd_h, const int gcd_w and the following implementation can be ref.. https://github.com/AlexeyAB/darknet/blob/master/src/col2im_kernels.cu https://github.com/BVLC/caffe/blob/master/src/caffe/util/im2col.cu
@TaihuLight darknet does not consider dilations, so I ported caffe's implementation to OpenCL, and compared it with mine:
Intel® HD Graphics 620 Kaby Lake GT2 (Intel(R) OpenCL Platform) image: 512x512, kernel: 16x16, padding: 8x8, 500 iters.
stride_{w,h} |
dilation_{w,h} |
caffe | My implementation |
---|---|---|---|
1 | 1 | 15576 ms | 12510 ms |
1 | 2 | 20419 ms | 11317 ms |
1 | 3 | 26323 ms | 10819 ms |
1 | 4 | 33018 ms | 10114 ms |
2 | 1 | 4032 ms | 3290 ms |
2 | 2 | 6369 ms | 3010 ms |
2 | 3 | 7835 ms | 3072 ms |
2 | 4 | 9532 ms | 2564 ms |
3 | 1 | 2033 ms | 1670 ms |
3 | 2 | 3106 ms | 1644 ms |
3 | 3 | 4268 ms | 1298 ms |
3 | 4 | 4898 ms | 1455 ms |
The previous code has a bug, so I fixed it. The full source code is here: https://gist.github.com/vbkaisetsu/a98299df827f9a5245635f646c1d94be
My implementation has a lot of arguments related to GCD/LCM, but it is faster.
^^ Is it an idea to move this discussion elsewhere? I don't see too much CLBlast related stuff going on for now, right? Of course we can still keep this issue open of course regarding adding col2im to the library.
Okay. Should I send a pull request for col2im?
@CNugteren CLblast has implemented im2col, but col2im does not. I think this issue is useful for implementing col2im API for CLblast. @vbkaisetsu Could you implement col2im with the same API with xIM2COL in CLblast, then pull request it?
template <typename T>
StatusCode Im2col(const size_t channels, const size_t height, const size_t width, const size_t kernel_h, const size_t kernel_w, const size_t pad_h, const size_t pad_w, const size_t stride_h, const size_t stride_w, const size_t dilation_h, const size_t dilation_w,
const cl_mem im_buffer, const size_t im_offset,
cl_mem col_buffer, const size_t col_offset,
cl_command_queue* queue, cl_event* event)
@vbkaisetsu You also can provide batched col2im kernel for CLblast API by adding const size_t batch_count parameter to col2im as follow:
template <typename T>
StatusCode Im2col(const size_t channels, const size_t height, const size_t width, const size_t kernel_h, const size_t kernel_w, const size_t pad_h, const size_t pad_w, const size_t stride_h, const size_t stride_w, const size_t dilation_h, const size_t dilation_w,
const cl_mem im_buffer, const size_t im_offset,
cl_mem col_buffer, const size_t col_offset,
const size_t batch_count,
cl_command_queue* queue, cl_event* event)
Ah ok, makes sense. What I can do is lay the groundwork for col2im (API, test, etc) and then you can afterwards make a PR for the kernel, ok?
@TaihuLight, @CNugteren ok.
I know it has been a while, my apologies. However, I have managed to get the groundwork in place for col2im in the CLBlast-270-col2im
branch. I actually even put in @vbkaisetsu's kernel code, but it is not 100% ready, I still need your input. It seems I'm not 100% sure of the algorithm, since I didn't expect any additions being done, I would expect it to be the inverse of im2col, but it isn't apparently.
@vbkaisetsu: Could you have a look at the code and make a PR to fix/improve it? Most notably in these two places:
Do not hesitate to ask me if you have questions. I'm also happy to continue working on it myself if you don't have the time, let me know. But I hope there is not much work, just some small fixes required, ideally from someone who understands the algorithm :smiley:
Goal is to get the clblast_test_xcol2im
test to run without errors.
Thanks a lot for your contribution. Now that the PR is merged in, I have fixed a last issue in the tests related to the half-precision use-case. That works now on my machine, maybe you can also test with the latest CLBlast-270-col2im
branch?
I'll do some final checks myself as well in the coming days on some devices, and then it is ready to be merged into master.
I checked the latest tests in my environment. Looks good to me. Thanks!
In the current implementation, I used the assignment operator to overwrite im_buffer
as follows:
im_buffer[input_index + im_offset] = val;
However, in my considered opinion, the addition assignment operator is better:
im_buffer[input_index + im_offset] += val;
This function may be used for the back-propagation step of convolutional neural networks. In that case, the latter is useful for accumulating errors.
How do you think about it?
I'm not an expert nor potential user of this function, so if you think it is more useful that way? In theory both use-cases could be handled with either a simple add-kernel afterwards or a zero-initialisation of a buffer beforehand. Probably the second is easier/cheaper, thus favoring the += val
option.
That would make im_buffer
both an input as well as an output. A bit strange given the name perhaps, but if it is the most common use-case and if we document it right.... feel free to make another PR to the branch.
I am in favour of += as this can potentially save memory.
I will try to incorporate this API as soon as it is included in a release :)
OK! I just made the changes as you requested in https://github.com/CNugteren/CLBlast/commit/6f67525ea693d0761c479b060c04ce93d408beb5
I've done some tests on various devices, it seems all good. I'll make a PR to merge the branch into master, and will do so after you approve of the above change.
Since this is merged into master now (thanks to @vbkaisetsu), I'm closing this issue.
This is a feature request.
I am writing a neural net toolkit. Now we can write a convolution function using im2col and GEMM functions, but col2im is also necessary to calculate the back propagation.
Could you support col2im?