/cc @bhack
@edgarriba Ok I will look into it. Meanwhile, what performance numbers do you get with the default kernel configuration?
@naibaf7
After calling Forward(), I read back the data from memory and all I get are zeros:
https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L137-L145
Is there any way to check that the kernel compilation is fine? Or what other checks could I do?
@edgarriba Oh okay, if the tuning fails and the forward as well, then it seems like the kernel has not been compiled at all. In OpenCL mode, https://github.com/naibaf7/libdnn/blob/master/src/libdnn.cpp#L1624 would be the critical point.
After that line, you can use this code to dump the PTX code and check if it is generated correctly:
size_t bin_sz;
// Query the size of the program binary (PTX on NVIDIA OpenCL), then fetch it.
clGetProgramInfo(ocl_program_.handle().get(), CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_sz, NULL);
unsigned char *bin = (unsigned char *)malloc(bin_sz); // NOLINT
clGetProgramInfo(ocl_program_.handle().get(), CL_PROGRAM_BINARIES, sizeof(unsigned char *), &bin, NULL);
// Write the binary to disk so it can be inspected.
FILE* fp = fopen("libdnn_conv_opencl.ptx", "wb");
fwrite(bin, sizeof(char), bin_sz, fp);
fclose(fp);
free(bin); // NOLINT
Another way to get debug output for all ViennaCL-related stuff is to define VIENNACL_DEBUG_ALL, either with -DVIENNACL_DEBUG_ALL when compiling LibDNN, or by defining VIENNACL_DEBUG_ALL in the LibDNN common header.
This will report back all errors regarding kernel launches, device initialization, and kernel compilation.
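For example, a minimal sketch (the define just has to be visible before any ViennaCL header is included; passing -DVIENNACL_DEBUG_ALL to the compiler is equivalent):

// At the very top of the LibDNN common header (or via the compiler flag):
#ifndef VIENNACL_DEBUG_ALL
#define VIENNACL_DEBUG_ALL
#endif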
OK, let me check.
Besides, kernel compilation happens in the greentea::LibDNNConv constructor, right? Or do we need to trigger something else?
No, your interfacing code looks good to me :)
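For reference, a minimal sketch of that interfacing (the Forward signature here is my assumption based on the LibDNN interface, so double-check it against the header; variable names are illustrative):

greentea::LibDNNConv<float> kernel(config);  // kernels are generated and compiled in the constructor
// dev_in, dev_W, dev_bias, dev_out are device buffers wrapped as float pointers
kernel.Forward(dev_in, dev_W, dev_bias, dev_out, batch_size);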
@naibaf7
After that line, you can use this code to dump the PTX code and check if it is generated correctly
Here I get a non-human-readable file that starts by saying it was compiled with the NVIDIA compiler, even though the flag USE_CUDA was OFF during LibDNN compilation.
http://pastebin.com/iQrXH0V9
Another way to get debug output for all ViennaCL related stuff is to define
During LibDNN CMake setup it raises an error saying that this flag does not exist.
@edgarriba USE_CUDA OFF is fine if you only use OpenCL.
The generated PTX code (a kind of portable GPU assembly) looks correct; this is what is expected to happen. nVidia compiles OpenCL through NVVM in the system driver (which you should update, by the way; 352.93 is a bit outdated).
So that means the code fails during the forward pass, and not during compilation. The most likely reason for this is an issue with the allocated device memory.
I'd say the allocation is okay: if I read the values from the input device memory instead of the output, I get the correct values. Could it be the casting? Right now we do the following: cl_mem -> void* -> float_t*
@edgarriba Yes the memory objects seem to be valid, otherwise ViennaCL would complain about an invalid object. The question is if the size of the objects is also correct.
What is line 90 doing? Over what is it looping? https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L90
It seems like you split the data into batches of size 1? While that should not cause issues here, the performance you are getting will be very sub-par, since the kernels in LibDNN gain speed by increasing the amount of parallel operations over the whole batch (and decreasing the number of kernel launches).
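For illustration, a hedged sketch of the difference (names assumed from the LibDNNConv interface sketched above; batch_size is illustrative):

// Instead of looping over single samples, e.g.
//   for (each sample s) kernel.Forward(dev_in_s, dev_W, dev_bias, dev_out_s, 1);
// enqueue the whole batch in one call, so a single launch covers all samples:
kernel.Forward(dev_in_batch, dev_W, dev_bias, dev_out_batch, batch_size);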
So, are the input data and kernel weights valid on the GPU after line 103/104?
What would help me a lot is if you could post all the parameters you use:
I'm also confused about in_padded and in. In LibDNN, the padding is implicit: if you pad the data explicitly yourself, the configuration should have pad = 0; if you use pad > 0, then the memory object that you input to the kernel should be unpadded.
It is quite important that in_shape (L232), out_shape (L233), and pad (L234) line up with what the memory objects actually look like.
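A sketch of the two consistent setups, using the LibDNNConvConfig fields from the gdb dump below (shape numbers are illustrative):

// Option A: implicit padding -- pass the raw, unpadded buffer and pad > 0
config.in_shape = {1, 1, 5, 5};  // unpadded input
config.pad      = {1, 1};        // LibDNN pads inside the kernel
// Option B: explicit padding -- pad the data yourself, then report pad = 0
config.in_shape = {1, 1, 7, 7};  // already-padded input (5 + 2*1)
config.pad      = {0, 0};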
@naibaf7 Sizes are the following
Size of buffer dev_in is 100 bytes
Size of buffer dev_W is 72 bytes
Size of buffer dev_bias is 8 bytes
Size of buffer dev_out is 72 bytes
@edgarriba And the kernel parameters (see updated comment above)?
You mean inside the Forward(...) call, or which variables?
(gdb) print config
$1 = {dev_ptr = 0x1131ae8,
      in_shape = std::vector of length 4, capacity 4 = {1, 1, 5, 5},
      out_shape = std::vector of length 4, capacity 4 = {1, 2, 3, 3},
      kernel = std::vector of length 2, capacity 2 = {3, 3},
      pad = std::vector of length 2, capacity 2 = {0, 0},
      stride = std::vector of length 2, capacity 2 = {1, 1},
      dilation = std::vector of length 2, capacity 2 = {1, 1},
      group = 1, bias_term = true, fast_unsafe_math = false,
      weights_backward = false, bias_backward = false,
      wgalgo = greentea::LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC,
      bwalgo = greentea::LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC}
Do we need any preprocessing like im2col?
So: no preprocessing is needed, and the configuration actually looks good. It should work as it is.
You can try to use the deterministic kernels and see if it makes a difference: https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L251 set this if-condition to false, so that:
config.wgalgo = greentea::LIBDNN_CONVOLUTION_WG_ALGO_DIRECT;
config.bwalgo = greentea::LIBDNN_CONVOLUTION_BW_ALGO_IM2COL;
This would not use the atomic operators. It should not be an issue, since I verified these kernels on a GTX 980, but just to be sure... oh, I just realized these are only relevant for the backward pass, not forward. Hmm...
The same happens.
Yeah, not relevant (see updated comment), since the forward pass already fails. It's really strange that it seems to fail without an error message from ViennaCL or the CUDA driver, and also without a segfault... at least one of these things should happen. Do you have any other OpenCL device to test on?
Do you have any synchronization method? I mean something like this: https://github.com/CNugteren/CLCudaAPI/blob/master/samples/simple.cc#L132
@edgarriba No. You can call it on the CLCudaAPI queue yourself if you want. But as soon as you trigger the transfer of the output from GPU back to CPU, all kernels that have access to that memory object will finish first anyway.
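If you want an explicit barrier anyway, a minimal sketch with CLCudaAPI (assuming queue is the same queue the LibDNN kernels were enqueued on, as in the linked sample):

// ... enqueue the LibDNN forward kernels on 'queue' ...
queue.Finish();  // blocks until everything enqueued so far has completed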
@edgarriba On which branch are you testing/working currently? I need to test this bug myself.
https://github.com/edgarriba/tiny-cnn/tree/opencl-ops-patch
You probably need to change platform_id and device_id (in this order):
https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/test/test_core.h#L113
cmake -DBUILD_TESTS=ON -DUSE_OPENCL=ON -DUSE_LIBDNN=ON ..
make && ./test/tiny_cnn_test
thx man!
@edgarriba
Can't compile it: CL/cl2.hpp is missing and picotest/picotest.h is missing. Where do you get these from?
Picotest is a submodule in the repository. I don't think cl2.hpp is required anymore after the CLCudaAPI introduction.
Picotest is a submodule
git submodule update --init
For the OpenCL headers, you can comment them out here (I will update this with CLCudaAPI): https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/util/util.h#L55 https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/util/util.h#L543-572
The cl2.hpp issue is fixed with https://github.com/edgarriba/tiny-cnn/commit/69791e17733baab965870e26833d40e4c8f2a5ed
@bhack great, I'll try again.
@edgarriba I've been able to pinpoint the issue. I will make a PR showing the problem(s). Besides that, when I compute the convolution by hand I get different "expected results" than you do, so please re-check those as well.
@naibaf7 During LibDNN Tune I got the following seg fault error http://pastebin.com/faWZZUcR
If you want to check the code: https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L67