/cc @bhack
@edgarriba Ok I will look into it. Meanwhile, what performance numbers do you get with the default kernel configuration?
@naibaf7
After calling Forward(), I read back the data from memory and all I get are zeros:
https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L137-L145
Is there any way to check that the kernel compilation is fine? Or what other checks could I do?
@edgarriba Oh okay, if the tuning fails and the forward as well, then it seems like the kernel has not been compiled at all. In OpenCL mode, https://github.com/naibaf7/libdnn/blob/master/src/libdnn.cpp#L1624 would be the critical point.
After that line, you can use this code to dump the PTX code and check if it is generated correctly:
size_t bin_sz;
// Query the size of the program binary (PTX on NVIDIA OpenCL), then fetch it.
clGetProgramInfo(ocl_program_.handle().get(), CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_sz, NULL);
unsigned char *bin = (unsigned char *)malloc(bin_sz); // NOLINT
clGetProgramInfo(ocl_program_.handle().get(), CL_PROGRAM_BINARIES, sizeof(unsigned char *), &bin, NULL);
// Write the binary to disk so it can be inspected.
FILE* fp = fopen("libdnn_conv_opencl.ptx", "wb");
fwrite(bin, sizeof(char), bin_sz, fp);
fclose(fp);
free(bin); // NOLINT
Another way to get debug output for all ViennaCL-related stuff is to define VIENNACL_DEBUG_ALL, either with -DVIENNACL_DEBUG_ALL when compiling LibDNN, or by defining VIENNACL_DEBUG_ALL in the LibDNN common header.
This will report back all errors regarding kernel launches, device initialization, and kernel compilation.
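For example, a minimal sketch (the define just has to be visible before any ViennaCL header is included; passing -DVIENNACL_DEBUG_ALL to the compiler is equivalent):

// At the very top of the LibDNN common header (or via the compiler flag):
#ifndef VIENNACL_DEBUG_ALL
#define VIENNACL_DEBUG_ALL
#endif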
OK, let me check.
Besides, kernel compilation happens in the greentea::LibDNNConv constructor, right? Or do we need to trigger something else?
No, your interfacing code looks good to me :)
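For reference, a minimal sketch of that interfacing (the Forward signature here is my assumption based on the LibDNN interface, so double-check it against the header; variable names are illustrative):

greentea::LibDNNConv<float> kernel(config);  // kernels are generated and compiled in the constructor
// dev_in, dev_W, dev_bias, dev_out are device buffers wrapped as float pointers
kernel.Forward(dev_in, dev_W, dev_bias, dev_out, batch_size);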
@naibaf7
After that line, you can use this code to dump the PTX code and check if it is generated correctly
Here I get a non-human-readable file that starts by saying it was compiled with the NVIDIA compiler, even though the flag USE_CUDA was OFF during LibDNN compilation.
http://pastebin.com/iQrXH0V9
Another way to get debug output for all ViennaCL related stuff is to define
During LibDNN CMake setup it raises an error saying that this flag does not exist.
@edgarriba USE_CUDA OFF is fine if you only use OpenCL.
The generated PTX code (a kind of portable GPU assembly) looks correct; this is what is expected to happen. nVidia compiles OpenCL through NVVM in the system driver (which you should update, by the way; 352.93 is a bit outdated).
So that means the code fails during the forward pass, and not during compilation. The most likely reason for this is an issue with the allocated device memory.
I'd say the allocation is okay: if I read the values from the input device memory instead of the output, I get the correct values. Could it be the casting? Right now we do the following: cl_mem -> void* -> float_t*
@edgarriba Yes the memory objects seem to be valid, otherwise ViennaCL would complain about an invalid object. The question is if the size of the objects is also correct.
What is line 90 doing? Over what is it looping? https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L90
It seems like you split the data into batches of size 1? While that should not cause issues here, the performance you are getting will be very sub-par, since the kernels in LibDNN gain speed by increasing the amount of parallel operations over the whole batch (and decreasing the number of kernel launches).
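For illustration, a hedged sketch of the difference (names assumed from the LibDNNConv interface sketched above; batch_size is illustrative):

// Instead of looping over single samples, e.g.
//   for (each sample s) kernel.Forward(dev_in_s, dev_W, dev_bias, dev_out_s, 1);
// enqueue the whole batch in one call, so a single launch covers all samples:
kernel.Forward(dev_in_batch, dev_W, dev_bias, dev_out_batch, batch_size);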
So, are the input data and kernel weights valid on the GPU after line 103/104?
What would help me a lot is if you could post all the parameters you use:
I'm also confused about in_padded and in. In LibDNN, the padding is implicit: if you pad the data explicitly yourself, the configuration should have pad = 0; if you use pad > 0, then the memory object that you input to the kernel should be unpadded.
It is quite important that in_shape (L232), out_shape (L233), and pad (L234) line up with what the memory objects actually look like.
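A sketch of the two consistent setups, using the LibDNNConvConfig fields from the gdb dump below (shape numbers are illustrative):

// Option A: implicit padding -- pass the raw, unpadded buffer and pad > 0
config.in_shape = {1, 1, 5, 5};  // unpadded input
config.pad      = {1, 1};        // LibDNN pads inside the kernel
// Option B: explicit padding -- pad the data yourself, then report pad = 0
config.in_shape = {1, 1, 7, 7};  // already-padded input (5 + 2*1)
config.pad      = {0, 0};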
@naibaf7 Sizes are the following
Size of buffer dev_in is 100 bytes
Size of buffer dev_W is 72 bytes
Size of buffer dev_bias is 8 bytes
Size of buffer dev_out is 72 bytes
@edgarriba And the kernel parameters (see updated comment above)?
You mean inside the Forward(...) call, or which variables?
(gdb) print config
$1 = {dev_ptr = 0x1131ae8,
      in_shape = std::vector of length 4, capacity 4 = {1, 1, 5, 5},
      out_shape = std::vector of length 4, capacity 4 = {1, 2, 3, 3},
      kernel = std::vector of length 2, capacity 2 = {3, 3},
      pad = std::vector of length 2, capacity 2 = {0, 0},
      stride = std::vector of length 2, capacity 2 = {1, 1},
      dilation = std::vector of length 2, capacity 2 = {1, 1},
      group = 1, bias_term = true, fast_unsafe_math = false,
      weights_backward = false, bias_backward = false,
      wgalgo = greentea::LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC,
      bwalgo = greentea::LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC}
Do we need any preprocessing like im2col?
So: no preprocessing is needed, and the configuration actually looks good. It should work as it is.
You can try to use the deterministic kernels and see if it makes a difference: https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L251 set this if-condition to false, so that:
config.wgalgo = greentea::LIBDNN_CONVOLUTION_WG_ALGO_DIRECT;
config.bwalgo = greentea::LIBDNN_CONVOLUTION_BW_ALGO_IM2COL;
This would not use the atomic operators. It should not be an issue, since I verified these kernels on a GTX 980, but just to be sure... oh, I just realized these are only relevant for the backward pass, not forward. Hmm...
The same happens.
Yeah, not relevant (see updated comment), since the forward pass already fails. It's really strange that it seems to fail without an error message from ViennaCL or the CUDA driver, and also without a segfault... at least one of these things should happen. Do you have any other OpenCL device to test on?
Do you have any synchronization method? I mean something like this: https://github.com/CNugteren/CLCudaAPI/blob/master/samples/simple.cc#L132
@edgarriba No. You can call it on the CLCudaAPI queue yourself if you want. But as soon as you trigger the transfer of the output from GPU back to CPU, all kernels that have access to that memory object will finish first anyway.
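If you want an explicit barrier anyway, a minimal sketch with CLCudaAPI (assuming queue is the same queue the LibDNN kernels were enqueued on, as in the linked sample):

// ... enqueue the LibDNN forward kernels on 'queue' ...
queue.Finish();  // blocks until everything enqueued so far has completed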
@edgarriba On which branch are you testing/working currently? I need to test this bug myself.
https://github.com/edgarriba/tiny-cnn/tree/opencl-ops-patch
You probably need to change platform_id and device_id (in this order):
https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/test/test_core.h#L113
cmake -DBUILD_TESTS=ON -DUSE_OPENCL=ON -DUSE_LIBDNN=ON ..
make && ./test/tiny_cnn_test
thx man!
@edgarriba
Can't compile it: CL/cl2.hpp is missing and picotest/picotest.h is missing. Where do you get these from?
Picotest is a submodule in the repository. I don't think cl2.hpp is required anymore after the CLCudaAPI introduction.
Picotest is a submodule
git submodule update --init
For the OpenCL headers, you can comment them out here (I will update this with CLCudaAPI): https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/util/util.h#L55 https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/util/util.h#L543-572
The cl2.hpp issue is fixed with https://github.com/edgarriba/tiny-cnn/commit/69791e17733baab965870e26833d40e4c8f2a5ed
@bhack great, I'll try again.
@edgarriba I've been able to pinpoint the issue. I will make a PR showing the problem(s). Besides that, when I compute the convolution by hand I get different "expected results" than you do, so please re-check those as well.
@naibaf7 During LibDNN Tune I got the following seg fault error http://pastebin.com/faWZZUcR
If you want to check the code: https://github.com/edgarriba/tiny-cnn/blob/opencl-ops-patch/tiny_cnn/core/kernels/conv2d_op_libdnn.h#L67