BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Runtest fails - core dump [opencl] #4188

Closed — YutaOtsuka closed this issue 8 years ago

YutaOtsuka commented 8 years ago

Hi, I'm trying to use OpenCL-Caffe on a GeForce GTX TITAN X. "make all" and "make test" pass, but "make runtest" fails with a segmentation fault. My environment is Ubuntu 14.04, a GeForce GTX TITAN X, and OpenCL 1.2. I didn't change the Makefile.config, and I'm using viennacl-dev. What do you think?

[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3
*** Aborted at 1463757287 (unix time) try "date -d @1463757287" if you are using GNU date ***
PC: @     0x2b9f008cf172 caffe::ConvolutionLayerSpatial<>::timed_convolve()
*** SIGSEGV (@0x161) received by PID 4204 (TID 0x2b9efe33d3c0) from PID 353; stack trace: ***
    @     0x2b9f01350d40 (unknown)
    @     0x2b9f008cf172 caffe::ConvolutionLayerSpatial<>::timed_convolve()
    @     0x2b9f008cb1d8 caffe::ConvolutionLayerSpatial<>::setup_convolution()
    @     0x2b9f008ce47a caffe::ConvolutionLayerSpatial<>::Forward_gpu()
    @           0x47d922 caffe::Layer<>::Forward()
    @           0x4e9149 caffe::ConvolutionLayerTest_Spatial_TestSimpleConvolution_Spatial3x3_Test<>::TestBody_Impl()
    @           0x8ccd63 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0x8c5b87 testing::Test::Run()
    @           0x8c5c2e testing::TestInfo::Run()
    @           0x8c5d35 testing::TestCase::Run()
    @           0x8c5fd8 testing::internal::UnitTestImpl::RunAllTests()
    @           0x8c6277 testing::UnitTest::Run()
    @           0x46e335 main
    @           0x2b9f0133bec5 (unknown)
    @           0x478829 (unknown)
    @           0x0 (unknown)
make: *** [runtest] Segmentation fault (core dumped)
naibaf7 commented 8 years ago

Duplicate; see here: https://github.com/BVLC/caffe/issues/4179

naibaf7 commented 8 years ago

Oh, OK, sorry, I see you already use ViennaCL-DEV. In that case we must ask @gongzg from Intel whether he knows what could cause the issue.

This layer is not used on actual networks anyway; as long as all the other tests pass and only ConvolutionLayerSpatial fails, you're fine.

YutaOtsuka commented 8 years ago

I can pass the other runtests, but I can't execute actual classification, so I assumed the failing runtest was the problem.

Loading file: ../../../pictures/101_ObjectCategories/airplanes/image_0001.jpg
Classifying 1 inputs.
ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program ''
Number of kernels in program: 0
std::exception
Segmentation fault (core dumped)
naibaf7 commented 8 years ago

@YutaOtsuka Ok, that's interesting, let's see then:

  1. ./build/test/test_all.testbin --gtest_filter=*OpenCLKernelCompileTest* 0
  2. clinfo
  3. ./build/tools/caffe device_query

These outputs might give us a hint as to what is going on. I myself have a GTX 980 with the latest driver which works well in OpenCL mode, so the Titan X should be no different.

YutaOtsuka commented 8 years ago

I executed your commands. Here is the output of ./build/tools/caffe device_query:

I0520 15:07:56.841437  5367 common.cpp:373] Total devices: 1
I0520 15:07:56.841604  5367 common.cpp:374] CUDA devices: 0
I0520 15:07:56.841610  5367 common.cpp:375] OpenCL devices: 1
I0520 15:07:56.841615  5367 common.cpp:399] Device id:                     0
I0520 15:07:56.841620  5367 common.cpp:401] Device backend:                OpenCL
I0520 15:07:56.841631  5367 common.cpp:403] Backend details:               NVIDIA Corporation: OpenCL 1.2 CUDA 7.5.23
I0520 15:07:56.841637  5367 common.cpp:405] Device vendor:                 NVIDIA Corporation
I0520 15:07:56.841681  5367 common.cpp:407] Name:                          GeForce GTX TITAN X
I0520 15:07:56.841718  5367 common.cpp:409] Total global memory:           12884705280
naibaf7 commented 8 years ago

@YutaOtsuka What about the other commands?

YutaOtsuka commented 8 years ago

Oh, sorry. Here is the output of ./build/test/test_all.testbin --gtest_filter=OpenCLKernelCompileTest 0:

Note: Google Test filter = OpenCLKernelCompileTest

[==========] Running 0 tests from 0 test cases.
[==========] 0 tests from 0 test cases ran. (0 ms total)
[  PASSED  ] 0 tests.

And here is the clinfo output:

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 1.2 CUDA 7.5.23
  Platform Name:                 NVIDIA CUDA
  Platform Vendor:               NVIDIA Corporation
  Platform Extensions:               cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts 

  Platform Name:                 NVIDIA CUDA
Number of devices:               1
  Device Type:                   CL_DEVICE_TYPE_GPU
  Device ID:                     4318
  Max compute units:                 24
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               64
  Max work group size:               1024
  Preferred vector width char:           1
  Preferred vector width short:          1
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          1
  Native vector width short:             1
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1215Mhz
  Address bits:                  64
  Max memory allocation:             3221176320
  Image support:                 Yes
  Max number of images read arguments:       256
  Max number of images write arguments:      16
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                4096
  Max image 3D height:               4096
  Max image 3D depth:                4096
  Max samplers within kernel:            32
  Max size of kernel argument:           4352
  Alignment (bits) of base address:      4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               128
  Cache size:                    393216
  Global memory size:                12884705280
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory type:                 Local
  Local memory size:                 49152
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1000
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue properties:              
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x252b560
  Name:                      GeForce GTX TITAN X
  Vendor:                    NVIDIA Corporation
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                352.68
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 CUDA
  Extensions:                    cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 
naibaf7 commented 8 years ago

@YutaOtsuka There is an error in how you executed the test filter: it must be ./build/test/test_all.testbin --gtest_filter=*OpenCLKernelCompileTest* 0. Note the asterisks around the filter keyword; without them, gtest matches zero tests, as your output shows.

YutaOtsuka commented 8 years ago

Here it is. Thank you.

Setting to use device 0
Note: Google Test filter = *OpenCLKernelCompileTest*
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from OpenCLKernelCompileTest/0, where TypeParam = float
[ RUN      ] OpenCLKernelCompileTest/0.TestCompile
Kernel bundle: activation: OK
Kernel bundle: auxiliary: OK
Kernel bundle: batch_reindex: OK
Kernel bundle: benchmark: OK
Kernel bundle: bias: OK
Kernel bundle: bnll: OK
Kernel bundle: channel: OK
Kernel bundle: concat: OK
Kernel bundle: contrastive_loss: OK
Kernel bundle: conv_layer_spatial: OK
Kernel bundle: crop: OK
Kernel bundle: dropout: OK
Kernel bundle: eltwise: OK
Kernel bundle: elu: OK
Kernel bundle: embed: OK
Kernel bundle: fft: OK
Kernel bundle: fillbuffer: OK
Kernel bundle: im2col: OK
Kernel bundle: im2col_nd: OK
Kernel bundle: lrn: OK
Kernel bundle: math: OK
Kernel bundle: mergecrop: OK
Kernel bundle: pooling: OK
Kernel bundle: pooling_nd: OK
Kernel bundle: pooling_sk: OK
Kernel bundle: slice: OK
Kernel bundle: softmax_loss: OK
Kernel bundle: solvers: OK
Kernel bundle: tile: OK
[       OK ] OpenCLKernelCompileTest/0.TestCompile (8 ms)
[----------] 1 test from OpenCLKernelCompileTest/0 (8 ms total)

[----------] 1 test from OpenCLKernelCompileTest/1, where TypeParam = double
[ RUN      ] OpenCLKernelCompileTest/1.TestCompile
Kernel bundle: activation: OK
Kernel bundle: auxiliary: OK
Kernel bundle: batch_reindex: OK
Kernel bundle: benchmark: OK
Kernel bundle: bias: OK
Kernel bundle: bnll: OK
Kernel bundle: channel: OK
Kernel bundle: concat: OK
Kernel bundle: contrastive_loss: OK
Kernel bundle: conv_layer_spatial: OK
Kernel bundle: crop: OK
Kernel bundle: dropout: OK
Kernel bundle: eltwise: OK
Kernel bundle: elu: OK
Kernel bundle: embed: OK
Kernel bundle: fft: OK
Kernel bundle: fillbuffer: OK
Kernel bundle: im2col: OK
Kernel bundle: im2col_nd: OK
Kernel bundle: lrn: OK
Kernel bundle: math: OK
Kernel bundle: mergecrop: OK
Kernel bundle: pooling: OK
Kernel bundle: pooling_nd: OK
Kernel bundle: pooling_sk: OK
Kernel bundle: slice: OK
Kernel bundle: softmax_loss: OK
Kernel bundle: solvers: OK
Kernel bundle: tile: OK
[       OK ] OpenCLKernelCompileTest/1.TestCompile (8 ms)
[----------] 1 test from OpenCLKernelCompileTest/1 (8 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (16 ms total)
[  PASSED  ] 2 tests.
naibaf7 commented 8 years ago

@YutaOtsuka So it looks like everything is fine. That's odd. What does your classification code look like? Share as much as possible.

YutaOtsuka commented 8 years ago

I just used the provided classify.py, like this:

   python ../../python/classify.py --model_def ./VGG_ILSVRC_16_layers_deploy.prototxt
                               --pretrained_model ./VGG_ILSVRC_16_layers.caffemodel                            
                               --gpu 
                               --raw_scale 255 ../../../pictures/101_ObjectCategories/airplanes/image_0001.jpg 
                               ./result.npy
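
For context, classify.py is essentially a thin wrapper around pycaffe's caffe.Classifier; here is a rough sketch of the equivalent Python, assuming the standard pycaffe API and reusing the paths from the command above:

    import numpy as np
    import caffe

    caffe.set_mode_gpu()  # the --gpu flag

    # raw_scale=255 mirrors --raw_scale 255; the BGR channel swap mirrors
    # classify.py's default --channel_swap 2,1,0.
    net = caffe.Classifier('./VGG_ILSVRC_16_layers_deploy.prototxt',
                           './VGG_ILSVRC_16_layers.caffemodel',
                           raw_scale=255,
                           channel_swap=(2, 1, 0))

    # Load one image, run the forward pass, and save the class scores.
    image = caffe.io.load_image(
        '../../../pictures/101_ObjectCategories/airplanes/image_0001.jpg')
    predictions = net.predict([image])
    np.save('./result.npy', predictions)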
naibaf7 commented 8 years ago

@YutaOtsuka Ah yes, the PyCaffe code did not initialize the OpenCL GPU correctly. I fixed it now. The downside of that PyCaffe code is that it only works with the first OpenCL/CUDA device present, which is a bit stupid but oh well, at least it should work for you now.
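
Until the fix is pulled, the workaround (confirmed later in this thread) is to select the device explicitly before any net is constructed. A minimal sketch using the standard pycaffe calls caffe.set_device() and caffe.set_mode_gpu():

    import caffe

    # Select the device explicitly *before* constructing a net. In a pure
    # OpenCL build, device 0 is the first OpenCL device; when CUDA is also
    # enabled, CUDA devices are listed first (see device_query below).
    caffe.set_device(0)
    caffe.set_mode_gpu()

    # Constructing the net only after set_device() avoids the uninitialized
    # OpenCL context that produced the 'fillbuffer_float' error above.
    net = caffe.Net('./VGG_ILSVRC_16_layers_deploy.prototxt',
                    './VGG_ILSVRC_16_layers.caffemodel',
                    caffe.TEST)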

YutaOtsuka commented 8 years ago

How can I get the new PyCaffe code?

naibaf7 commented 8 years ago

@YutaOtsuka Just pull the latest version of the OpenCL-Caffe repository.

YutaOtsuka commented 8 years ago

It worked correctly. Thank you very much!

naibaf7 commented 8 years ago

@YutaOtsuka I wonder why you'd use OpenCL-Caffe instead of CUDA-Caffe on a Titan X though. It is quite a bit slower still.

YutaOtsuka commented 8 years ago

I just wanted to explore the OpenCL ecosystem; if possible, I'd like to use it on an FPGA eventually. It's just an idea for now.

naibaf7 commented 8 years ago

@YutaOtsuka Sounds good. Yes, big speed improvements are in progress.

buaapengbo commented 8 years ago

Hi, I met the same problem in OpenCL Caffe with an NVIDIA GTX 970. I moved test_convolution_layer_spatial.cpp to test_convolution_layer_spatial.log, then ran:

    make clean
    make all
    make test
    make runtest

and I got this:

[----------] Global test environment tear-down
[==========] 2034 tests from 274 test cases ran. (480505 ms total)
[  PASSED  ] 2033 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NetTest/0.TestSharedWeightsUpdate, where TypeParam = caffe::CPUDevice<float>

Then I tried to train LeNet following http://caffe.berkeleyvision.org/gathered/examples/mnist.html. I just added one line to the conv1 layer of examples/mnist/lenet_train_test.prototxt:

engine: SPATIAL

When I train the net with ./examples/mnist/train_lenet.sh, I get this:

I0622 14:02:52.293876  2826 solver.cpp:111] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 52:5: Unknown enumeration value of "SPATIAL" for field "engine".
F0622 14:02:52.294034  2826 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/mnist/lenet_train_test.prototxt
*** Check failure stack trace: ***
    @     0x7faeca694daa  (unknown)
    @     0x7faeca694ce4  (unknown)
    @     0x7faeca6946e6  (unknown)
    @     0x7faeca697687  (unknown)
    @     0x7faecaa92ebe  caffe::ReadNetParamsFromTextFileOrDie()
    @     0x7faecaac99cb  caffe::Solver<>::InitTrainNet()
    @     0x7faecaac9eb6  caffe::Solver<>::Init()
    @     0x7faecaaca1c6  caffe::Solver<>::Solver()
    @     0x7faecaac3c03  caffe::Creator_SGDSolver<>()
    @           0x415c07  caffe::SolverRegistry<>::CreateSolver()
    @           0x40f323  train()
    @           0x40cd6c  main
    @     0x7faec979cf45  (unknown)
    @           0x40d56b  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

What do you think?

naibaf7 commented 8 years ago

@buaapengbo The shared weights update test failing is fine; you can ignore it. The SPATIAL engine is mainly for Intel chips, and I believe the correct identifier is INTEL_SPATIAL now. See here for the available engines:

  enum Engine {
    DEFAULT = 0;
    CAFFE = 1;
    CUDNN = 2;
    LIBDNN = 3;
    INTEL_SPATIAL = 4;
    FFT = 5;
  }

And you have to enable USE_INTEL_SPATIAL := 1 in the Makefile.config.

This engine is mainly for Intel GPUs, though. Use CAFFE, DEFAULT, or LIBDNN for most NVIDIA and AMD chips.
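
As a side note, you can catch invalid engine values like the earlier Unknown enumeration value of "SPATIAL" error before launching training by parsing the prototxt against the compiled schema. A minimal sketch, assuming pycaffe has been built so its generated caffe.proto.caffe_pb2 module is importable:

    from google.protobuf import text_format
    from caffe.proto import caffe_pb2

    # Parsing the net definition against the compiled schema surfaces bad
    # values (e.g. engine: SPATIAL where INTEL_SPATIAL is expected) as a
    # ParseError instead of a fatal abort inside the caffe binary.
    net_param = caffe_pb2.NetParameter()
    try:
        with open('examples/mnist/lenet_train_test.prototxt') as f:
            text_format.Merge(f.read(), net_param)
        print('prototxt parses OK')
    except text_format.ParseError as err:
        print('invalid prototxt:', err)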

buaapengbo commented 8 years ago

@naibaf7 Thank you for your quick reply! I understand the TestSharedWeightsUpdate failure can be ignored.

I only use an Intel i7-6700 CPU and an NVIDIA GTX 970; does this mean I don't need to set USE_INTEL_SPATIAL := 1 in the Makefile.config?

I set engine: CAFFE in the prototxt file and trained with:

./examples/mnist/train_lenet.sh

I got these messages:

I0622 14:32:23.470811  3920 solver.cpp:251] Iteration 800, loss = 0.216637
I0622 14:32:23.470849  3920 solver.cpp:267]     Train net output #0: loss = 0.216637 (* 1 = 0.216637 loss)
I0622 14:32:23.470856  3920 sgd_solver.cpp:112] Iteration 800, lr = 0.00943913
I0622 14:32:27.685243  3920 solver.cpp:251] Iteration 900, loss = 0.154349
I0622 14:32:27.685279  3920 solver.cpp:267]     Train net output #0: loss = 0.154349 (* 1 = 0.154349 loss)
I0622 14:32:27.685287  3920 sgd_solver.cpp:112] Iteration 900, lr = 0.00937411
I0622 14:32:31.858384  3920 solver.cpp:479] Snapshotting to binary proto file examples/mnist/lenet_iter_1000.caffemodel
I0622 14:32:31.869387  3920 sgd_solver.cpp:323] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_1000.solverstate
I0622 14:32:31.909396  3920 solver.cpp:341] Iteration 1000, loss = 0.0869865
I0622 14:32:31.909431  3920 solver.cpp:362] Iteration 1000, Testing net (#0)
I0622 14:32:35.642645  3920 solver.cpp:429]     Test net output #0: accuracy = 0.981
I0622 14:32:35.642683  3920 solver.cpp:429]     Test net output #1: loss = 0.0592155 (* 1 = 0.0592155 loss)
I0622 14:32:35.642691  3920 solver.cpp:346] Optimization Done.
I0622 14:32:35.642696  3920 caffe.cpp:249] Optimization Done.

I have a question: did I actually use the NVIDIA OpenCL backend?

naibaf7 commented 8 years ago

@buaapengbo If you have disabled the CUDA backend in the Makefile.config, or if you set the -gpu flag to 1 instead of 0 (when the CUDA backend is enabled as well), then yes, you used OpenCL.

You can also test which devices will be used with this command: ./build/tools/caffe device_query

If you have only OpenCL enabled, it will be:

    0: GTX 970 OpenCL

If you have OpenCL and CUDA enabled:

    0: GTX 970 CUDA
    1: GTX 970 OpenCL

If you have installed Intel's OpenCL SDK, the i7-6700 will also show up. If you have installed the Beignet OpenCL driver and enabled the i7-6700's iGPU, that will show up as well.

buaapengbo commented 8 years ago

@naibaf7 Thank you for your reply.

My Makefile.config is:

    # USE_CUDA := 1
    USE_GREENTEA := 1

I also ran the command:

./build/tools/caffe device_query

and got this:

pengbo@FPGA-Accel-Server:~/cnns/git/caffe$ ./build/tools/caffe device_query
I0623 11:07:02.536579  5995 common.cpp:373] Total devices: 1
I0623 11:07:02.536739  5995 common.cpp:374] CUDA devices: 0
I0623 11:07:02.536747  5995 common.cpp:375] OpenCL devices: 1
I0623 11:07:02.536752  5995 common.cpp:399] Device id:                     0
I0623 11:07:02.536757  5995 common.cpp:401] Device backend:                OpenCL
I0623 11:07:02.536769  5995 common.cpp:403] Backend details:               NVIDIA Corporation: OpenCL 1.2 CUDA 7.5.18
I0623 11:07:02.536777  5995 common.cpp:405] Device vendor:                 NVIDIA Corporation
I0623 11:07:02.536806  5995 common.cpp:407] Name:                          GeForce GTX 970
I0623 11:07:02.536837  5995 common.cpp:409] Total global memory:           4294770688

I haven't installed Intel's OpenCL SDK or Beignet. Does this mean that I am using the GTX 970 via OpenCL but not via CUDA?

And if I enable the CUDA backend, I can use the commands:

./examples/mnist/train_lenet.sh -gpu 0 to train and test with device 0 (CUDA by default), and

./examples/mnist/train_lenet.sh -gpu 1 to train and test with device 1 (OpenCL by default).

I understand, thank you very much!

naibaf7 commented 8 years ago

@buaapengbo Exactly, you got that right :) Cool, isn't it? You can also try compiling with USE_LIBDNN := 1, which should give better performance for both OpenCL and CUDA. It's slower than cuDNN but faster than cuBLAS/clBLAS/ViennaCL.

buaapengbo commented 8 years ago

@naibaf7 That's very cool! I modified my Makefile.config:

    USE_CUDA := 1
    USE_GREENTEA := 1

then

    make clean
    make all
    make test
    make runtest

.build_release/tools/caffe device_query

I got this:

pengbo@FPGA-Accel-Server:~/cnns/git/caffe$ .build_release/tools/caffe device_query
I0623 13:30:33.994971 14615 common.cpp:373] Total devices: 2
I0623 13:30:33.995152 14615 common.cpp:374] CUDA devices: 1
I0623 13:30:33.995160 14615 common.cpp:375] OpenCL devices: 1
I0623 13:30:33.995340 14615 common.cpp:382] Device id:                     0
I0623 13:30:33.995348 14615 common.cpp:384] Device backend:                CUDA
I0623 13:30:33.995368 14615 common.cpp:386] Backend details:               CUDA
I0623 13:30:33.995373 14615 common.cpp:388] Device vendor:                 NVIDIA Corporation
I0623 13:30:33.995376 14615 common.cpp:390] Name:                          GeForce GTX 970
I0623 13:30:33.995381 14615 common.cpp:392] Total global memory:           4294770688
I0623 13:30:33.995391 14615 common.cpp:399] Device id:                     1
I0623 13:30:33.995398 14615 common.cpp:401] Device backend:                OpenCL
I0623 13:30:33.995410 14615 common.cpp:403] Backend details:               NVIDIA Corporation: OpenCL 1.2 CUDA 7.5.18
I0623 13:30:33.995417 14615 common.cpp:405] Device vendor:                 NVIDIA Corporation
I0623 13:30:33.995440 14615 common.cpp:407] Name:                          GeForce GTX 970
I0623 13:30:33.995471 14615 common.cpp:409] Total global memory:           4294770688

OK, I can train the net with -gpu 0 using CUDA and with -gpu 1 using OpenCL!

Thank you very much for your help! @naibaf7

ubergarm commented 7 years ago

I was getting the same error, ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '', despite the Caffe tests shown above passing, etc.

Explicitly setting the device in the python code e.g. caffe.set_device(0) fixed the problem in my case.

Repo: BVLC/caffe Branch: opencl Commit: 72edcdc

Thanks!

naibaf7 commented 7 years ago

@ubergarm Yes, OpenCL Caffe requires explicit device initialization and cannot default to the primary device like CUDA Caffe does. The test code does call caffe.set_device(x), where x is the device number passed on the command line to the test suite.

dohai90 commented 7 years ago

@naibaf7 Hello, I got errors with opencl-caffe during runtest, as below:

ViennaCL: FATAL ERROR: Could not find kernel 'im2col_float' from program ''
Number of kernels in program: 0
unknown file: Failure
C++ exception with description "Kernel not found" thrown in the test body.
[  FAILED  ] ConvolutionLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice (9 ms)
[ RUN      ] ConvolutionLayerTest/2.TestDilatedGradient
ViennaCL: FATAL ERROR: Could not find kernel 'im2col_float' from program ''
Number of kernels in program: 0
unknown file: Failure
C++ exception with description "Kernel not found" thrown in the test body.
[  FAILED  ] ConvolutionLayerTest/2.TestDilatedGradient, where TypeParam = caffe::GPUDevice (8 ms)
[ RUN      ] ConvolutionLayerTest/2.TestGradient3D
ViennaCL: FATAL ERROR: Could not find kernel 'im2col_nd_float' from program ''
Number of kernels in program: 0
unknown file: Failure
C++ exception with description "Kernel not found" thrown in the test body.
[  FAILED  ] ConvolutionLayerTest/2.TestGradient3D, where TypeParam = caffe::GPUDevice (11 ms)
[ RUN      ] ConvolutionLayerTest/2.Test1x1Gradient
F0217 11:45:11.096109 10309 syncedmem.cpp:278] Check failed: mapped_ptr == cpu_ptr_ (0 vs. 0x10f8000) Device claims it support zero copy but failed to create correct user ptr buffer
*** Check failure stack trace: ***
    @ 0xb6b0686e  google::LogMessage::Fail()
    @ 0xb6b07e6a  google::LogMessage::SendToLog()
    @ 0xb6b0651c  google::LogMessage::Flush()
    @ 0xb6b0848c  google::LogMessageFatal::~LogMessageFatal()
    @ 0xb6e6f664  caffe::SyncedMemory::gpu_data()
    @ 0xb6cf96fc  caffe::Blob<>::gpu_diff()
    @ 0xb6e9f384  caffe::ConvolutionLayer<>::Backward_gpu()
    @   0x4a762a  caffe::Layer<>::Backward()
    @   0x4da944  caffe::GradientChecker<>::CheckGradientExhaustive()
    @   0x4edf2c  caffe::ConvolutionLayerTest_Test1x1Gradient_Test<>::TestBody_Impl()
    @   0x7afe80  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @   0x7ab5fc  testing::Test::Run()
    @   0x7ab708  testing::TestInfo::Run()
    @   0x7ab7b6  testing::TestCase::Run()
    @   0x7ac5a4  testing::internal::UnitTestImpl::RunAllTests()
    @   0x7ac816  testing::UnitTest::Run()
    @   0x462752  main
    @  0xb613e8aa  __libc_start_main
Aborted

I also ran the commands you recommended above. The results are:

clinfo

Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
  Platform Extensions function suffix             ARM

  Platform Name                                   ARM Platform
Number of devices                                 2
  Device Name                                     Mali-T628
  Device Vendor                                   ARM
  Device Vendor ID                                0x6200010
  Device Version                                  OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
  Driver Version                                  1.2
  Device OpenCL C Version                         OpenCL C 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               4
  Max clock frequency                             600MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple              4
  Preferred / native vector sizes
    char                                          16 / 16
    short                                         8 / 8
    int                                           4 / 4
    long                                          2 / 2
    half                                          8 / 8 (cl_khr_fp16)
    float                                         4 / 4
    double                                        2 / 2 (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              2086998016 (1.944GiB)
  Error Correction support                        No
  Max memory allocation                           521749504 (497.6MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          16 bytes
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max constant buffer size                        65536 (64KiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

  Device Name                                     Mali-T628
  Device Vendor                                   ARM
  Device Vendor ID                                0x6200010
  Device Version                                  OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
  Driver Version                                  1.2
  Device OpenCL C Version                         OpenCL C 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               2
  Max clock frequency                             600MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple              4
  Preferred / native vector sizes
    char                                          16 / 16
    short                                         8 / 8
    int                                           4 / 4
    long                                          2 / 2
    half                                          8 / 8 (cl_khr_fp16)
    float                                         4 / 4
    double                                        2 / 2 (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              2086998016 (1.944GiB)
  Error Correction support                        No
  Max memory allocation                           521749504 (497.6MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          16 bytes
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max constant buffer size                        65536 (64KiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)          No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)          Success (2)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T628
    Device Name                                   Mali-T628
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)       No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)          Success (2)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T628
    Device Name                                   Mali-T628

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.8
  ICD loader Profile                              OpenCL 1.2
  NOTE: your OpenCL library declares to support OpenCL 1.2, but it seems to support up to OpenCL 2.1 too.

./build/tools/caffe device_query
I0218 02:06:41.756657 30356 common.cpp:379] Total devices: 2
I0218 02:06:41.757541 30356 common.cpp:380] CUDA devices: 0
I0218 02:06:41.757691 30356 common.cpp:381] OpenCL devices: 2
I0218 02:06:41.757817 30356 common.cpp:405] Device id:                     0
I0218 02:06:41.757937 30356 common.cpp:407] Device backend:                OpenCL
I0218 02:06:41.758064 30356 common.cpp:409] Backend details:               ARM: OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
I0218 02:06:41.758239 30356 common.cpp:411] Device vendor:                 ARM
I0218 02:06:41.758369 30356 common.cpp:413] Name:                          Mali-T628
I0218 02:06:41.758502 30356 common.cpp:415] Total global memory:           2086998016
I0218 02:06:41.758649 30356 common.cpp:405] Device id:                     1
I0218 02:06:41.758767 30356 common.cpp:407] Device backend:                OpenCL
I0218 02:06:41.758891 30356 common.cpp:409] Backend details:               ARM: OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
I0218 02:06:41.759027 30356 common.cpp:411] Device vendor:                 ARM
I0218 02:06:41.759150 30356 common.cpp:413] Name:                          Mali-T628
I0218 02:06:41.759344 30356 common.cpp:415] Total global memory:           2086998016

And the final command, ./build/test/test_all.testbin --gtest_filter=OpenCLKernelCompileTest 0, results in the attached log1.txt file: log1.txt

I used the cmake build system with this configuration:

    cmake ../ \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX:PATH=/opt/opencl-caffe \
        -DCMAKE_CXX_FLAGS=-Wdeprecated-declarations \
        -DOPENCL_INCLUDE_DIRS=/usr/include/CL \
        -DBLAS=open \
        -DOpenBLAS_INCLUDE_DIR=/opt/OpenBLAS/include \
        -DOPENCL_LIBRARIES=/usr/lib/arm-linux-gnueabihf/libOpenCL.so

I built opencl-caffe and tested it on an ARM Mali GPU, but I don't know why runtest fails all the test cases with GPUDevice.

Thank you.