Duplicate; see here: https://github.com/BVLC/caffe/issues/4179
Oh OK, sorry, I see you already use ViennaCL-DEV. In that case we must ask @gongzg from Intel if he knows what could cause the issue.
This layer will not be used in actual networks though; if all the other tests run and ConvolutionLayerSpatial is the only failure, it's fine.
The other runtest cases pass, but I can't run an actual classification, so I thought the failing runtest was the cause.
Loading file: ../../../pictures/101_ObjectCategories/airplanes/image_0001.jpg
Classifying 1 inputs.
ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program ''
Number of kernels in program: 0
std::exception
Segmentation fault (core dumped)
@YutaOtsuka Ok, that's interesting, let's see then:
./build/test/test_all.testbin --gtest_filter=*OpenCLKernelCompileTest* 0
clinfo
./build/tools/caffe device_query
These outputs might give us a hint as to what is going on. I myself have a GTX 980 with the latest driver which works well in OpenCL mode, so the Titan X should be no different.
I ran your command. The output was as follows.
I0520 15:07:56.841437 5367 common.cpp:373] Total devices: 1
I0520 15:07:56.841604 5367 common.cpp:374] CUDA devices: 0
I0520 15:07:56.841610 5367 common.cpp:375] OpenCL devices: 1
I0520 15:07:56.841615 5367 common.cpp:399] Device id: 0
I0520 15:07:56.841620 5367 common.cpp:401] Device backend: OpenCL
I0520 15:07:56.841631 5367 common.cpp:403] Backend details: NVIDIA Corporation: OpenCL 1.2 CUDA 7.5.23
I0520 15:07:56.841637 5367 common.cpp:405] Device vendor: NVIDIA Corporation
I0520 15:07:56.841681 5367 common.cpp:407] Name: GeForce GTX TITAN X
I0520 15:07:56.841718 5367 common.cpp:409] Total global memory: 12884705280
@YutaOtsuka What about the other commands?
Oh, sorry. This is the output of ./build/test/test_all.testbin --gtest_filter=OpenCLKernelCompileTest 0:
Note: Google Test filter = OpenCLKernelCompileTest
[==========] Running 0 tests from 0 test cases.
[==========] 0 tests from 0 test cases ran. (0 ms total)
[ PASSED ] 0 tests.
This is the clinfo output.
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 CUDA 7.5.23
Platform Name: NVIDIA CUDA
Platform Vendor: NVIDIA Corporation
Platform Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
Platform Name: NVIDIA CUDA
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4318
Max compute units: 24
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 1
Native vector width short: 1
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1215Mhz
Address bits: 64
Max memory allocation: 3221176320
Image support: Yes
Max number of images read arguments: 256
Max number of images write arguments: 16
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 4096
Max image 3D height: 4096
Max image 3D depth: 4096
Max samplers within kernel: 32
Max size of kernel argument: 4352
Alignment (bits) of base address: 4096
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 128
Cache size: 393216
Global memory size: 12884705280
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Local
Local memory size: 49152
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1000
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x252b560
Name: GeForce GTX TITAN X
Vendor: NVIDIA Corporation
Device OpenCL C version: OpenCL C 1.2
Driver version: 352.68
Profile: FULL_PROFILE
Version: OpenCL 1.2 CUDA
Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64
@YutaOtsuka
There is an error in how you ran the test. The command is:
./build/test/test_all.testbin --gtest_filter=*OpenCLKernelCompileTest* 0
Note that the filter keyword must be surrounded by asterisks (*).
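To illustrate why the filter without asterisks selected 0 tests, here is a minimal Python sketch. It approximates Google Test's wildcard matching with the standard `fnmatch` module, which is an assumption made for illustration: the real filter also supports `:`-separated patterns and negative patterns, which are ignored here. The helper name `select_tests` is hypothetical; the test names are taken from the output in this thread.

```python
from fnmatch import fnmatchcase

# Full test names as they appear in the Google Test output in this thread.
TEST_NAMES = [
    "OpenCLKernelCompileTest/0.TestCompile",
    "OpenCLKernelCompileTest/1.TestCompile",
]


def select_tests(pattern, names):
    """Return the names matched by the whole pattern (approximation of gtest)."""
    # The pattern has to match the full name, not just a substring of it.
    return [name for name in names if fnmatchcase(name, pattern)]


without_wildcards = select_tests("OpenCLKernelCompileTest", TEST_NAMES)
with_wildcards = select_tests("*OpenCLKernelCompileTest*", TEST_NAMES)

print(len(without_wildcards))  # 0, the bare keyword is not a full test name
print(len(with_wildcards))     # 2, the asterisks cover "/0.TestCompile" etc.
```

Depending on the shell, the pattern may need quoting so that the asterisks are not expanded as file globs before the test binary sees them.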
Here is the output. Thank you.
Setting to use device 0
Note: Google Test filter = *OpenCLKernelCompileTest*
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from OpenCLKernelCompileTest/0, where TypeParam = float
[ RUN ] OpenCLKernelCompileTest/0.TestCompile
Kernel bundle: activation: OK
Kernel bundle: auxiliary: OK
Kernel bundle: batch_reindex: OK
Kernel bundle: benchmark: OK
Kernel bundle: bias: OK
Kernel bundle: bnll: OK
Kernel bundle: channel: OK
Kernel bundle: concat: OK
Kernel bundle: contrastive_loss: OK
Kernel bundle: conv_layer_spatial: OK
Kernel bundle: crop: OK
Kernel bundle: dropout: OK
Kernel bundle: eltwise: OK
Kernel bundle: elu: OK
Kernel bundle: embed: OK
Kernel bundle: fft: OK
Kernel bundle: fillbuffer: OK
Kernel bundle: im2col: OK
Kernel bundle: im2col_nd: OK
Kernel bundle: lrn: OK
Kernel bundle: math: OK
Kernel bundle: mergecrop: OK
Kernel bundle: pooling: OK
Kernel bundle: pooling_nd: OK
Kernel bundle: pooling_sk: OK
Kernel bundle: slice: OK
Kernel bundle: softmax_loss: OK
Kernel bundle: solvers: OK
Kernel bundle: tile: OK
[ OK ] OpenCLKernelCompileTest/0.TestCompile (8 ms)
[----------] 1 test from OpenCLKernelCompileTest/0 (8 ms total)
[----------] 1 test from OpenCLKernelCompileTest/1, where TypeParam = double
[ RUN ] OpenCLKernelCompileTest/1.TestCompile
Kernel bundle: activation: OK
Kernel bundle: auxiliary: OK
Kernel bundle: batch_reindex: OK
Kernel bundle: benchmark: OK
Kernel bundle: bias: OK
Kernel bundle: bnll: OK
Kernel bundle: channel: OK
Kernel bundle: concat: OK
Kernel bundle: contrastive_loss: OK
Kernel bundle: conv_layer_spatial: OK
Kernel bundle: crop: OK
Kernel bundle: dropout: OK
Kernel bundle: eltwise: OK
Kernel bundle: elu: OK
Kernel bundle: embed: OK
Kernel bundle: fft: OK
Kernel bundle: fillbuffer: OK
Kernel bundle: im2col: OK
Kernel bundle: im2col_nd: OK
Kernel bundle: lrn: OK
Kernel bundle: math: OK
Kernel bundle: mergecrop: OK
Kernel bundle: pooling: OK
Kernel bundle: pooling_nd: OK
Kernel bundle: pooling_sk: OK
Kernel bundle: slice: OK
Kernel bundle: softmax_loss: OK
Kernel bundle: solvers: OK
Kernel bundle: tile: OK
[ OK ] OpenCLKernelCompileTest/1.TestCompile (8 ms)
[----------] 1 test from OpenCLKernelCompileTest/1 (8 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (16 ms total)
[ PASSED ] 2 tests.
@YutaOtsuka So it looks like everything is fine. That's odd. What does your classification code look like? Share as much as possible.
I just used the provided classify.py, like this:
python ../../python/classify.py --model_def ./VGG_ILSVRC_16_layers_deploy.prototxt
--pretrained_model ./VGG_ILSVRC_16_layers.caffemodel
--gpu
--raw_scale 255 ../../../pictures/101_ObjectCategories/airplanes/image_0001.jpg
./result.npy
@YutaOtsuka Ah yes, the PyCaffe code did not initialize the OpenCL GPU correctly. I fixed it now. The downside of that PyCaffe code is that it only works with the first OpenCL/CUDA device present, which is a bit stupid but oh well, at least it should work for you now.
Where can I get the new PyCaffe code?
@YutaOtsuka Just pull the latest version of the OpenCL-Caffe repository.
It worked correctly. Thank you very much!
@YutaOtsuka I wonder why you'd use OpenCL-Caffe instead of CUDA-Caffe on a Titan X though. It is quite a bit slower still.
I just wanted to look into where OpenCL is heading. If possible, I'd like to use it on an FPGA, but that is only an idea for now.
@YutaOtsuka Sounds good. Yes, big speed improvements are in progress.
Hi, I ran into the same problem with OpenCL Caffe on an Nvidia GTX970.
I renamed test_convolution_layer_spatial.cpp to test_convolution_layer_spatial.log (with mv), then ran:
make clean
make all
make test
make runtest
and I got this:
[----------] Global test environment tear-down
[==========] 2034 tests from 274 test cases ran. (480505 ms total)
[ PASSED ] 2033 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NetTest/0.TestSharedWeightsUpdate, where TypeParam = caffe::CPUDevice<float>
Then I tried to train LeNet following http://caffe.berkeleyvision.org/gathered/examples/mnist.html. I just added one line to layer conv1 in "examples/mnist/lenet_train_test.prototxt":
engine: SPATIAL
When I train the net with ./examples/mnist/train_lenet.sh
I get this:
I0622 14:02:52.293876 2826 solver.cpp:111] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 52:5: Unknown enumeration value of "SPATIAL" for field "engine".
F0622 14:02:52.294034 2826 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/mnist/lenet_train_test.prototxt
*** Check failure stack trace: ***
@ 0x7faeca694daa (unknown)
@ 0x7faeca694ce4 (unknown)
@ 0x7faeca6946e6 (unknown)
@ 0x7faeca697687 (unknown)
@ 0x7faecaa92ebe caffe::ReadNetParamsFromTextFileOrDie()
@ 0x7faecaac99cb caffe::Solver<>::InitTrainNet()
@ 0x7faecaac9eb6 caffe::Solver<>::Init()
@ 0x7faecaaca1c6 caffe::Solver<>::Solver()
@ 0x7faecaac3c03 caffe::Creator_SGDSolver<>()
@ 0x415c07 caffe::SolverRegistry<>::CreateSolver()
@ 0x40f323 train()
@ 0x40cd6c main
@ 0x7faec979cf45 (unknown)
@ 0x40d56b (unknown)
@ (nil) (unknown)
Aborted (core dumped)
What do you think?
@buaapengbo
It's fine that the shared weights update test fails; you can ignore it.
The SPATIAL engine is mainly for Intel chips, and I believe the correct identifier is now INTEL_SPATIAL.
See here for the available engines:
enum Engine {
DEFAULT = 0;
CAFFE = 1;
CUDNN = 2;
LIBDNN = 3;
INTEL_SPATIAL = 4;
FFT = 5;
}
And you have to enable USE_INTEL_SPATIAL := 1
in the Makefile.
This engine is mainly for Intel GPUs though. Use CAFFE, DEFAULT or LIBDNN for most Nvidia and AMD chips.
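As a rough illustration of why the parser rejected `SPATIAL`, here is a small Python sketch that checks an engine name against the enum values quoted above. The helper `check_engine` is hypothetical and only mirrors the lookup; it is not part of Caffe or protobuf.

```python
# Engine names and numbers copied from the enum quoted in this thread.
ENGINES = {
    "DEFAULT": 0,
    "CAFFE": 1,
    "CUDNN": 2,
    "LIBDNN": 3,
    "INTEL_SPATIAL": 4,
    "FFT": 5,
}


def check_engine(name):
    """Return the enum number for a known engine name, else raise ValueError."""
    if name not in ENGINES:
        known = ", ".join(sorted(ENGINES))
        raise ValueError(f'Unknown engine "{name}"; expected one of: {known}')
    return ENGINES[name]


print(check_engine("INTEL_SPATIAL"))  # accepted

try:
    check_engine("SPATIAL")
except ValueError as err:
    print(err)  # rejected, like the prototxt parse error reported above
```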
@naibaf7
Thank you for your quick reply!
I understand that the failing TestSharedWeightsUpdate test can be ignored.
I only have an Intel i7-6700 CPU and an Nvidia GTX970. Does this mean that I do not need to set USE_INTEL_SPATIAL := 1 in the Makefile.config?
I set engine: CAFFE in the prototxt file and trained with:
./examples/mnist/train_lenet.sh
I got these messages:
I0622 14:32:23.470811 3920 solver.cpp:251] Iteration 800, loss = 0.216637
I0622 14:32:23.470849 3920 solver.cpp:267] Train net output #0: loss = 0.216637 (* 1 = 0.216637 loss)
I0622 14:32:23.470856 3920 sgd_solver.cpp:112] Iteration 800, lr = 0.00943913
I0622 14:32:27.685243 3920 solver.cpp:251] Iteration 900, loss = 0.154349
I0622 14:32:27.685279 3920 solver.cpp:267] Train net output #0: loss = 0.154349 (* 1 = 0.154349 loss)
I0622 14:32:27.685287 3920 sgd_solver.cpp:112] Iteration 900, lr = 0.00937411
I0622 14:32:31.858384 3920 solver.cpp:479] Snapshotting to binary proto file examples/mnist/lenet_iter_1000.caffemodel
I0622 14:32:31.869387 3920 sgd_solver.cpp:323] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_1000.solverstate
I0622 14:32:31.909396 3920 solver.cpp:341] Iteration 1000, loss = 0.0869865
I0622 14:32:31.909431 3920 solver.cpp:362] Iteration 1000, Testing net (#0)
I0622 14:32:35.642645 3920 solver.cpp:429] Test net output #0: accuracy = 0.981
I0622 14:32:35.642683 3920 solver.cpp:429] Test net output #1: loss = 0.0592155 (* 1 = 0.0592155 loss)
I0622 14:32:35.642691 3920 solver.cpp:346] Optimization Done.
I0622 14:32:35.642696 3920 caffe.cpp:249] Optimization Done.
I have a question: did this run use OpenCL on the Nvidia GPU?
@buaapengbo If you have disabled the CUDA backend in the Makefile.config or set the -gpu flag to 1 instead of 0 (when enabling the CUDA backend as well), then yes, you used OpenCL.
You can also test which devices will be used with this command:
./build/tools/caffe device_query
If you have only OpenCL it will be:
0: GTX 970 OpenCL
If you have OpenCL and CUDA enabled:
0: GTX 970 CUDA
1: GTX 970 OpenCL
If you have installed the Intel OpenCL SDK, the i7-6700 will also show up. If you have installed Beignet OpenCL and enabled the i7-6700 iGPU, that will show up as well.
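To check which device id maps to which backend without reading the log by eye, the `device_query` output can be parsed with a few lines of Python. This is only a sketch: the sample lines are excerpted from a two-device `device_query` log posted in this thread, the helper `parse_devices` is hypothetical, and it assumes the log keeps the `Device id:` / `Device backend:` / `Name:` fields in that order.

```python
import re

# Lines excerpted from a device_query log posted in this thread
# (CUDA and OpenCL backends both enabled on a single GTX 970).
SAMPLE_LOG = """\
I0623 13:30:33.995340 14615 common.cpp:382] Device id: 0
I0623 13:30:33.995348 14615 common.cpp:384] Device backend: CUDA
I0623 13:30:33.995376 14615 common.cpp:390] Name: GeForce GTX 970
I0623 13:30:33.995391 14615 common.cpp:399] Device id: 1
I0623 13:30:33.995398 14615 common.cpp:401] Device backend: OpenCL
I0623 13:30:33.995440 14615 common.cpp:407] Name: GeForce GTX 970
"""

# Only the three fields of interest; other log lines are skipped.
FIELD = re.compile(r"\] (Device id|Device backend|Name): (.+)$")


def parse_devices(log_text):
    """Collect one dict per device from device_query style log lines."""
    devices = []
    for line in log_text.splitlines():
        match = FIELD.search(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == "Device id":
            # A new "Device id" line starts the next device record.
            devices.append({"id": int(value), "backend": None, "name": None})
        elif devices and key == "Device backend":
            devices[-1]["backend"] = value
        elif devices and key == "Name":
            devices[-1]["name"] = value
    return devices


devices = parse_devices(SAMPLE_LOG)
for device in devices:
    print(device["id"], device["backend"], device["name"])
```

The id printed for the OpenCL entry is the number to pass to the -gpu flag when both backends are enabled.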
@naibaf7 Thank you for your reply.
My Makefile.config is:
# USE_CUDA := 1
USE_GREENTEA := 1
I also run the command:
./build/tools/caffe device_query
got this:
pengbo@FPGA-Accel-Server:~/cnns/git/caffe$ ./build/tools/caffe device_query
I0623 11:07:02.536579 5995 common.cpp:373] Total devices: 1
I0623 11:07:02.536739 5995 common.cpp:374] CUDA devices: 0
I0623 11:07:02.536747 5995 common.cpp:375] OpenCL devices: 1
I0623 11:07:02.536752 5995 common.cpp:399] Device id: 0
I0623 11:07:02.536757 5995 common.cpp:401] Device backend: OpenCL
I0623 11:07:02.536769 5995 common.cpp:403] Backend details: NVIDIA Corporation: OpenCL 1.2 CUDA 7.5.18
I0623 11:07:02.536777 5995 common.cpp:405] Device vendor: NVIDIA Corporation
I0623 11:07:02.536806 5995 common.cpp:407] Name: GeForce GTX 970
I0623 11:07:02.536837 5995 common.cpp:409] Total global memory: 4294770688
I haven't installed the Intel OpenCL SDK or Beignet OpenCL. Does this mean that I am using OpenCL on the GTX970, and not CUDA?
And if I enable the CUDA backend, I can use the command:
./examples/mnist/train_lenet.sh -gpu0
to train and test on device 0 (CUDA by default), and
./examples/mnist/train_lenet.sh -gpu1
to train and test on device 1 (OpenCL by default).
I understand, thank you very much!
@buaapengbo Exactly, you got that right :) cool, isn't it? You can also try to compile with USE_LIBDNN, which should give better performance for OpenCL and CUDA. It's slower than cuDNN but faster than cuBLAS/clBLAS/ViennaCL.
@naibaf7 That's very cool! I modified my Makefile.config:
USE_CUDA := 1
USE_GREENTEA := 1
then
make clean
make all
make test
make runtest
.build_release/tools/caffe device_query
I got this:
pengbo@FPGA-Accel-Server:~/cnns/git/caffe$ .build_release/tools/caffe device_query
I0623 13:30:33.994971 14615 common.cpp:373] Total devices: 2
I0623 13:30:33.995152 14615 common.cpp:374] CUDA devices: 1
I0623 13:30:33.995160 14615 common.cpp:375] OpenCL devices: 1
I0623 13:30:33.995340 14615 common.cpp:382] Device id: 0
I0623 13:30:33.995348 14615 common.cpp:384] Device backend: CUDA
I0623 13:30:33.995368 14615 common.cpp:386] Backend details: CUDA
I0623 13:30:33.995373 14615 common.cpp:388] Device vendor: NVIDIA Corporation
I0623 13:30:33.995376 14615 common.cpp:390] Name: GeForce GTX 970
I0623 13:30:33.995381 14615 common.cpp:392] Total global memory: 4294770688
I0623 13:30:33.995391 14615 common.cpp:399] Device id: 1
I0623 13:30:33.995398 14615 common.cpp:401] Device backend: OpenCL
I0623 13:30:33.995410 14615 common.cpp:403] Backend details: NVIDIA Corporation: OpenCL 1.2 CUDA 7.5.18
I0623 13:30:33.995417 14615 common.cpp:405] Device vendor: NVIDIA Corporation
I0623 13:30:33.995440 14615 common.cpp:407] Name: GeForce GTX 970
I0623 13:30:33.995471 14615 common.cpp:409] Total global memory: 4294770688
OK, I can train the net with -gpu0 on CUDA and with -gpu1 on OpenCL!
Thank you very much for your help! @naibaf7
I was getting the same error, ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '', even though the Caffe tests shown above passed.
Explicitly setting the device in the Python code, e.g. caffe.set_device(0), fixed the problem in my case.
Repo: BVLC/caffe
Branch: opencl
Commit: 72edcdc
Thanks!
@ubergarm Yes, OpenCL Caffe requires explicit device initialization and cannot default to the primary device like CUDA Caffe does. The test code does call caffe.set_device(x), where x is the device number passed to the test suite on the command line.
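A minimal sketch of the explicit initialization described here, using the pycaffe calls `set_device` and `set_mode_gpu`. The wrapper `init_gpu` is hypothetical, and the call order (device first, then GPU mode) is an assumption rather than something this thread confirms for the opencl branch. The caffe module is passed in as an argument so that the snippet can be read and exercised without a Caffe build.

```python
def init_gpu(caffe_module, device_id=0):
    """Select the device explicitly before any net is created.

    caffe_module is expected to be the imported pycaffe module.
    The order of the two calls is an assumption, not confirmed here.
    """
    caffe_module.set_device(device_id)  # explicit device selection
    caffe_module.set_mode_gpu()         # then switch from CPU to GPU mode
    return device_id


# Typical use, assuming pycaffe is importable:
#   import caffe
#   init_gpu(caffe, 0)
#   net = caffe.Net(model_def, pretrained_model, caffe.TEST)
```

The device id should be one of the ids listed by ./build/tools/caffe device_query.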
@naibaf7
Hello, I got the errors below from opencl-caffe during runtest:
ViennaCL: FATAL ERROR: Could not find kernel 'im2col_float' from program ''
Number of kernels in program: 0
unknown file: Failure
C++ exception with description "Kernel not found" thrown in the test body.
[ FAILED ] ConvolutionLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice
I ran the commands you recommended above. The results are:
clinfo
Number of platforms 1
Platform Name ARM Platform
Platform Vendor ARM
Platform Version OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
Platform Extensions function suffix ARM
Platform Name ARM Platform
Number of devices 2
Device Name Mali-T628
Device Vendor ARM
Device Vendor ID 0x6200010
Device Version OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
Driver Version 1.2
Device OpenCL C Version OpenCL C 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
Device Type GPU
Device Profile FULL_PROFILE
Max compute units 4
Max clock frequency 600MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 4
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 2086998016 (1.944GiB)
Error Correction support No
Max memory allocation 521749504 (497.6MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 32 bytes
Pitch alignment for 2D image buffers 16 bytes
Max 2D image size 65536x65536 pixels
Max 3D image size 65536x65536x65536 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Global
Local memory size 32768 (32KiB)
Max constant buffer size 65536 (64KiB)
Max number of constant args 8
Max size of kernel argument 1024
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
Device Name Mali-T628
Device Vendor ARM
Device Vendor ID 0x6200010
Device Version OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
Driver Version 1.2
Device OpenCL C Version OpenCL C 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
Device Type GPU
Device Profile FULL_PROFILE
Max compute units 2
Max clock frequency 600MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 4
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 2086998016 (1.944GiB)
Error Correction support No
Max memory allocation 521749504 (497.6MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 32 bytes
Pitch alignment for 2D image buffers 16 bytes
Max 2D image size 65536x65536 pixels
Max 3D image size 65536x65536x65536 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Global
Local memory size 32768 (32KiB)
Max constant buffer size 65536 (64KiB)
Max number of constant args 8
Max size of kernel argument 1024
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) ARM Platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [ARM]
clCreateContext(NULL, ...) [default] Success [ARM]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (2)
Platform Name ARM Platform
Device Name Mali-T628
Device Name Mali-T628
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (2)
Platform Name ARM Platform
Device Name Mali-T628
Device Name Mali-T628
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.8
ICD loader Profile OpenCL 1.2
NOTE: your OpenCL library declares to support OpenCL 1.2, but it seems to support up to OpenCL 2.1 too.
./build/tools/caffe device_query
I0218 02:06:41.756657 30356 common.cpp:379] Total devices: 2
I0218 02:06:41.757541 30356 common.cpp:380] CUDA devices: 0
I0218 02:06:41.757691 30356 common.cpp:381] OpenCL devices: 2
I0218 02:06:41.757817 30356 common.cpp:405] Device id: 0
I0218 02:06:41.757937 30356 common.cpp:407] Device backend: OpenCL
I0218 02:06:41.758064 30356 common.cpp:409] Backend details: ARM: OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
I0218 02:06:41.758239 30356 common.cpp:411] Device vendor: ARM
I0218 02:06:41.758369 30356 common.cpp:413] Name: Mali-T628
I0218 02:06:41.758502 30356 common.cpp:415] Total global memory: 2086998016
I0218 02:06:41.758649 30356 common.cpp:405] Device id: 1
I0218 02:06:41.758767 30356 common.cpp:407] Device backend: OpenCL
I0218 02:06:41.758891 30356 common.cpp:409] Backend details: ARM: OpenCL 1.2 v1.r14p0-01rel0.0fe2d25ca074016740f8ab3fb451b151
I0218 02:06:41.759027 30356 common.cpp:411] Device vendor: ARM
I0218 02:06:41.759150 30356 common.cpp:413] Name: Mali-T628
I0218 02:06:41.759344 30356 common.cpp:415] Total global memory: 2086998016
And the final command, ./build/test/test_all.testbin --gtest_filter=OpenCLKernelCompileTest 0, produced the results in the attached log1.txt file.
I used the cmake build system with this configuration:
cmake ../ \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX:PATH=/opt/opencl-caffe \
-DCMAKE_CXX_FLAGS=-Wdeprecated-declarations \
-DOPENCL_INCLUDE_DIRS=/usr/include/CL \
-DBLAS=open \
-DOpenBLAS_INCLUDE_DIR=/opt/OpenBLAS/include \
-DOPENCL_LIBRARIES=/usr/lib/arm-linux-gnueabihf/libOpenCL.so
I built opencl-caffe and tested it on an ARM Mali GPU, but I don't know why runtest fails every test case that uses GPUDevice.
Thank you.
Hi, I'm trying to use OpenCL-Caffe on a GeForce GTX TITAN X. "make all" and "make test" succeed, but I get a segmentation fault in "make runtest". My environment is Ubuntu 14.04, a GeForce GTX TITAN X and OpenCL 1.2. I didn't change the Makefile.config. I'm using viennacl-dev. What do you think?