BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

OpenCL Memory Mapping fails on integrated GPU platform #5895

Open gplhegde opened 7 years ago

gplhegde commented 7 years ago

Hi, I am trying OpenCL Caffe on an ASUS TinkerBoard, which has an integrated Mali GPU. I am able to compile OpenCL Caffe with ViennaCL and successfully run the OpenCL layer tests (test log below). However, when I try to run the AlexNet model in GPU mode, mapping the device buffer fails, leading to the assertion below.

### Issue Log

```
linaro@linaro-alip:/opt/opencl-caffe$ ./distribute/bin/caffe.bin time -model ./models/bvlc_alexnet/deploy.prototxt -gpu 0
I1125 05:02:52.698902 32653 caffe.cpp:391] Use GPU with device ID 0
I1125 05:02:52.701169 32653 device.cpp:62] CL_DEVICE_HOST_UNIFIED_MEMORY: 1
...
I1125 05:03:11.154116 32653 net.cpp:281] Memory required for data: 83232440
I1125 05:03:11.154492 32653 caffe.cpp:406] Performing Forward
F1125 05:03:17.930557 32653 syncedmem.cpp:256] Check failed: mapped_ptr == cpu_ptr_ (0 vs. 0x8cd89000) Device claims it support zero copy but failed to create correct user ptr buffer
*** Check failure stack trace: ***
    @ 0xb6bc2e2e  google::LogMessage::Fail()
    @ 0xb6bc442a  google::LogMessage::SendToLog()
    @ 0xb6bc2adc  google::LogMessage::Flush()
    @ 0xb6bc4a4c  google::LogMessageFatal::~LogMessageFatal()
    @ 0xb6ea4464  caffe::SyncedMemory::mutable_gpu_data()
    @ 0xb6de363c  caffe::Blob<>::mutable_gpu_data()
    @ 0xb6edcd58  caffe::LRNLayer<>::CrossChannelForward_gpu()
    @ 0xb6edcbce  caffe::LRNLayer<>::Forward_gpu()
    @ 0xb6dc90f2  caffe::Net<>::ForwardFromTo()
    @ 0xb6dc9348  caffe::Net<>::Forward()
    @ 0x7f64a3d6  time()
    @ 0x7f64710a  main
    @ 0xb46844aa  __libc_start_main
Aborted
```

### OpenCL Layer Test Logs

```
linaro@linaro-alip:/opt/opencl-caffe$ ./build/test/test_all.testbin --gtest_filter=*OpenCLKernelCompileTest* 0
Setting to use device 0
Note: Google Test filter = *OpenCLKernelCompileTest*
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from OpenCLKernelCompileTest/0, where TypeParam = float
[ RUN      ] OpenCLKernelCompileTest/0.TestCompile
Kernel bundle: activation: OK
Kernel bundle: auxiliary: OK
Kernel bundle: batch_norm: OK
Kernel bundle: batch_reindex: OK
Kernel bundle: benchmark: OK
Kernel bundle: bias: OK
Kernel bundle: bnll: OK
Kernel bundle: channel: OK
Kernel bundle: concat: OK
Kernel bundle: contrastive_loss: OK
Kernel bundle: conv_layer_spatial: OK
Kernel bundle: conv_spatial_helper: OK
Kernel bundle: crop: OK
Kernel bundle: dropout: OK
Kernel bundle: eltwise: OK
Kernel bundle: elu: OK
Kernel bundle: embed: OK
Kernel bundle: fft: OK
Kernel bundle: fillbuffer: OK
Kernel bundle: im2col: OK
Kernel bundle: im2col_nd: OK
Kernel bundle: lrn: OK
Kernel bundle: lstm_unit: OK
Kernel bundle: math: OK
Kernel bundle: mergecrop: OK
Kernel bundle: pooling: OK
Kernel bundle: pooling_nd: OK
Kernel bundle: pooling_sk: OK
Kernel bundle: slice: OK
Kernel bundle: softmax_loss: OK
Kernel bundle: solvers: OK
Kernel bundle: tile: OK
[       OK ] OpenCLKernelCompileTest/0.TestCompile (1998 ms)
[----------] 1 test from OpenCLKernelCompileTest/0 (1998 ms total)
[----------] 1 test from OpenCLKernelCompileTest/1, where TypeParam = double
[ RUN      ] OpenCLKernelCompileTest/1.TestCompile
Kernel bundle: activation: OK
Kernel bundle: auxiliary: OK
Kernel bundle: batch_norm: OK
Kernel bundle: batch_reindex: OK
Kernel bundle: benchmark: OK
Kernel bundle: bias: OK
Kernel bundle: bnll: OK
Kernel bundle: channel: OK
Kernel bundle: concat: OK
Kernel bundle: contrastive_loss: OK
Kernel bundle: conv_layer_spatial: OK
Kernel bundle: conv_spatial_helper: OK
Kernel bundle: crop: OK
Kernel bundle: dropout: OK
Kernel bundle: eltwise: OK
Kernel bundle: elu: OK
Kernel bundle: embed: OK
Kernel bundle: fft: OK
Kernel bundle: fillbuffer: OK
Kernel bundle: im2col: OK
Kernel bundle: im2col_nd: OK
Kernel bundle: lrn: OK
Kernel bundle: lstm_unit: OK
Kernel bundle: math: OK
Kernel bundle: mergecrop: OK
Kernel bundle: pooling: OK
Kernel bundle: pooling_nd: OK
Kernel bundle: pooling_sk: OK
Kernel bundle: slice: OK
Kernel bundle: softmax_loss: OK
Kernel bundle: solvers: OK
Kernel bundle: tile: OK
[       OK ] OpenCLKernelCompileTest/1.TestCompile (1927 ms)
[----------] 1 test from OpenCLKernelCompileTest/1 (1927 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (3925 ms total)
[  PASSED  ] 2 tests.
```

### Device Query

```
linaro@linaro-alip:/opt/opencl-caffe$ ./distribute/bin/caffe.bin device_query -gpu all
I1125 05:00:58.776921 32602 common.cpp:433] Total devices: 1
I1125 05:00:58.777464 32602 common.cpp:434] CUDA devices: 0
I1125 05:00:58.777488 32602 common.cpp:435] OpenCL devices: 1
I1125 05:00:58.777501 32602 common.cpp:459] Device id: 0
I1125 05:00:58.777516 32602 common.cpp:461] Device backend: OpenCL
I1125 05:00:58.777529 32602 common.cpp:463] Backend details: ARM: OpenCL 1.2 v1.r9p0-05rel0-git(f980191).e4ba9e4c6ff8005348d0332aae160089
I1125 05:00:58.777585 32602 common.cpp:465] Device vendor: ARM
I1125 05:00:58.777607 32602 common.cpp:467] Name: Mali-T760
I1125 05:00:58.777631 32602 common.cpp:469] Total global memory: 2110091264
```

### Platform details

Platform: ASUS TinkerBoard

OS:

```
linaro@linaro-alip:/opt/opencl-caffe$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.0 (stretch)
Release:        9.0
Codename:       stretch
```

Compiler:

```
linaro@linaro-alip:/opt/opencl-caffe$ g++ --version
g++ (Debian 6.3.0-18) 6.3.0 20170516
```

Thanks for the help.
Gopal
naibaf7 commented 7 years ago

This is a known issue. Please compile Caffe with the following option: DISABLE_DEVICE_HOST_UNIFIED_MEMORY

gongzg commented 7 years ago

@gplhegde The zero-copy path requires that the clEnqueueMapBuffer API return the same address as the host pointer. This assertion indicates that your platform does not meet this requirement. Maybe it's better to disable unified memory automatically in this case. @naibaf7 what do you think?
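For what it's worth, a minimal standalone probe along these lines (outside Caffe; the `CL_MEM_USE_HOST_PTR` flags and 4 KiB alignment here are assumptions, not necessarily what Caffe's `SyncedMemory` actually requests) can confirm whether the driver honours zero copy, i.e. whether clEnqueueMapBuffer hands back the original host pointer:

```cpp
// Probe: does clEnqueueMapBuffer on a CL_MEM_USE_HOST_PTR buffer return the
// same address we allocated? This is the condition the failing CHECK in
// syncedmem.cpp is testing. Error handling is kept minimal on purpose.
#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t bytes = 1 << 20;
  void* host_ptr = nullptr;
  // Page-aligned allocation; integrated GPUs typically need this for zero copy.
  if (posix_memalign(&host_ptr, 4096, bytes) != 0) return 1;

  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

  cl_bool unified = CL_FALSE;
  clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(unified),
                  &unified, nullptr);
  std::printf("CL_DEVICE_HOST_UNIFIED_MEMORY: %d\n", (int)unified);

  cl_int err = CL_SUCCESS;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  // Wrap the existing host allocation and map it back.
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              bytes, host_ptr, &err);
  void* mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                    CL_MAP_READ | CL_MAP_WRITE, 0, bytes,
                                    0, nullptr, nullptr, &err);
  std::printf("host_ptr=%p mapped=%p -> zero copy %s\n", host_ptr, mapped,
              mapped == host_ptr ? "honoured" : "NOT honoured");

  clEnqueueUnmapMemObject(queue, buf, mapped, 0, nullptr, nullptr);
  clFinish(queue);
  clReleaseMemObject(buf);
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);
  std::free(host_ptr);
  return 0;
}
```

If this prints "NOT honoured" even though `CL_DEVICE_HOST_UNIFIED_MEMORY` is 1, that matches the behaviour behind the assertion reported above.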

gplhegde commented 7 years ago

Thanks for your response. Disabling unified memory resolves the assertion failure. However, I see two major issues:

  1. The computation results differ between CPU and GPU mode.
  2. GPU mode is 5-6x slower than CPU mode (using OpenBLAS), as confirmed with `caffe time` on the AlexNet model. Is this because of the initial tuning of the OpenCL kernels?

Do you see any way to overcome these? clEnqueueMapBuffer works in other sample OpenCL applications that I have tested on this platform, so I am wondering what special requirement applies here.
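A sketch of one way to compare the two modes on identical input (written against the stock Caffe C++ API with placeholder model paths; the OpenCL branch's device-selection calls may differ slightly):

```cpp
// Sketch: run the same deterministic input through the net in CPU and GPU
// mode and report the largest absolute difference over the output blob.
// Model/weight paths are placeholders; error handling is omitted.
#include <caffe/caffe.hpp>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using caffe::Blob;
using caffe::Caffe;
using caffe::Net;

static std::vector<float> RunForward(Caffe::Brew mode) {
  Caffe::set_mode(mode);
  if (mode == Caffe::GPU) Caffe::SetDevice(0);

  Net<float> net("models/bvlc_alexnet/deploy.prototxt", caffe::TEST);
  net.CopyTrainedLayersFrom("models/bvlc_alexnet/bvlc_alexnet.caffemodel");

  // Fill the input blob with the same deterministic pattern in both runs.
  Blob<float>* input = net.input_blobs()[0];
  float* data = input->mutable_cpu_data();
  for (int i = 0; i < input->count(); ++i) data[i] = 0.01f * (i % 255);

  net.Forward();
  const Blob<float>* out = net.output_blobs()[0];
  return std::vector<float>(out->cpu_data(), out->cpu_data() + out->count());
}

int main() {
  const std::vector<float> cpu_out = RunForward(Caffe::CPU);
  const std::vector<float> gpu_out = RunForward(Caffe::GPU);

  float max_abs_diff = 0.f;
  for (size_t i = 0; i < cpu_out.size(); ++i)
    max_abs_diff = std::max(max_abs_diff, std::fabs(cpu_out[i] - gpu_out[i]));
  std::printf("max |CPU - GPU| over the output blob: %g\n", max_abs_diff);
  return 0;
}
```

The same approach can be extended to walk the intermediate blobs (`net.blobs()` / `net.blob_names()`) to see which layer first diverges.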

naibaf7 commented 7 years ago

@gplhegde So far, OpenCL Caffe has only been optimized for AMD, nVidia and Intel GPUs. Mali and Adreno GPUs are not running optimally yet; however, I have an ASUS TinkerBoard myself and am working on suitable kernels for those chips as well.

gplhegde commented 6 years ago

@naibaf7 Thanks for your response. I understand the reason behind the slow performance. However, I am wondering why functional correctness is an issue here. The results I get in GPU mode differ by a large margin from CPU mode. Are you seeing any such issues? Small differences on the order of 1e-6 are expected since the float ops are carried out on different hardware, but I see differences like 0.6 vs 0.9!

naibaf7 commented 6 years ago

@gplhegde I haven't verified functional correctness on Mali either. Such differences should not occur, but sometimes there are issues in the vendors' OpenCL implementations. So far I have only verified results on nVidia, AMD and Intel GPUs. However, a large update that will hopefully also take care of some Mali-related issues will be released soon.

liuyajian commented 6 years ago

@gplhegde @naibaf7 @gongzg Hi guys. When I test with AlexNet it works well, but when I test with GoogLeNet the same problem as @gplhegde reports happens. My device query output is the same as @gplhegde's. What can I do about it?