dividiti / ck-caffe

Collective Knowledge workflow for Caffe to automate installation across diverse platforms and to collaboratively evaluate and optimize Caffe-based workloads across diverse hardware, software and data sets (compilers, libraries, tools, models, inputs):
http://cKnowledge.org
BSD 3-Clause "New" or "Revised" License

Failing to run the "00-classification.ipynb" example in GPU mode with an ARM Mali GPU #118

Closed. XiaoMaol closed this issue 7 years ago.

XiaoMaol commented 7 years ago

Hey @psyhtest and @gfursin:

I really appreciate your help so far. Sorry that I have run into more problems again. I installed Caffe as you suggested, with the command

ck install package:lib-caffe-bvlc-opencl-clblast-universal --env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON --env.CK_HOST_CPU_NUMBER_OF_PROCESSORS=2 --env.CAFFE_BUILD_PYTHON=ON --env.CK_MAKE_CMD2="make pycaffe"

I have some problems running the example "00-classification.ipynb" provided by Caffe in the directory

~/CK-TOOLS/lib-caffe-bvlc-opencl-clblast-master-gcc-5.4.0-linux-32/install/examples

I run it with the command

ck xset env tags=lib,caffe && . ./tmp-ck-env.bat && jupyter notebook

I can import caffe perfectly and make the CPU load the weights, deploy, and run, but I cannot make the GPU run. The program dies at the lines

caffe.set_device(0)  # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()
net.forward()  # run once before timing to set up memory
%timeit net.forward()

It produced the error message:

Error Message:
I0724 17:09:56.982797 25836 device.cpp:56] CL_DEVICE_HOST_UNIFIED_MEMORY: disabled
std::exception

When I opened the debugger, the error message changed (see the attached screenshot caffe_mali_1_2).

Can you reproduce the error? Have you successfully run the Caffe example on the Mali GPU?

P.S.

This reminded me of the installation command for ck-caffe suggested by @psyhtest in #114:

ck install package:lib-caffe-bvlc-opencl-clblast-universal \
  --env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON \
  --env.CK_HOST_CPU_NUMBER_OF_PROCESSORS=2

Especially

--env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON 

What does this flag do? Why does it cause the kernel to die?

Furthermore, I have tried to install with the command

ck install package:lib-caffe-bvlc-opencl-clblast-universal \
  --env.CK_HOST_CPU_NUMBER_OF_PROCESSORS=2

without the flag

--env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON 

With this build, I can import caffe, but when I try to load the weights, the kernel dies after the lines

net = caffe.Net(model_def,      # defines the structure of the model
                model_weights,  # contains the trained weights
                caffe.TEST)     # use test mode (e.g., don't perform dropout)

with an error (see the attached screenshot ck-caffe-issue3); the error on the command line is shown in ck-caffe-issue4.

It died while attempting to load the weight data.

psyhtest commented 7 years ago

@XiaoMaol First, let me explain what the DISABLE_DEVICE_HOST_UNIFIED_MEMORY flag does. It was introduced as a workaround for a rather annoying ambiguity in the OpenCL specification, which says of the CL_DEVICE_HOST_UNIFIED_MEMORY property of clGetDeviceInfo():

Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.

Now, what exactly a "unified memory subsystem" is, is open to interpretation. People coming from a desktop and server background assume that memory is unified in the OpenCL 2.0 sense: the host and the device can share the same pointers, so no copy between the host address space and the device address space is required. As a consequence, you can see code like this:

#define ZEROCOPY_SUPPORTED(device, ptr, size) \
             (device->is_host_unified())
<...>
              CHECK_EQ(mapped_ptr, cpu_ptr_)
                << "Device claims it support zero copy"
                << " but failed to create correct user ptr buffer";

(see syncedmem.cpp)

Unfortunately, mobile GPU vendors started returning CL_TRUE even for OpenCL 1.x implementations. In their interpretation, the CPU and the GPU in a system-on-chip typically share the same physical memory; therefore, they argue, the memory subsystem is unified. In fact, the CPU and the GPU cannot share the same pointers. You can hopefully see how this is at odds with the expectation of applications like OpenCL Caffe, and leads to checks similar to the one above failing at runtime. (This is probably what you are getting when trying to load the weights.)

To work around this issue, you can explicitly specify at build time that you wish Caffe to ignore whatever the driver says and to assume that the memory subsystem is not unified. As you have probably guessed by now, you use DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON to do that.
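For illustration, here is a minimal sketch of querying that property directly. It assumes pyopencl is installed (it is not part of this workflow) and just mirrors the clGetDeviceInfo() call:

import pyopencl as cl  # assumption: pyopencl is available; used only for illustration

# Print what each OpenCL device reports for CL_DEVICE_HOST_UNIFIED_MEMORY.
# On a Mali GPU with an OpenCL 1.x driver this typically prints True,
# even though the host and the device cannot actually share pointers.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        unified = device.get_info(cl.device_info.HOST_UNIFIED_MEMORY)
        print(platform.name, '/', device.name, '-> host unified memory:', unified)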

Now, the output you posted seems to suggest that disabling this property leads to an exception:

Error Message:
I0724 17:09:56.982797 25836 device.cpp:56] CL_DEVICE_HOST_UNIFIED_MEMORY: disabled
std::exception

I believe, however, this is just an unfortunate intermixing of log info (which gets printed even when everything goes well) with the exception message.

psyhtest commented 7 years ago

You may now ask why you get this Jupyter error in the first place. The honest answer is that I have no idea :). But if I were to guess, device querying seems to work somewhat differently for OpenCL. For example, for program:caffe, the query_gpu_cuda command looks like:

"run_cmd_main": "$<<CK_CAFFE_BIN>>$ device_query --gpu=$<<CAFFE_COMPUTE_DEVICE_ID>>$"

while the query_gpu_opencl command looks like:

"run_cmd_main": "$<<CK_CAFFE_BIN>>$ device_query"

So there may well be some difference in how a device gets selected too.

Maybe @naibaf7 has a better idea?

naibaf7 commented 7 years ago

Without having read the whole context: it is important to call set_mode_gpu() before set_device(), otherwise it will fail. But this is the only "more strict" rule compared to CUDA Caffe.
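For reference, a minimal sketch of that call order (device 0 is just an example):

import caffe

caffe.set_mode_gpu()   # with OpenCL Caffe, select the GPU mode first...
caffe.set_device(0)    # ...and only then pick the device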

psyhtest commented 7 years ago

From @XiaoMaol's initial comment, the order of these calls is reversed:

caffe.set_device(0)  # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()

XiaoMaol commented 7 years ago

Hey, @naibaf7 and @psyhtest:

I have tried reversing the order of the calls, but it still does not work. I think the problem is somewhere else, but I do not know where.

What other information can I provide to give a better description of the problem? Thanks!

XiaoMaol commented 7 years ago

Today, I tried Caffe without the Python layer, and this is what I get.

When I tried to run with GPU device 0, the program was killed (see the attached screenshots caffe_gpu_device_0_trial_1 and caffe_gpu_device0_trail_1_result).

When I ran the program with device 1, I got the following (notice that the first iteration takes around 6 minutes, then the program is killed): see the attached screenshots caffe_gpu_device1_trail1_command and caffe_gpu_device1_trail1_results.

However, when I tried to run on device 1 again, the program seemed to freeze.

When I tried Caffe under gdb, the program also just seemed to freeze.

naibaf7 commented 7 years ago

Sounds a lot like a faulty driver. Can you try an absolutely minimal network with just one fully connected layer, to determine whether Caffe works on that driver at all?
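For example, something along these lines (a minimal sketch; the layer sizes and the file name are arbitrary) would exercise a single InnerProduct layer on the GPU:

import caffe

# A minimal deploy definition: one input blob and one fully connected layer.
minimal_prototxt = """
name: "minimal"
layer { name: "data" type: "Input" top: "data"
        input_param { shape { dim: 1 dim: 3 dim: 32 dim: 32 } } }
layer { name: "fc1" type: "InnerProduct" bottom: "data" top: "fc1"
        inner_product_param { num_output: 10 } }
"""
with open('minimal.prototxt', 'w') as f:
    f.write(minimal_prototxt)

caffe.set_mode_gpu()    # mode first, then device (see the note above)
caffe.set_device(0)
net = caffe.Net('minimal.prototxt', caffe.TEST)
out = net.forward()     # if this succeeds, the driver can at least run a simple kernel
print(out['fc1'].shape)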

XiaoMaol commented 7 years ago

Hey @naibaf7, @psyhtest: Caffe can run a smaller network. In addition, when I reduce the input batch size of AlexNet from 10 to 1, it also works. Thus, it seems to be a problem of memory rather than of Caffe itself. However, I suspect the crash is due to the memory of my device rather than GPU memory. I am also using the ARM Compute Library; its AlexNet also crashes when the input batch size is more than 5 or 6. (Here is the issue: https://github.com/ARM-software/ComputeLibrary/issues/190)

The memory of the device is shown in the attached screenshot (screenshot from 2017-08-02 01-40-38).

The platform is

Linux odroid 3.10.103-124 #1 SMP PREEMPT Tue Oct 11 11:51:06 UTC 2016 armv7l armv7l armv7l GNU/Linux

I am just wondering: is 2 GB of memory too small to run Caffe properly?

The GPU info is shown in the attached screenshots mali_gpu_info1 and mali_gpu_info2.

psyhtest commented 7 years ago

@XiaoMaol On Odroid and similar system-on-a-chip platforms, the total system memory (e.g. 2 GB) is shared between all the devices: the GPU doesn't have any dedicated memory like desktop or server cards do. When you run out of this memory, you cannot use the GPU either.
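As a rough, hedged back-of-envelope illustration (assuming the usual ~61 million FP32 parameters of AlexNet/CaffeNet; the exact numbers depend on the model and build):

# Why 2 GB of shared memory gets tight with AlexNet-sized models.
params = 61 * 10**6                      # approximate AlexNet/CaffeNet parameter count (assumption)
weights_mb = params * 4 / 1024.0**2      # FP32 weights
print('weights alone: ~%.0f MB' % weights_mb)   # ~233 MB

# With DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON, Caffe keeps a host copy and a device
# copy of each blob, and on a SoC both come out of the same 2 GB that also holds
# the OS, the Python process and the activations, so larger batches soon run out.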