@XiaoMaol First, let me explain what the DISABLE_DEVICE_HOST_UNIFIED_MEMORY flag does. It was introduced as a workaround for a rather annoying ambiguity in the OpenCL specification, which says of the CL_DEVICE_HOST_UNIFIED_MEMORY property of clGetDeviceInfo():

> Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.
Now, what exactly a "unified memory subsystem" is remains open to interpretation. People coming from a desktop and server background assumed that memory is unified in the OpenCL 2.0 sense: the host and the device can share the same pointers; therefore, no copy between the host address space and the device address space is required. As a consequence, you can see code like this:
```cpp
#define ZEROCOPY_SUPPORTED(device, ptr, size) \
  (device->is_host_unified())
<...>
CHECK_EQ(mapped_ptr, cpu_ptr_)
    << "Device claims it support zero copy"
    << " but failed to create correct user ptr buffer";
```
(see syncedmem.cpp)
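If you want to see what your own driver reports for this property, here is a minimal sketch using pyopencl (pyopencl is just my choice for illustration; OpenCL Caffe itself obtains the value via clGetDeviceInfo()):

```python
# Minimal sketch: print what each OpenCL device reports for
# CL_DEVICE_HOST_UNIFIED_MEMORY (requires: pip install pyopencl).
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        unified = device.get_info(cl.device_info.HOST_UNIFIED_MEMORY)
        print(platform.name, "/", device.name,
              "-> CL_DEVICE_HOST_UNIFIED_MEMORY =", bool(unified))
```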
Unfortunately, mobile GPU vendors started returning CL_TRUE even for OpenCL 1.x implementations. In their interpretation, the CPU and the GPU in a system-on-chip typically share the same physical memory; therefore, they argue, the memory subsystem is unified. In fact, the CPU and the GPU cannot share the same pointers. You will hopefully see how this is at odds with the expectations of applications like OpenCL Caffe, and leads to checks like the one above failing at runtime. (This is probably what you are getting when trying to load the weights.) To work around this issue, you can explicitly specify at build time that you wish Caffe to ignore whatever the driver is saying and assume that the memory subsystem is not unified. As you have probably guessed by now, you use DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON to do that.
Now, the output you posted seems to suggest that disabling this property leads to an exception:
Error Message:

```
I0724 17:09:56.982797 25836 device.cpp:56] CL_DEVICE_HOST_UNIFIED_MEMORY: disabled
std::exception
```
I believe, however, this is just an unfortunate intermixing of log info (which gets printed even when everything goes well) with the exception message.
You may now ask why you get this Jupyter error in the first place. The honest answer is that I have no idea :). But if I were to guess, device querying works somewhat differently for OpenCL. For example, for program:caffe, the query_gpu_cuda command looks like:

```
"run_cmd_main": "$<<CK_CAFFE_BIN>>$ device_query --gpu=$<<CAFFE_COMPUTE_DEVICE_ID>>$"
```
while the query_gpu_opencl command looks like:

```
"run_cmd_main": "$<<CK_CAFFE_BIN>>$ device_query"
```
So there may well be some difference in how a device gets selected too.
Maybe @naibaf7 has a better idea?
Without having read the whole context, it is just important to use set_mode_gpu() before set_device(), otherwise it will fail. But this is the only "more strict" rule compared to CUDA Caffe.
From @XiaoMaol's initial comment, the order of these calls is reversed:
```python
caffe.set_device(0)  # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()
```
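For reference, a snippet following @naibaf7's rule (mode first, then device) would look like this:

```python
import caffe

caffe.set_mode_gpu()   # select the GPU backend first
caffe.set_device(0)    # then pick the device (0 = first GPU)
```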
Hey, @naibaf7 and @psyhtest:
I have tried to reverse the order of the calls, but it still does not work. I think the problem is somewhere else, but I do not know where.
What other information can I provide to give a better description of the problem? Thanks!
Today, I tried Caffe without the Python layer, and this is what I get.
When I try to run with GPU device 0, the program is killed:
When I run the program with device 1, I get the output below (notice that the first iteration takes around 6 minutes, then the program is killed):
However, when I tried to run on device 1 again, the program seems to freeze.
When I ran Caffe under gdb, the program also just seems to freeze.
Sounds a lot like a faulty driver. Can you try an absolutely minimal network with just one fully connected layer, to determine whether Caffe works on that driver at all?
Hey @naibaf7, @psyhtest: Caffe can run a smaller network. In addition, when I reduce the AlexNet input batch size from 10 to 1, it also works. Thus, it seems to be a memory problem rather than a problem with Caffe itself. However, I suspect the crash is due to my device's system memory rather than GPU memory. I am currently also using the ARM Compute Library, and its AlexNet example also crashes when the input batch size is more than 5 or 6. (Here is the issue: https://github.com/ARM-software/ComputeLibrary/issues/190)
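(For anyone reproducing this, here is a minimal sketch of the batch-size workaround; the file names below are placeholders for the AlexNet deploy and weights files, not exact paths from this setup:)

```python
# Sketch: load AlexNet and shrink the input batch from 10 to 1 to reduce
# the memory footprint on the 2 GB board. Paths are placeholders.
import caffe

caffe.set_mode_gpu()
caffe.set_device(0)

net = caffe.Net('deploy.prototxt',          # placeholder: AlexNet deploy file
                'bvlc_alexnet.caffemodel',  # placeholder: AlexNet weights
                caffe.TEST)

# AlexNet takes 3 x 227 x 227 inputs; keep only a single image per batch.
net.blobs['data'].reshape(1, 3, 227, 227)
net.reshape()
```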
The memory of the device is:
The platform is:

```
Linux odroid 3.10.103-124 #1 SMP PREEMPT Tue Oct 11 11:51:06 UTC 2016 armv7l armv7l armv7l GNU/Linux
```

I am just wondering: is 2 GB of memory too small to run Caffe properly?
The GPU info is:
@XiaoMaol On Odroid and similar system-on-a-chip platforms, the total system memory (e.g. 2 GB) is shared between all the devices. The GPU doesn't have any dedicated memory like on desktop or server cards. When you run out of this memory, you cannot use the GPU either.
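A quick way to check how much of that shared pool is left while your network runs is to read /proc/meminfo on the board. Here is a small illustrative helper (plain Python, nothing Caffe-specific):

```python
# Sketch: report the free system memory in KiB. On an Odroid-style SoC this is
# the same pool the Mali GPU allocates from, so a low value starves the GPU too.
def free_system_memory_kib(path='/proc/meminfo'):
    with open(path) as f:
        for line in f:
            if line.startswith('MemFree:'):
                return int(line.split()[1])  # MemFree is reported in KiB
    return None

if __name__ == '__main__':
    print('MemFree:', free_system_memory_kib(), 'KiB')
```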
Hey @psyhtest and @gfursin:
I really appreciate your help so far. Sorry that I have run into more problems again. I installed my Caffe as suggested by you, with the command:
I have some problems running the example "00-classification.ipynb" provided by Caffe in the directory:
I run it with the command:
I can import caffe and make the CPU load the weights, deploy, and run without problems, but I cannot make the GPU run. The program died at the line:
It produced the error message:
When I opened the debugger, the error message changed to:
Can you reproduce the error? Have you successfully run the Caffe example on the Mali GPU?
---

P.S.
Suddenly, it reminded me of the command for installing ck-caffe suggested by @psyhtest in #114:
Especially the DISABLE_DEVICE_HOST_UNIFIED_MEMORY flag: what does this flag do? Why does it cause the kernel to die?
Furthermore, I have tried to install with the command:
without the flag.
With this option, I can import caffe, but when I try to load the weights, the program's kernel dies after the line:
with an error. The error in the command line is:
It died while attempting to load the data for the weights.