naibaf7 opened this issue 6 years ago
Getting OpenCL Caffe to run on VC4CL would be great, and I will help you in any way I can.
A few issues I know of:
You can always reach me here on GitHub or via doe300@web.de.
@doe300 Thanks for your answer :)
You are not using the right OpenCL Caffe branch (you're using the discontinued one from AMD, I think). The official ones and my upcoming one do not use OpenCL C++, so that should be good to go from that perspective. All kernels live in their own compilation unit and are generated at runtime, which also makes it possible to add kernel quirks based on detection of the VC4CL backend.
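For illustration, such backend detection could be as simple as matching the reported device name at runtime; a hypothetical sketch (not Caffe's actual code), assuming a valid cl_device_id is at hand:

#include <string.h>
#include <CL/cl.h>

/* Hypothetical helper: returns non-zero if the device looks like the
 * VC4CL VideoCore IV backend, so kernel quirks can be switched on. */
static int is_vc4cl_device(cl_device_id device) {
    char name[256] = {0};
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name) - 1, name, NULL);
    return strstr(name, "VideoCore IV") != NULL;
}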
It's fine if not all kernels run right away. We have unit tests, and if more complex networks can't be run, this won't be an issue, because they wouldn't run fast enough anyway.
Caffe has unit tests, and I will test and approve a "confirmed working" model zoo of models that give the same results on VC4CL as on AMD, NVIDIA, ARM and Intel's respective implementations (which, by the way, also have some hefty bugs here and there ;))
I'll get back to you as soon as I have more details. I managed to compile VC4C, VC4CL and install the ICD, but my Caffe version reports "CL_DEVICE_NOT_AVAILABLE" when compiling an OpenCL kernel. Any ideas what I'm doing wrong?
By the way... since cross-compiling Caffe is not an option and compiling on the Raspberry Pi takes a good while (4 hours, maybe? I haven't timed it precisely), do you have a workflow for emulating the Pi's architecture (QEMU, etc.) that you can share?
my Caffe version reports "CL_DEVICE_NOT_AVAILABLE" when compiling an OpenCL kernel
Do you know where the error is thrown? Can you compile a sample program, e.g. via cd VC4C && ./build/VC4C --hex -o /dev/null ./example/hello_world.cl?
[...] do you have a workflow for using emulation of the PI's architecture [...]
We cross-compile VC4C(L) on CircleCI to check for build errors and to provide Debian packages. The tests are run on a real Raspberry Pi; I am not aware of an emulator which also includes the VC4 GPU. I recently created an emulator for the VC4 to be able to debug the execution of kernel code, but it is not integrated into the VC4CL runtime and needs to be executed separately.
No... actually it fails:
./build/VC4C --hex -o ../example/test.clptx ../example/hello_world.cl
[E] Tue Feb 13 00:59:56 2018: (1) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0x76c918c4]
[E] Tue Feb 13 00:59:56 2018: (2) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::intermediate::insertBitcast(vc4c::InstructionWalker, vc4c::Method&, vc4c::Value const&, vc4c::Value const&, vc4c::intermediate::InstructionDecorations)+0x84 [0x76da9cd4]
[E] Tue Feb 13 00:59:56 2018: (3) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::Copy::mapInstruction(vc4c::Method&) const+0x12c [0x76e4a680]
[E] Tue Feb 13 00:59:56 2018: (4) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::mapInstructions(vc4c::llvm2qasm::LLVMMethod&) const+0xe8 [0x76e0372c]
[E] Tue Feb 13 00:59:56 2018: (5) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parseMethod()+0x1748 [0x76df3ff8]
[E] Tue Feb 13 00:59:56 2018: (6) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parse(vc4c::Module&)+0x6b0 [0x76dece8c]
[E] Tue Feb 13 00:59:56 2018: (7) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0xfc [0x76c93680]
[E] Tue Feb 13 00:59:56 2018: (8) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x338 [0x76c93de8]
[E] Tue Feb 13 00:59:56 2018: (9) ./build/VC4C : main+0xabc [0x20c64]
[E] Tue Feb 13 00:59:56 2018: (10) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x76667678]
[D] Tue Feb 13 00:59:57 2018: Temporary file '/tmp/vc4c-pcl3oM' deleted
[E] Tue Feb 13 00:59:57 2018: Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!
I reckon the Clang compiler that comes with Debian is not compatible?
No, the default Clang is okay. As the error message states, the kernel requires a feature which is not yet implemented (bit-casting across different vector sizes, e.g. bit-casting int2 to short4; see also https://github.com/doe300/VC4C/issues/35).
So your setup is fine, the kernel is just not yet supported.
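For illustration, a minimal kernel hitting this limitation could look like the following hypothetical example (reinterpreting the 64 bits of an int2 as a short4 keeps the data but changes the vector element count from 2 to 4):

// Hypothetical kernel triggering the "bit-casts across different
// vector-sizes" error: as_short4() reinterprets the 64 bits of an int2
// as a short4, changing the element count from 2 to 4.
__kernel void bitcast_demo(__global const int2* in, __global short4* out) {
    size_t gid = get_global_id(0);
    out[gid] = as_short4(in[gid]);
}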
This is the result on Caffe:
I0213 20:23:48.559901 3692 caffe.cpp:397] Use GPU with device ID 0
I0213 20:23:48.621196 3692 ocl_device.cpp:61] CL_DEVICE_HOST_UNIFIED_MEMORY: 1
E0213 20:23:48.701433 3692 ocl_device_program.cpp:113] Failed to compile OpenCL binary (d70a82a6) from code (CL_DEVICE_NOT_AVAILABLE)
E0213 20:23:48.729940 3692 ocl_device_program.cpp:160] Failed to load OpenCL kernels (d70a82a6) (CL_INVALID_PROGRAM_EXECUTABLE)
Fails at clBuildProgram: https://github.com/naibaf7/caffe/blob/master/src/caffe/backend/opencl/ocl_device_program.cpp#L109
Can you provide me with an error log or the kernel-code that failed compiling?
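(For reference, the compiler log of a failed clBuildProgram can be read back from the program object via the standard clGetProgramBuildInfo call; a minimal sketch, with program and device assumed to be the objects from the failing build:)

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch: print the build log after clBuildProgram failed. */
static void print_build_log(cl_program program, cl_device_id device) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);
    char *log = malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);
    log[log_size] = '\0';
    fprintf(stderr, "Build log:\n%s\n", log);
    free(log);
}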
This happens:
[D] Wed Feb 14 19:39:21 2018: Mapping LLVM instructions to immediates:
[D] Wed Feb 14 19:39:21 2018: Generating label label %0
[D] Wed Feb 14 19:39:21 2018: Generating bit-cast from i16 %val into <2 x i8> %1
[E] Wed Feb 14 19:39:21 2018: (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0x76baf8c4]
[E] Wed Feb 14 19:39:21 2018: (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::intermediate::insertBitcast(vc4c::InstructionWalker, vc4c::Method&, vc4c::Value const&, vc4c::Value const&, vc4c::intermediate::InstructionDecorations)+0x84 [0x76cc7cd4]
[E] Wed Feb 14 19:39:21 2018: (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::Copy::mapInstruction(vc4c::Method&) const+0x12c [0x76d68680]
[E] Wed Feb 14 19:39:21 2018: (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::mapInstructions(vc4c::llvm2qasm::LLVMMethod&) const+0xe8 [0x76d2172c]
[E] Wed Feb 14 19:39:21 2018: (5) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parseMethod()+0x1748 [0x76d11ff8]
[E] Wed Feb 14 19:39:21 2018: (6) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parse(vc4c::Module&)+0x6b0 [0x76d0ae8c]
[E] Wed Feb 14 19:39:21 2018: (7) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0xfc [0x76bb1680]
[E] Wed Feb 14 19:39:21 2018: (8) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x338 [0x76bb1de8]
[E] Wed Feb 14 19:39:21 2018: (9) VC4C : main+0xabc [0x20c64]
[E] Wed Feb 14 19:39:21 2018: (10) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x765a2678]
[D] Wed Feb 14 19:39:22 2018: Temporary file '/tmp/vc4c-7GYr5f' deleted
[E] Wed Feb 14 19:39:22 2018: Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!
On this kernel:
#define int8_t char
#define int16_t short
#define int32_t int
#define int64_t long
#define uint8_t uchar
#define uint16_t ushort
#define uint32_t uint
#define uint64_t ulong
#if defined(cl_khr_fp64)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#endif
#if defined(cl_khr_fp16)
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#define HALF_SUPPORT_AVAILABLE
#endif
#ifdef int_tp
#undef int_tp
#endif //int_tp
#define int_tp int32_t
#ifdef uint_tp
#undef uint_tp
#endif //uint_tp
#define uint_tp uint32_t
#ifdef int_tpc
#undef int_tpc
#endif //int_tpc
#define int_tpc int32_t
#ifdef uint_tpc
#undef uint_tpc
#endif //uint_tpc
#define uint_tpc uint32_t
__kernel void caffe_gpu_memset(const uint32_t n, const char alpha,
                               __global char* y_raw_ptr, const uint_tp y_offset) {
  // Fill n bytes of y with the constant alpha, grid-strided over work-items.
  __global char* y = y_raw_ptr + y_offset;
  for (uint_tp index = get_global_id(0); index < (n); index += get_global_size(0)) {
    y[index] = alpha;
  }
}

// Kernel with no observable effect (takes an argument and does nothing).
__kernel void caffe_gpu_null_kernel(float arg) {
  float out = arg;
}
But only when I use VC4C directly on it, which takes a while (2 minutes). From within Caffe, the "CL_DEVICE_NOT_AVAILABLE" appears immediately when clBuildProgram is called.
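(As a sanity check for these symptoms, the device's own availability flag can be queried through the standard API; a minimal sketch, with the VC4CL cl_device_id assumed:)

#include <CL/cl.h>

/* Sketch: an immediate CL_DEVICE_NOT_AVAILABLE from clBuildProgram is
 * worth cross-checking against the device's CL_DEVICE_AVAILABLE flag. */
static cl_bool device_is_available(cl_device_id device) {
    cl_bool available = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_AVAILABLE,
                    sizeof(available), &available, NULL);
    return available;
}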
Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!
I recently implemented support for bit-cast instructions in https://github.com/doe300/VC4C/commit/6d84690f45c9a46a1acefbe5ebc94727fe5767df. The kernel should pass compilation with an updated version of VC4C, at least it does so on my setup.
Which takes a while (2 minutes)
That is probably because VC4C uses the fall-back LLVM IR parser. If you re-build VC4C with the CMake option LLVMLIB_FRONTEND enabled, it should take only a few seconds. This requires the llvm-3.9-dev (or llvm-dev) package to be installed.
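(Assuming a standard CMake workflow, that would be something along the lines of cmake -DLLVMLIB_FRONTEND=ON .. in the build directory, followed by a rebuild.)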
OK, I managed to compile the kernel standalone. Still no luck with Caffe though (device not available, as above).
I also noticed that clinfo reports:
Number of platforms 1
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Platform Vendor doe300
Platform Version OpenCL 1.2 VC4CL 0.4
Platform Profile EMBEDDED_PROFILE
Platform Extensions cl_khr_il_program cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
Platform Extensions function suffix VC4CL
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices 1
Device Name VideoCore IV GPU
Device Vendor Broadcom
Device Vendor ID 0xa5c
Device Version OpenCL 1.2 VC4CL 0.4
Driver Version 0.4
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile EMBEDDED_PROFILE
Max compute units 1
Max clock frequency 300MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Max work item dimensions 3
Max work item sizes 12x12x12
Max work group size 12
Preferred work group size multiple 1
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 16 / 16
long 0 / 0
half 0 / 0 (n/a)
float 16 / 16
double 0 / 0 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero Yes
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 134217728 (128MiB)
Error Correction support No
Max memory allocation 134217728 (128MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 64 bytes
Alignment of base address 512 bits (64 bytes)
Global Memory cache type Read/Write
Global Memory cache size <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
Global Memory cache line 64 bytes
Image support No
Local memory type Global
Local memory size 134217728 (128MiB)
Max constant buffer size 134217728 (128MiB)
Max number of constant args 64
Max size of kernel argument 256
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 0
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory
So it says 128MB GPU memory available. I have configured the memory split on the Raspberry Pi to have 256MB for the GPU though and run the device headless. Why is it reported lower?
With tests from ViennaCL and CLBlast I get further than with Caffe. I think some OpenCL command that Caffe executes before clBuildProgram kills the context of a VC4CL instance, so I will have to figure out where and why that happens first.
So it says 128MB GPU memory available. I have configured the memory split on the Raspberry Pi to have 256MB for the GPU though and run the device headless. Why is it reported lower?
This was on purpose: only half of the available GPU RAM is reported, so there is space left for kernel code and graphics buffers. But you are right, it is confusing when the number configured in the memory split is not shown. I will change the memory limits in an upcoming commit. Do you have any minimal size of a single buffer which needs to be supported?
I think some OpenCL command that Caffe executes before clBuildProgram kills the context of a VC4CL instance [...]
Does Caffe compile and link separately, or does it link multiple programs? If so, then the issue might also be fixed in an upcoming commit. I am just waiting for https://github.com/KhronosGroup/SPIRV-Tools/pull/1297 to be merged, so I can push the changes depending on it to VC4C and VC4CL.
@doe300 Minimal buffer size: No, that will depend entirely on what kind of network is used in Caffe. I do not know enough about VC4 GPUs to understand whether there are underlying limits to buffer sizes (are there pointer restrictions?). Being able to use as much as possible is always nice for deep learning.
Caffe compiles each kernel (such as the one I posted above) in its own compilation unit. No sources are linked together, so each is its own program. The kernels are small enough and don't share enough common code to justify the added complexity of linking them together.
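(A rough sketch of that per-source pattern with the standard OpenCL host API; ctx, device and the kernel source string are assumptions:)

#include <CL/cl.h>

/* Sketch: each kernel source string becomes its own program, built
 * independently -- no clCompileProgram/clLinkProgram step involved. */
static cl_program build_standalone(cl_context ctx, cl_device_id device,
                                   const char *source) {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    err = clBuildProgram(prog, 1, &device, "", NULL, NULL);
    if (err != CL_SUCCESS) { clReleaseProgram(prog); return NULL; }
    return prog;
}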
I'm still investigating which OpenCL command triggers the error (it might really be something before clCompileProgram), will get back here as soon as I know. Thanks.
There is of course the limit of GPU memory split, which determines the total GPU memory available. Other than that, it could be that the GPU has a limit for maximum space allocated by a single allocation, but if there is such a limit, it lies above 64 MB.
FWIW most GPU devices will report their total memory size for CL_DEVICE_GLOBAL_MEM_SIZE and a quarter of that (for some odd reason) for CL_DEVICE_MAX_MEM_ALLOC_SIZE.
[...] report their total memory size for CL_DEVICE_GLOBAL_MEM_SIZE and a quarter of that (for some odd reason) for CL_DEVICE_MAX_MEM_ALLOC_SIZE
Probably, because the OpenCL standard states:
CL_DEVICE_MAX_MEM_ALLOC_SIZE | cl_ulong | The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128 * 1024 * 1024) [...] |
Source: OpenCL 1.2 specification, table 4.3
Since VC4CL only supports the embedded profile, this criterion must hold instead:
CL_DEVICE_MAX_MEM_ALLOC_SIZE | cl_ulong | The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 1 * 1024 * 1024) [...] |
Source: OpenCL 1.2 specification, table 10.2 (no table numbering)
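(For the concrete numbers here: with VC4CL reporting 128MiB of global memory, the embedded-profile rule only requires max(128MiB / 4, 1MiB) = 32MiB as the minimum CL_DEVICE_MAX_MEM_ALLOC_SIZE. Both values can be queried with the standard API; a minimal sketch, with the cl_device_id assumed:)

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: query total global memory and the largest single allocation. */
static void print_mem_limits(cl_device_id device) {
    cl_ulong global_size = 0, max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_size), &global_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    printf("global: %llu bytes, max alloc: %llu bytes\n",
           (unsigned long long)global_size, (unsigned long long)max_alloc);
}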
@naibaf7 Hi, would you have a recipe to start from, for someone potentially giving this another try a year later?
@beniz Sure, give it a go and ask if you need to know something. First you'll have to set up a cross-compiling environment for ARMv7, maybe with Docker, and set the cross-compiling flags in the CMake configuration. Then try to cross-compile for the Raspberry Pi and see what happens when you move the compiled Caffe version to the Raspberry Pi and execute it.
Hi, I am the maintainer of OpenCL Caffe (https://github.com/naibaf7/caffe and https://github.com/BVLC/caffe/tree/opencl).
I would like to get this running on VC4CL, but I am facing some issues. At its core, though, it should be possible, because Caffe can now run solely on internal OpenCL kernels (no external OpenCL kernel library dependency, except for ViennaCL's infrastructure, but not its kernels). This also means full control over the work group sizes the kernels require, and it is possible to add quirks/workarounds specific to VC4CL.
Is there a way to chat with @doe300 directly? This may get quite involved (due to the scope of OpenCL Caffe).