Milestone: Get OpenCL Caffe to run on VC4CL

doe300 / VC4CL

OpenCL implementation running on the VideoCore IV GPU of the Raspberry Pi models

MIT License

729 stars 81 forks source link

Milestone: Get OpenCL Caffe to run on VC4CL #20

Open naibaf7 opened 6 years ago

naibaf7 commented 6 years ago

Hi, I am the maintainer of OpenCL Caffe (https://github.com/naibaf7/caffe) and (https://github.com/BVLC/caffe/tree/opencl).

I would like to get this running on VC4CL, but I am facing some issues. But at it's core, it should be possible because Caffe can be run solely based on internal OpenCL kernels now (no external OpenCL library dependency, except for ViennaCL's infrastructure (but not kernels). This also means full control on what work group sizes are required by the kernels, and it's possible to add quirks/workarounds specific to VC4CL.

Is there a way to chat with @doe300 directly, as this may get quite involved (due to the scope of OpenCL Caffe).

doe300 commented 6 years ago

Getting OpenCL caffe to run on VC4CL would be great and I will help you in any way I can.

A few issues I know of:

OpenCL caffe (at least the test samples I extracted from it here) uses OpenCL C++, which is not supported (read: was never tested) with VC4C. If the versions of CLang used supports OpenCL C++, this will not be an issue, otherwise a very big one. I will test this within the next few days.
There is currently a major issue with memory access (writing wrong values/not writing the correct ones), so more complex kernels probably don't behave as expected for now.
Since NNs usually run a lot of calculations before outputting a meaningful (at least for me) result, testing could be very hard. So we would need some kind of "unit tests" which can check the correctness of every single step as detailed as possible to determine and fix the exact errors (e.g. test the behaviour of a single neuron).

You can always reach me here on github or via doe300@web.de.

naibaf7 commented 6 years ago

@doe300 Thanks for your answer :)

You are not using the right OpenCL Caffe branch (you're using the discontinued one from AMD, I think). The official ones and my upcoming one do not use OpenCL C++. Should be good to go from that perspective. All kernels are in their own compilation unit and are generated at runtime, which also makes it possible to add kernel-quirks based on detection of the VC4CL backend.
It's fine if not all kernels run right away. We have unit tests, and if more complex networks can't be run, this won't be an issue because they wouldn't run fast enough anyways.
Caffe has unit tests, and I will test and approve a "confirmed working" model zoo of models that are giving the same results on VC4CL as well as AMD, nVidia, ARM and Intel's respective implementations (which, by the way, also have some hefty bugs here and there ;))

I'll get back to you as soon as I have more details. I managed to compile VC4C, VC4CL and install the ICD, but my Caffe version reports "CL_DEVICE_NOT_AVAILABLE" when compiling an OpenCL kernel. Any ideas what I'm doing wrong?

By the way... since cross compiling Caffe is not an option and compiling on the Raspberry PI takes a good while (4 hours, maybe?, haven't timed it precisely), do you have a workflow for using emulation of the PI's architecture (QEMU, etc?) that you can share?

doe300 commented 6 years ago

my Caffe version reports "CL_DEVICE_NOT_AVAILABLE" when compiling an OpenCL kernel

Do you know where the error is thrown? Can you compile a sample program e.g. by cd VC4C && ./build/VC4C --hex -o /dev/null ./example/hello_world.cl?

[...] do you have a workflow for using emulation of the PI's architecture [...]

We cross-compile VC4C(L) on CircleCI to check for build errors and provide debian packages. The tests are run on a real Raspberry, I am not aware of an emulator which also includes the VC4 GPU. I recently created an emulator for the VC4 to be able to debug execution of kernel code, but it is not integrated into the VC4CL runtime and needs to be executed seperatly.

naibaf7 commented 6 years ago

No... actually it fails: ./build/VC4C --hex -o ../example/test.clptx ../example/hello_world.cl

[E] Tue Feb 13 00:59:56 2018:  (1) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0x76c918c4]
[E] Tue Feb 13 00:59:56 2018:  (2) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::intermediate::insertBitcast(vc4c::InstructionWalker, vc4c::Method&, vc4c::Value const&, vc4c::Value const&, vc4c::intermediate::InstructionDecorations)+0x84 [0x76da9cd4]
[E] Tue Feb 13 00:59:56 2018:  (3) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::Copy::mapInstruction(vc4c::Method&) const+0x12c [0x76e4a680]
[E] Tue Feb 13 00:59:56 2018:  (4) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::mapInstructions(vc4c::llvm2qasm::LLVMMethod&) const+0xe8 [0x76e0372c]
[E] Tue Feb 13 00:59:56 2018:  (5) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parseMethod()+0x1748 [0x76df3ff8]
[E] Tue Feb 13 00:59:56 2018:  (6) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parse(vc4c::Module&)+0x6b0 [0x76dece8c]
[E] Tue Feb 13 00:59:56 2018:  (7) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0xfc [0x76c93680]
[E] Tue Feb 13 00:59:56 2018:  (8) /home/pi/eth_bsc/VC4/VC4C/build/build/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x338 [0x76c93de8]
[E] Tue Feb 13 00:59:56 2018:  (9) ./build/VC4C : main+0xabc [0x20c64]
[E] Tue Feb 13 00:59:56 2018:  (10) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x76667678]
[D] Tue Feb 13 00:59:57 2018: Temporary file '/tmp/vc4c-pcl3oM' deleted
[E] Tue Feb 13 00:59:57 2018: Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!

I reckon the clang compiler that comes with debian is not compatible?

doe300 commented 6 years ago

No, the default CLang is okay. As the error-message states, the kernel requires a feature which is not yet implemented (bit-casting across different vector-sizes, e.g. bit-casting int2 to short4, see also https://github.com/doe300/VC4C/issues/35). So your setup is fine, the kernel is just not yet supported.

naibaf7 commented 6 years ago

This is the result on Caffe:

I0213 20:23:48.559901  3692 caffe.cpp:397] Use GPU with device ID 0
I0213 20:23:48.621196  3692 ocl_device.cpp:61] CL_DEVICE_HOST_UNIFIED_MEMORY: 1
E0213 20:23:48.701433  3692 ocl_device_program.cpp:113] Failed to compile OpenCL binary (d70a82a6) from code (CL_DEVICE_NOT_AVAILABLE)
E0213 20:23:48.729940  3692 ocl_device_program.cpp:160] Failed to load OpenCL kernels (d70a82a6) (CL_INVALID_PROGRAM_EXECUTABLE)

Fails at clBuildProgram: https://github.com/naibaf7/caffe/blob/master/src/caffe/backend/opencl/ocl_device_program.cpp#L109

doe300 commented 6 years ago

Can you provide me with an error log or the kernel-code that failed compiling?

naibaf7 commented 6 years ago

This happens:

[D] Wed Feb 14 19:39:21 2018: Mapping LLVM instructions to immediates: 
[D] Wed Feb 14 19:39:21 2018: Generating label label %0
[D] Wed Feb 14 19:39:21 2018: Generating bit-cast from i16 %val into <2 x i8> %1
[E] Wed Feb 14 19:39:21 2018:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0x76baf8c4]
[E] Wed Feb 14 19:39:21 2018:  (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::intermediate::insertBitcast(vc4c::InstructionWalker, vc4c::Method&, vc4c::Value const&, vc4c::Value const&, vc4c::intermediate::InstructionDecorations)+0x84 [0x76cc7cd4]
[E] Wed Feb 14 19:39:21 2018:  (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::Copy::mapInstruction(vc4c::Method&) const+0x12c [0x76d68680]
[E] Wed Feb 14 19:39:21 2018:  (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::mapInstructions(vc4c::llvm2qasm::LLVMMethod&) const+0xe8 [0x76d2172c]
[E] Wed Feb 14 19:39:21 2018:  (5) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parseMethod()+0x1748 [0x76d11ff8]
[E] Wed Feb 14 19:39:21 2018:  (6) /usr/local/lib/libVC4CC.so.1.2 : vc4c::llvm2qasm::IRParser::parse(vc4c::Module&)+0x6b0 [0x76d0ae8c]
[E] Wed Feb 14 19:39:21 2018:  (7) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0xfc [0x76bb1680]
[E] Wed Feb 14 19:39:21 2018:  (8) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x338 [0x76bb1de8]
[E] Wed Feb 14 19:39:21 2018:  (9) VC4C : main+0xabc [0x20c64]
[E] Wed Feb 14 19:39:21 2018:  (10) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x765a2678]
[D] Wed Feb 14 19:39:22 2018: Temporary file '/tmp/vc4c-7GYr5f' deleted
[E] Wed Feb 14 19:39:22 2018: Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!

On this kernel:

#define int8_t char
#define int16_t short
#define int32_t int
#define int64_t long
#define uint8_t uchar
#define uint16_t ushort
#define uint32_t uint
#define uint64_t ulong
#if defined(cl_khr_fp64)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#endif
#if defined(cl_khr_fp16)
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#define HALF_SUPPORT_AVAILABLE
#endif
#ifdef int_tp
#undef int_tp
#endif  //int_tp
#define int_tp int32_t
#ifdef uint_tp
#undef uint_tp
#endif  //uint_tp
#define uint_tp uint32_t
#ifdef int_tpc
#undef int_tpc
#endif  //int_tpc
#define int_tpc int32_t
#ifdef uint_tpc
#undef uint_tpc
#endif  //uint_tpc
#define uint_tpc uint32_t
__kernel void caffe_gpu_memset(const uint32_t n, const char alpha, __global char* y_raw_ptr, const uint_tp y_offset) {
__global char* y = y_raw_ptr + y_offset;
for (uint_tp index = get_global_id(0); index < (n); index += get_global_size(0)) {
y[index] = alpha;
}
}
__kernel void caffe_gpu_null_kernel(float arg) {
float out = arg;
}

But only when I use VC4C directly on it. Which takes a while (2 minutes). From within Caffe, the "CL_DEVICE_NOT_AVAILABLE" appears immediately when clBuildProgram is called.

doe300 commented 6 years ago

Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!

I recently implemented support for bit-cast instructions in https://github.com/doe300/VC4C/commit/6d84690f45c9a46a1acefbe5ebc94727fe5767df. The kernel should pass compilation with an updated version of VC4C, at least it does so on my setup.

Which takes a while (2 minutes)

That is probably, because VC4C uses the fall-back LLVM IR parser. If you re-build VC4C with the CMake option LLVMLIB_FRONTEND enabled, it should take only a few seconds. This requires the llvm-3.9-dev (or llvm-dev) package to be installed.

naibaf7 commented 6 years ago

OK I managed to compile the kernel standalone. Still no luck with Caffe though (device not available, as above).

I also noticed that clinfo reports:

Number of platforms                               1
  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
  Platform Vendor                                 doe300
  Platform Version                                OpenCL 1.2 VC4CL 0.4
  Platform Profile                                EMBEDDED_PROFILE
  Platform Extensions                             cl_khr_il_program cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
  Platform Extensions function suffix             VC4CL

  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices                                 1
  Device Name                                     VideoCore IV GPU
  Device Vendor                                   Broadcom
  Device Vendor ID                                0xa5c
  Device Version                                  OpenCL 1.2 VC4CL 0.4
  Driver Version                                  0.4
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  EMBEDDED_PROFILE
  Max compute units                               1
  Max clock frequency                             300MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             12x12x12
  Max work group size                             12
  Preferred work group size multiple              1
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                               16 / 16      
    int                                                 16 / 16      
    long                                                 0 / 0       
    half                                                 0 / 0        (n/a)
    float                                               16 / 16      
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 Yes
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              134217728 (128MiB)
  Error Correction support                        No
  Max memory allocation                           134217728 (128MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             64 bytes
  Alignment of base address                       512 bits (64 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
  Global Memory cache line                        64 bytes
  Image support                                   No
  Local memory type                               Global
  Local memory size                               134217728 (128MiB)
  Max constant buffer size                        134217728 (128MiB)
  Max number of constant args                     64
  Max size of kernel argument                     256
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            0
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory

So it says 128MB GPU memory available. I have configured the memory split on the Raspberry PI to have 256MB for the GPU though and run the device headless. Why is it reported lower?

With tests from ViennaCL and CLBlast I get further than with Caffe. I think some OpenCL command that Caffe executes before clBuildProgram kills the context of a VC4CL instance, so I will have to figure out where and why that happens first.

doe300 commented 6 years ago

So it says 128MB GPU memory available. I have configured the memory split on the Raspberry PI to have 256MB for the GPU though and run the device headless. Why is it reported lower?

This was on purpose, that only half of the available GPURAM is returned (So there is space for kernel-code and graphics buffers). But you are right, it is confusing, when the number configured in the memory split is not shown. I will change the memory limits in an upcoming commit. Do you have any minimal size of a single buffer, which needs to be supported?

I think some OpenCL command that Caffe executes before clBuildProgram kills the context of a VC4CL instance [...]

Does Caffe compile and link separately, or does it link multiple programs? If so, then the issue might also be fixed in an upcoming commit. I am just waiting for https://github.com/KhronosGroup/SPIRV-Tools/pull/1297 to be merged, so I can push the changes depending on it to VC4C and VC4CL.

naibaf7 commented 6 years ago

@doe300 Minimal buffer size: No, that will depend entirely on what kind of network is used in Caffe. I do not know enough about VC4 GPUs to understand if there are underlying limits to buffer sizes (are there pointer restrictions?). Being able to use as much as possible is always nice for deep learning.

Caffe compiles all kernels (such as the one I posted above) in it's own compilation unit. No sources are linked together, so each is it's own program. The kernels are small enough and don't use enough common code to justify the added complexity of linking it together.

I'm still investigating which OpenCL command triggers the error (it might really be something before clCompileProgram), will get back here as soon as I know. Thanks.

doe300 commented 6 years ago

There is of course the limit of GPU memory split, which determines the total GPU memory available. Other than that, it could be that the GPU has a limit for maximum space allocated by a single allocation, but if there is such a limit, it lies above 64 MB.

magnumripper commented 6 years ago

FWIW most GPU devices will report their total memory size for CL_DEVICE_GLOBAL_MEM_SIZE and a quarter of that (for some odd reason) for CL_DEVICE_MAX_MEM_ALLOC_SIZE.

doe300 commented 6 years ago

[...] report their total memory size for CL_DEVICE_GLOBAL_MEM_SIZE and a quarter of that (for some odd reason) for CL_DEVICE_MAX_MEM_ALLOC_SIZE

Probably, because the OpenCL standard states:


CL_DEVICE_MAX_MEM_ALLOC_SIZE	cl_uint	The minimum value is max (1/4^th of CL_DEVICE_GLOBAL_MEM_SIZE , 128 * 1024 * 1024) [...]

Source: OpenCL 1.2 specification, table 4.3

Since VC4CL only supports the embedded profile, this criteria must hold:


CL_DEVICE_MAX_MEM_ALLOC_SIZE	cl_uint	The minimum value is max (1/4^th of CL_DEVICE_GLOBAL_MEM_SIZE , 1 * 1024 * 1024) [...]

Source: OpenCL 1.2 specification, table 10.2 (no table numbering)

beniz commented 5 years ago

@naibaf7 Hi, would you have a recipe to start from for someone to potentially give another try at it a year later ?

naibaf7 commented 5 years ago

@beniz Sure, give it a go and ask if you need to know something. First you'll have to set up a cross-compiling environment for ARMv7, maybe with docker, and set the cross-compiling flags in the CMAKE configuration. Then try to cross-compile for the raspberry pi and see what happens when you move the compiled Caffe version to the raspberry pi and execute it.