Encountering problems when running darknet on RPI3B+

Hi,

I tried to run darknet on RPI 3B+ with VC4CL. I can compile the code but encounteredNormalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75 when running the program. The darknet version is https://github.com/sowson/darknet. The console output is as follows:

Device IDs: 1
gpuserv: vc_gpuserv_init: starting initialisation
Device ID: 0
Device name: VideoCore IV GPU
Device vendor: Broadcom
Device opencl availability: OpenCL 1.2 VC4CL 0.4.9999 (1acb1b8)
Device opencl used: 0.4.9999
Device double precision: NO
Device max group size: 12
Device address bits: 32
activation_kernel_init
[VC4CL](        darknet): Precompiling source with:
Dumping program sources to /tmp/vc4cl-source-1804289383.cl
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-846930886.ll
[VC4CL](        darknet): Precompilation complete with status: 0
[VC4CL](        darknet): Compiling source with:
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-1681692777.bc
[VC4CL](        darknet): Compilation complete with status: 0
Dumping program binaries to /tmp/vc4cl-binary-1714636915.bin
blas_kernel_init
opencl_device_id_t==0
opencl_load_buffer
[VC4CL](        darknet): Precompiling source with:
Dumping program sources to /tmp/vc4cl-source-1957747793.cl
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-424238335.ll
[VC4CL](        darknet): Precompilation complete with status: 0
[VC4CL](        darknet): Compiling source with:
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-719885386.bc
[VC4CL](        darknet): Compilation complete with status: 0
Compilation log: [W] Wed Feb  2 04:11:42 2022: Register conflict resolver has exceeded its maximum rounds, there might still be errors!

Dumping program binaries to /tmp/vc4cl-binary-1102520059.bin
[VC4CL](        darknet): Precompiling source with:
Dumping program sources to /tmp/vc4cl-source-2044897763.cl
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-1967513926.ll
[VC4CL](        darknet): Precompilation complete with status: 0
[VC4CL](        darknet): Compiling source with:
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-1365180540.bc
[VC4CL](        darknet): Compilation complete with status: 0
Dumping program binaries to /tmp/vc4cl-binary-1540383426.bin
[VC4CL](        darknet): Precompiling source with:
Dumping program sources to /tmp/vc4cl-source-304089172.cl
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-1303455736.ll
[VC4CL](        darknet): Precompilation complete with status: 0
[VC4CL](        darknet): Compiling source with:
[VC4CL](        darknet): Dumping program IR to /tmp/vc4cl-ir-35005211.bc
[VC4CL](        darknet): Compilation error: Normalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75
[VC4CL](        darknet): Compilation complete with status: -11
Compilation log: Compilation error:
        Normalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75
[E] Wed Feb  2 04:12:21 2022: Failed to find memory area for local: (g) f32* %arrayidx29.sink75
[E] Wed Feb  2 04:12:21 2022:   Writer: (g) f32* %arrayidx29.sink75 = (g) f32* %arrayidx33 (ifzc )
[E] Wed Feb  2 04:12:21 2022:   Writer: (g) f32* %arrayidx29.sink75 = (g) f32* %arrayidx29 (ifz )
[E] Wed Feb  2 04:12:21 2022:   Writer: store f32 %add30 into (g) f32* %arrayidx29.sink75 (guarded)
[E] Wed Feb  2 04:12:21 2022:   Reader: f32 %tmp.24427 = load memory at (g) f32* %arrayidx29.sink75 (guarded)
[E] Wed Feb  2 04:12:22 2022:  (1) /usr/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::cha
r_traits<char>, std::allocator<char> > const&)+0xf4 [0x7624ea60]
[E] Wed Feb  2 04:12:22 2022:  (2) /usr/lib/libVC4CC.so.1.2 : +0xe51b14 [0x76610b14]
[E] Wed Feb  2 04:12:22 2022:  (3) /usr/lib/libVC4CC.so.1.2 : vc4c::normalization::determineMemoryAccess(vc4c::Method&)+0x25e0 [0x76614038]
[E] Wed Feb  2 04:12:22 2022:  (4) /usr/lib/libVC4CC.so.1.2 : vc4c::normalization::mapMemoryAccess(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)+0x30 [0x765fc7c4]
[E] Wed Feb  2 04:12:22 2022:  (5) /usr/lib/libVC4CC.so.1.2 : vc4c::normalization::Normalizer::normalizeMethod(vc4c::Module&, vc4c::Method&) const+0x430 [0x766480c0]
[E] Wed Feb  2 04:12:22 2022:  (6) /usr/lib/libVC4CC.so.1.2 : +0xe87cc4 [0x76646cc4]
[E] Wed Feb  2 04:12:22 2022:  (7) /usr/lib/libVC4CC.so.1.2 : +0xe8ab78 [0x76649b78]
[E] Wed Feb  2 04:12:22 2022:  (8) /usr/lib/libVC4CC.so.1.2 : std::function<void (vc4c::Method* const&)>::operator()(vc4c::Method* const&) const+0x54 [0x76257e74]
[E] Wed Feb  2 04:12:22 2022:  (9) /usr/lib/libVC4CC.so.1.2 : vc4c::ThreadPool::scheduleAll<vc4c::Method*, std::vector<vc4c::Method*, std::allocator<vc4c::Method*> > >(std::vector<vc4c::Method*, std::allocator<vc4c::Method*> > const&, std::function<void (
vc4c::Method* const&)> const&, logging::Logger*)::{lambda()#1}::operator()() const+0x2c [0x762564b0]
[E] Wed Feb  2 04:12:22 2022:  (10) /usr/lib/libVC4CC.so.1.2 : std::_Function_handler<void (), vc4c::ThreadPool::scheduleAll<vc4c::Method*, std::vector<vc4c::Method*, std::allocator<vc4c::Method*> > >(std::vector<vc4c::Method*, std::allocator<vc4c::Method
*> > const&, std::function<void (vc4c::Method* const&)> const&, logging::Logger*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)+0x24 [0x7625940c]
[E] Wed Feb  2 04:12:22 2022:  (11) /usr/lib/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x40 [0x762ded68]
[E] Wed Feb  2 04:12:22 2022:  (12) /usr/lib/libVC4CC.so.1.2 : +0xb1a8f0 [0x762d98f0]
[E] Wed Feb  2 04:12:22 2022:  (13) /usr/lib/libVC4CC.so.1.2 : +0xb1d270 [0x762dc270]
[E] Wed Feb  2 04:12:22 2022:  (14) /usr/lib/libVC4CC.so.1.2 : +0xb1d240 [0x762dc240]
[E] Wed Feb  2 04:12:22 2022:  (15) /usr/lib/libVC4CC.so.1.2 : +0xb1d1c4 [0x762dc1c4]
[E] Wed Feb  2 04:12:22 2022:  (16) /usr/lib/libVC4CC.so.1.2 : +0xb1d198 [0x762dc198]
[E] Wed Feb  2 04:12:22 2022:  (17) /usr/lib/libVC4CC.so.1.2 : +0xb1d0dc [0x762dc0dc]
[E] Wed Feb  2 04:12:22 2022:  (18) /usr/lib/libVC4CC.so.1.2 : +0xb1cf10 [0x762dbf10]
[E] Wed Feb  2 04:12:22 2022:  (19) /usr/lib/libVC4CC.so.1.2 : +0xb1cd5c [0x762dbd5c]
[E] Wed Feb  2 04:12:22 2022:  (20) /usr/lib/libVC4CC.so.1.2 : std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const+0x48 [0x762de2a4]
[E] Wed Feb  2 04:12:22 2022:  (21) /usr/lib/libVC4CC.so.1.2 : std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)+0x28 [0x762dd374]
[E] Wed Feb  2 04:12:22 2022:  (22) /usr/lib/libVC4CC.so.1.2 : void std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool
*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::u
nique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&$
 bool*&&)+0xb0 [0x762e0050]
[E] Wed Feb  2 04:12:22 2022:  (23) /usr/lib/libVC4CC.so.1.2 : std::result_of<void (std::__future_base::_State_baseV2::*&&(std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_ba
se::_Deleter> ()>*&&, bool*&&))(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>::type std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__
future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(void (std::__
future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base
, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)+0x64 [0x762df248]
[E] Wed Feb  2 04:12:22 2022:  (24) /usr/lib/libVC4CC.so.1.2 : std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__futu
re_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_ba
se::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}::o
perator()() const+0x70 [0x762ddb18]
[E] Wed Feb  2 04:12:22 2022:  (25) /usr/lib/libVC4CC.so.1.2 : std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__futu
re_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_ba
se::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#2}::o
perator()() const+0x30 [0x762ddbe4]
[E] Wed Feb  2 04:12:22 2022:  (26) /usr/lib/libVC4CC.so.1.2 : std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__futu
re_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_ba
se::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#2}::_
FUN()+0x10 [0x762ddc04]
[E] Wed Feb  2 04:12:22 2022:  (27) /lib/arm-linux-gnueabihf/libpthread.so.0 : +0xf158 [0x76eff158]
[E] Wed Feb  2 04:12:22 2022:  (27) /lib/arm-linux-gnueabihf/libpthread.so.0 : +0xf158 [0x76eff158]
[E] Wed Feb  2 04:12:22 2022:  (28) /usr/lib/libVC4CC.so.1.2 : +0xb1a34c [0x762d934c]
[E] Wed Feb  2 04:12:22 2022:  (29) /usr/lib/libVC4CC.so.1.2 : void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::_
_future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__futu
re_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)+0xac [0x762
ddcec]
[E] Wed Feb  2 04:12:22 2022:  (30) /usr/lib/libVC4CC.so.1.2 : std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)+0x94 [0x762dcd9c]
[E] Wed Feb  2 04:12:22 2022:  (31) /usr/lib/libVC4CC.so.1.2 : +0xb1c688 [0x762db688]
[E] Wed Feb  2 04:12:22 2022:  (32) /usr/lib/libVC4CC.so.1.2 : std::packaged_task<void ()>::operator()()+0x40 [0x762df15c]
[E] Wed Feb  2 04:12:22 2022:  (33) /usr/lib/libVC4CC.so.1.2 : vc4c::ThreadPool::workerTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x194 [0x762d9c5c]
[E] Wed Feb  2 04:12:22 2022:  (34) /usr/lib/libVC4CC.so.1.2 : +0xb1a644 [0x762d9644]
[E] Wed Feb  2 04:12:22 2022:  (35) /usr/lib/libVC4CC.so.1.2 : +0xb1cadc [0x762dbadc]
[E] Wed Feb  2 04:12:22 2022:  (36) /usr/lib/libVC4CC.so.1.2 : +0xb1c868 [0x762db868]
[E] Wed Feb  2 04:12:22 2022:  (37) /usr/lib/libVC4CC.so.1.2 : +0xb1c550 [0x762db550]
[E] Wed Feb  2 04:12:22 2022:  (38) /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 : +0x9d9b0 [0x76e2e9b0]
[E] Wed Feb  2 04:12:22 2022: Compiler threw exception: Normalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75

Dumping program binaries to /tmp/vc4cl-binary-294702567.bin
FATAL ERROR: CL_UNKNOWN_ERROR
opencl_load: could not compile. error: CL_UNKNOWN_ERROR
code:
Compilation error:
        Normalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75
[E] Wed Feb  2 04:12:21 2022: Failed to find memory area for local: (g) f32* %arrayidx29.sink75
[E] Wed Feb  2 04:12:21 2022:   Writer: (g) f32* %arrayidx29.sink75 = (g) f32* %arrayidx33 (ifzc )
[E] Wed Feb  2 04:12:21 2022:   Writer: (g) f32* %arrayidx29.sink75 = (g) f32* %arrayidx29 (ifz )
[E] Wed Feb  2 04:12:21 2022:   Writer: store f32 %add30 into (g) f32* %arrayidx29.sink75 (guarded)
[E] Wed Feb  2 04:12:21 2022:   Reader: f32 %tmp.24427 = load memory at (g) f32* %arrayidx29.sink75 (guarded)
[E] Wed Feb  2 04:12:22 2022:  (1) /usr/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::cha
r_traits<char>, std::allocator<char> > const&)+0xf4 [0x7624ea60]
[E] Wed Feb  2 04:12:22 2022:  (2) /usr/lib/libVC4CC.so.1.2 : +0xe51b14 [0x76610b14]
[E] Wed Feb  2 04:12:22 2022:  (3) /usr/lib/libVC4CC.so.1.2 : vc4c::normalization::determineMemoryAccess(vc4c::Method&)+0x25e0 [0x76614038]
[E] Wed Feb  2 04:12:22 2022:  (4) /usr/lib/libVC4CC.so.1.2 : vc4c::normalization::mapMemoryAccess(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)+0x30 [0x765fc7c4]
[E] Wed Feb  2 04:12:22 2022:  (5) /usr/lib/libVC4CC.so.1.2 : vc4c::normalization::Normalizer::normalizeMethod(vc4c::Module&, vc4c::Method&) const+0x430 [0x766480c0]
[E] Wed Feb  2 04:12:22 2022:  (6) /usr/lib/libVC4CC.so.1.2 : +0xe87cc4 [0x76646cc4]
[E] Wed Feb  2 04:12:22 2022:  (7) /usr/lib/libVC4CC.so.1.2 : +0xe8ab78 [0x76649b78]
[E] Wed Feb  2 04:12:22 2022:  (8) /usr/lib/libVC4CC.so.1.2 : std::function<void (vc4c::Method* const&)>::operator()(vc4c::Method* const&) const+0x54 [0x76257e74]
[E] Wed Feb  2 04:12:22 2022:  (9) /usr/lib/libVC4CC.so.1.2 : vc4c::ThreadPool::scheduleAll<vc4c::Method*, std::vector<vc4c::Method*, std::allocator<vc4c::Method*> > >(std::vector<vc4c::Method*, std::allocator<vc4c::Method*> > const&, std::function<void (
vc4c::Method* const&)> const&, logging::Logger*)::{lambda()#1}::operator()() const+0x2c [0x762564b0]
[E] Wed Feb  2 04:12:22 2022:  (10) /usr/lib/libVC4CC.so.1.2 : std::_Function_handler<void (), vc4c::ThreadPool::scheduleAll<vc4c::Method*, std::vector<vc4c::Method*, std::allocator<vc4c::Method*> > >(std::vector<vc4c::Method*, std::allocator<vc4c::Method
*> > const&, std::function<void (vc4c::Method* const&)> const&, logging::Logger*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)+0x24 [0x7625940c]
[E] Wed Feb  2 04:12:22 2022:  (11) /usr/lib/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x40 [0x762ded68]
[E] Wed Feb  2 04:12:22 2022:  (12) /usr/lib/libVC4CC.so.1.2 : +0xb1a8f0 [0x762d98f0]
[E] Wed Feb  2 04:12:22 2022:  (13) /usr/lib/libVC4CC.so.1.2 : +0xb1d270 [0x762dc270]
[E] Wed Feb  2 04:12:22 2022:  (14) /usr/lib/libVC4CC.so.1.2 : +0xb1d240 [0x762dc240]
[E] Wed Feb  2 04:12:22 2022:  (15) /usr/lib/libVC4CC.so.1.2 : +0xb1d1c4 [0x762dc1c4]
[E] Wed Feb  2 04:12:22 2022:  (16) /usr/lib/libVC4CC.so.1.2 : +0xb1d198 [0x762dc198]
[E] Wed Feb  2 04:12:22 2022:  (17) /usr/lib/libVC4CC.so.1.2 : +0xb1d0dc [0x762dc0dc]
[E] Wed Feb  2 04:12:22 2022:  (18) /usr/lib/libVC4CC.so.1.2 : +0xb1cf10 [0x762dbf10]
[E] Wed Feb  2 04:12:22 2022:  (19) /usr/lib/libVC4CC.so.1.2 : +0xb1cd5c [0x762dbd5c]
[E] Wed Feb  2 04:12:22 2022:  (20) /usr/lib/libVC4CC.so.1.2 : std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const+0x48 [0x762de2a4]
[E] Wed Feb  2 04:12:22 2022:  (21) /usr/lib/libVC4CC.so.1.2 : std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)+0x28 [0x762dd374]
[E] Wed Feb  2 04:12:22 2022:  (22) /usr/lib/libVC4CC.so.1.2 : void std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)+0xb0 [0x762e0050]
[E] Wed Feb  2 04:12:22 2022:  (23) /usr/lib/libVC4CC.so.1.2 : std::result_of<void (std::__future_base::_State_baseV2::*&&(std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&))(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>::type std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)+0x64 [0x762df248]
[E] Wed Feb  2 04:12:22 2022:  (24) /usr/lib/libVC4CC.so.1.2 : std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}::operator()() const+0x70 [0x762ddb18]
[E] Wed Feb  2 04:12:22 2022:  (25) /usr/lib/libVC4CC.so.1.2 : std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#2}::operator()() const+0x30 [0x762ddbe4]
[E] Wed Feb  2 04:12:22 2022:  (26) /usr/lib/libVC4CC.so.1.2 : std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#2}::_FUN()+0x10 [0x762ddc04]
[E] Wed Feb  2 04:12:22 2022:  (27) /lib/arm-linux-gnueabihf/libpthread.so.0 : +0xf158 [0x76eff158]
[E] Wed Feb  2 04:12:22 2022:  (28) /usr/lib/libVC4CC.so.1.2 : +0xb1a34c [0x762d934c]
[E] Wed Feb  2 04:12:22 2022:  (29) /usr/lib/libVC4CC.so.1.2 : void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)+0xac [0x762ddcec]
[E] Wed Feb  2 04:12:22 2022:  (30) /usr/lib/libVC4CC.so.1.2 : std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)+0x94 [0x762dcd9c]
[E] Wed Feb  2 04:12:22 2022:  (31) /usr/lib/libVC4CC.so.1.2 : +0xb1c688 [0x762db688]
[E] Wed Feb  2 04:12:22 2022:  (32) /usr/lib/libVC4CC.so.1.2 : std::packaged_task<void ()>::operator()()+0x40 [0x762df15c]
[E] Wed Feb  2 04:12:22 2022:  (33) /usr/lib/libVC4CC.so.1.2 : vc4c::ThreadPool::workerTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x194 [0x762d9c5c]
[E] Wed Feb  2 04:12:22 2022:  (34) /usr/lib/libVC4CC.so.1.2 : +0xb1a644 [0x762d9644]
[E] Wed Feb  2 04:12:22 2022:  (35) /usr/lib/libVC4CC.so.1.2 : +0xb1cadc [0x762dbadc]
[E] Wed Feb  2 04:12:22 2022:  (36) /usr/lib/libVC4CC.so.1.2 : +0xb1c868 [0x762db868]
[E] Wed Feb  2 04:12:22 2022:  (37) /usr/lib/libVC4CC.so.1.2 : +0xb1c550 [0x762db550]
[E] Wed Feb  2 04:12:22 2022:  (38) /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 : +0x9d9b0 [0x76e2e9b0]
[E] Wed Feb  2 04:12:22 2022: Compiler threw exception: Normalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75

I found that the code to cause this problem is the function called opencl_load_buffer, which I list below:

void opencl_load_buffer(const char *buffer, const size_t size, cl_program *output)
{
    cl_int clErr;

    *output = clCreateProgramWithSource(opencl_context, 1,
                                        (const char**)&buffer, &size, &clErr);

    if (clErr != CL_SUCCESS)
    {
        printf("opencl_load: could not create program. error: %s\n", clCheckError(clErr));
        exit(-1);
    }

#ifdef ARM
    clErr = clBuildProgram(
            *output,
            1,
            &opencl_devices[opencl_device_id_t],
            NULL, NULL, NULL);
#else
    clErr = clBuildProgram(
            *output,
            1,
            &opencl_devices[opencl_device_id_t],
            "-Werror "
            "-cl-std=CL1.2 "
            "-cl-opt-disable "
          //"-cl-denorms-are-zero "
          //"-cl-fp32-correctly-rounded-divide-sqrt "
            "-cl-no-signed-zeros "
            "-cl-mad-enable "
          //"-cl-fast-relaxed-math "
            , NULL, NULL);
#endif

    if (clErr != CL_SUCCESS)
    {
        printf("opencl_load: could not compile. error: %s\n", clCheckError(clErr));
        size_t len;
        char *ebuffer = (char*)calloc(0x10000000, sizeof(char));
        clGetProgramBuildInfo(*output, opencl_devices[opencl_device_id_t], CL_PROGRAM_BUILD_LOG, 0x10000000 * sizeof(char), ebuffer, &len);
        printf("code:\n%s\n", ebuffer);
        free(ebuffer);
        exit(-1);
    }
}

Do you know how can I fix this problem? Is it because the function invokes some unimplemented library functions?

Thanks!

This looks like a VC4CC compiler bug/missing support. Can you post the offending OpenCL C source code? E.g. in the above log you see lines like /tmp/vc4cl-source-<some random number>.cl, can you post the last /tmp/vc4cl-source-xyz.cl and /tmp/vc4cl-ir-abc.bc files dumped before the error is given?

Also, since you are using a source build of VC4CC, what is your LLVM/clang version you built VC4CC against? This can be queried e.g. by <path/to/>vc4c --version

Thanks for your quick reply! I will post the two files as follows.

I cat the /tmp/vc4cl-source-xyz.cl:

__kernel void weighted_delta_kernel(int n, __global float *a, __global float *b, __global float *s, __global float *da, __global float *db, __global float *ds, __global float *dc) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if(i < n){ if(da) da[i] += dc[i] * s[i]; db[i] += dc[i] * (1-s[i]); ds[i] += dc[i] * a[i] + dc[i] * -b[i]; } } __kernel void mult_add_into_kernel(int n, __global float *a, __global float *b, __global float *c) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if(i < n){ c[i] += a[i]*b[i]; } } __kernel void deinter_kernel(int NX, __global float *X, int NY, __global float *Y, int B, __global float *OUT) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if(i < (NX+NY)*B){ int b = i / (NX+NY); int j = i % (NX+NY); if (j < NX){ if(X) X[b*NX + j] += OUT[i]; } else { if(Y) Y[b*NY + j - NX] += OUT[i]; } } } __kernel void inter_kernel(int NX, __global float *X, int NY, __global float *Y, int B, __global float *OUT) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if(i < (NX+NY)*B){ int b = i / (NX+NY); int j = i % (NX+NY); if (j < NX){ OUT[i] = X[b*NX + j]; } else { OUT[i] = Y[b*NY + j - NX]; } } } __kernel void softmax_device(__global float *input, int n, float temp, int stride, __global float *output) { int i; float sum = 0; float largest = -FLT_MAX; for(i = 0; i < n; ++i){ int val = input[i*stride]; largest = (val>largest) ? val : largest; } for(i = 0; i < n; ++i){ float e = exp(input[i*stride]/temp - largest/temp); sum += e; output[i*stride] = e; } for(i = 0; i < n; ++i){ output[i*stride] /= sum; } } __kernel void softmax_kernel(__global float *input, int offset, int n, int batch, int batch_offset, int groups, int group_offset, int stride, float temp, __global float *output) { int id = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if (id >= batch*groups) return; int b = id / groups; int g = id % groups; softmax_device(input + b*batch_offset + g*group_offset + offset, n, temp, stride, output + b*batch_offset + g*group_offset + offset); } __kernel void softmax_tree_kernel(__global float *input, int offset, int index, int spatial, int batch, int stride, float temp, __global float *output, int groups, __global float *group_size, __global float *group_offset) { int id = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if (id >= spatial*batch*groups) return; int s = id % spatial; id = id / spatial; int g = id % groups; int b = id / groups; int goff = group_offset[g]*spatial; int boff = b*stride; softmax_device(input + offset + goff + boff + s, group_size[g], temp, spatial, output + offset + goff + boff + s); } __kernel void scale_mask_kernel(int n, __global float *x, float mask_num, __global float *mask, float scale) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if(i < n && mask[i] == mask_num) x[i] *= scale; } __kernel void dot_kernel(__global float *output, float scale, int batch, int n, int size, __global float *delta) { int index = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); int f1 = index / n; int f2 = index % n; if (f2 <= f1) return; float sum = 0; float norm1 = 0; float norm2 = 0; int b, i; for(b = 0; b < batch; ++b){ for(i = 0; i < size; ++i){ int i1 = b * size * n + f1 * size + i; int i2 = b * size * n + f2 * size + i; sum += output[i1] * output[i2]; norm1 += output[i1] * output[i1]; norm2 += output[i2] * output[i2]; } } norm1 = sqrt(fabs(norm1)); norm2 = sqrt(fabs(norm2)); float norm = norm1 * norm2; sum = sum / norm; for(b = 0; b < batch; ++b){ for(i = 0; i < size; ++i){ int i1 = b * size * n + f1 * size + i; int i2 = b * size * n + f2 * size + i; delta[i1] += - scale * sum * output[i2] / norm; delta[i2] += - scale * sum * output[i1] / norm; } } } __kernel void upsample_kernel(int N, __global float *x, int w, int h, int c, int batch, int stride, int forward, float scale, __global float *out) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if(i >= N) return; int out_index = i; int out_w = i%(w*stride); i = i/(w*stride); int out_h = i%(h*stride); i = i/(h*stride); int out_c = i%c; i = i/c; int b = i%batch; int in_w = out_w / stride; int in_h = out_h / stride; int in_c = out_c; int in_index = b*w*h*c + in_c*w*h + in_h*w + in_w; if(forward) out[out_index] += scale * x[in_index]; else x[in_index] += scale * out[out_index]; } __kernel void gemm_kernel( int tuning, __local float* sums, int TA, int TB, int M, int N, int K, __const float ALPHA, __global float *A, int offset_A, int lda, __global float *B, int offset_B, int ldb, __const float BETA, __global float *C, int offset_C, int ldc) { int td = get_global_id(0); if (td > tuning) return; int id = get_global_id(1); if (id > N*M) return; int iM = id / N; int jN = id % N; int kK = 0; int ts = 0; C[iM * ldc + jN + offset_C] *= BETA; sums[td] = 0; for(kK = td; kK < K; kK += tuning) { if (TA==0 && TB==0) { sums[td] += ALPHA * A[iM * lda + kK + offset_A] * B[kK * ldb + jN + offset_B]; } else if (TA==1 && TB==0) { sums[td] += ALPHA * A[kK * lda + iM + offset_A] * B[kK * ldb + jN + offset_B]; } else if (TA==0 && TB==1) { sums[td] += ALPHA * A[iM * lda + kK + offset_A] * B[jN * ldb + kK + offset_B]; } else { sums[td] += ALPHA * A[iM + kK * lda + offset_A] * B[kK + jN * ldb + offset_B]; } } barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE); if (td == 0) { for(ts = 0; ts < tuning; ++ts) { if (TA==0 && TB==0) { C[iM * ldc + jN + offset_C] += sums[ts]; } else if (TA==1 && TB==0) { C[iM * ldc + jN + offset_C] += sums[ts]; } else if (TA==0 && TB==1) { C[iM * ldc + jN + offset_C] += sums[ts]; } else { C[iM * ldc + jN + offset_C] += sums[ts]; } } } } __kernel void scal_add_kernel(int N, float ALPHA, float BETA, __global float *X, int OFFX, int INCX) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if (i < N) { X[i*INCX + OFFX] *= ALPHA; X[i*INCX + OFFX] += BETA; } } __kernel void mean_array_kernel(int N, float alpha, __global float *s, __global float *a) { int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0); if (i >= N) return; a[i] *= (1 - alpha) + s[i]; a[i] *= alpha; s[i] = a[i]; }

As you see, the dumped code lies in one line and is difficult to read. Is it right? Or how can I make the file more readable?

Besides, I tried to cat /tmp/vc4cl-ir-abc.bc, but what I got is completely unreadable. It's like

 #6667:23Q4\F                                                                                                                                                                                                                                             [5/5]

0!!<;;=<C8a F,40F 0"GȐQdadbdsdc111
 0
  ((bba9D>?#u`P3bPP7    e`I#u`PT 0bPP7  eS      MAAUB
A
 7@
   lApYa L,A0 T¨!WB0*`؀% *H`RA0

@ x¨`a"alA0a4(F4(F
                  0F7 ?D"
                        `R
E
 ]D!ep ?DcK
E
 W[U

    >
     "
      %@2\xn;<C0ԂېD!U&D0
XQ                      %
  C]4r
      NL!R[P,,, `HOH
                    2`PaCODؓD!UU0RcLN850Ժ,T0RW O4
                                                1[?@.<Q2@A
U[+h-a
      `me
b
 62A[`A NC<
Q@#u`P2bPP71    e1#u`PR 0bPP7   e5%2bPP7        e=%04hF@[a@
                                                           0l@
                                                             Vc)`*A`+`E

                                                                       C0
                                                                         HOXK!RP,,D3E   H`.,m8  E
<>20A                                                                                            n,,$@2\01      [L47ۃ
Q[)ha G,4#u`P2bPP7      e0=#u`PVS 0bPP7 eA%2bPP7        eI%`4"00h0AU@HA
                                                                       7D0`E`0`
                                                                               )@vh"`[aT10K`
                                                                                            7p2Ea(=` AT (F
                                                                                                          0>CðD!eQ`pHOR
                                                                                                                       m@tMm@tM}RP]S
                                                                                                                                    _AtM[JJ [ˀ
P                                                                                                                                             X
 n;ÀTQ`pO!0O]Ʀ"
               0 Q)@2\n9C-,,,8b2C-fPa&T0R6TPfTPvLN8K>H,g O4
1
 [
  8n<:2TA
[[`PA ha gƈAA$AMAHΈAA$AT HMApO  MApQ    X
                                        7
A                                        2EP@RF
 fX"P0K@
        T
         r0lA
@YÁ6vCð]>
         6}RA4' ?DSۖĶ2nkDFKHX 0
                              3

How should I transfer the code?

As for the LLVM/Clang version, what do you mean by <path/to/>vc4c? Is it the build dir that I make during compiling VC4C? I tried to check the clang version by clang --version and got

clang version 7.0.1-8+rpi3+deb10u2 (tags/RELEASE_701/final)
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin

The /tmp/vc4cl-ir-abc.bc is a binary file, it's a raw LLVM module, you should be able to attach files to a GitHub comment, see the text below the comment input box. Yes, the vc4c executable lies in the VC4C build directory, more precisely <VC4C build dir>/src/vc4c.

I'm not sure whether I have time today to debug this issue though.

The LLVM/Clang version after typing vc4c --version is as follows:

Running VC4C in version: 0.4.9999 (unknown)
Build configuration: debug mode; builtin SPIR-V front-end; LLVM library front-end with libLLVM 7; vc4asm verification

Standard library location:
    header in /home/pi/opencl/VC4C/../VC4CLStdLib/include/defines.h
    PCH in /home/pi/opencl/VC4C/../VC4CLStdLib/include/VC4CLStdLib.h.pch
    LLVM module in /home/pi/opencl/VC4C/../VC4CLStdLib/include/VC4CLStdLib.bc
    SPIR-V module in
Tool locations:
    clang in /usr/bin/clang (default)
    llvm-spirv not in $PATH
    llvm-link in /usr/bin/llvm-link-7 (default)
    llvm-dis in /usr/bin/llvm-dis-7 (default)
    llvm-as in /usr/bin/llvm-as-7 (default)
    spirv-link not in $PATH

I have packed the bc and cl file into a zip file because the file format is not permitted by github. archive.zip

It's okay you may have no time today. You can help me with this problem when you are available. Thanks!

hi @doe300 , I saw you made some commits. Is that mean the problem is solved?

With the changes (not yet on master branch, but on devel) I got your OpenCL C kernel to build. I will test them on my setup and then hopefully push them on master within the next few days. If you want to test it earlier, you'll only need these 8 lines from https://github.com/doe300/VC4C/commit/90d16ae1757223d028d9c9e300adf955029c2ab2#diff-e956b9578001223db5e36f75faa7ec5fe604f044c3219a2a1e6795cc79313208R259.

I have tested the 8 lines. The memory error disappeared, but I got another problem: the program stucks at that point. I'm not sure if it is an infinite loop or other things. I have attached the .cl and .bc file into this comment archive.zip .

Can you check what threads are running (and how "active" are they) when the program is "stuck"? This can be best checked e.g. with the htop tool, when enabling thread names in the settings (F2 → Display options → Show custom thread names).

I think the thread name is Normalization. I can see there are four Normalization threads when the function begins, later there is only one Normalization remained and the program is stuck. I can also see that the last Normalization occupies 100% CPU, which means it may drop in an infinite loop?

I also observed the change of htop and I guess the procedure is like this. First, there are four concurrent Normalization functions, each of them occupies one CPU. I can see all the four CPUs are 100% occupied. Later, three Normalization functions terminated, only one remained. I can see one CPU is 100% occupied at this time. In the picture above, for example, it is CPU2.

BTW I can also see other threads, such as VC4CL Queue Han, alsa-sink-bcm28, but they do not occupy CPU when the program Is stuck.

Which command are you using? I am currently trying the example from https://pjreddie.com/darknet/yolo/ and it seems to compile just fine, also I do run in a SEGFAULT afterwards...

The darknet is this one https://github.com/sowson/darknet. The command to compile this darknet follows https://iblog.isowa.io/2020/04/29/darknet-in-opencl-on-beagleboard-ai/ :

git clone git clone https://github.com/sowson/darknet.git
cd darknet
vi CMakeList.txt # or nano or joe or vim… set DARKNET_ARM=ON
vi src/opencl.h # add in ARM definition: #define CL_TARGET_OPENCL_VERSION 220
mkdir build
cd build
cmake ..
make

Then you'd better edit cfg/cifar.cfg to reduce the batch_size. Change the 2nd line of cfg/cifar.cfg from

batch=128

batch=4

Then in the build directory, the command to run is ./darknet cifar test ../cfg/cifar.cfg

Seems I can't reproduce your issue on my setup. On my Raspberry Pi it compiles for several minutes, then prints some layer information and fails with being unable to open a file.

I am currently running a large set of regression tests, if none of these fail, I will merge the devel branch into master (probably around tomorrow), maybe that fixes the issue.

That's weird. Do you use RPI 3B+ too? I think printing some layer information is good and it means you didn't encounter my problem.

Also, did you made the changes on darknet? eg set DARKNET_ARM=ON and add in ARM definition: #define CL_TARGET_OPENCL_VERSION 220

I tested on a new system and it worked! But I feel that the setup phase is pretty slow, I think this phase is what you said ''compiles for several minutes''? I was wondering why it needs to compile every time I start the program?

Yes, the compilation takes some time, which is a combination of probably running in Debug build, the compiler itself not being too fast and the weak hardware.

I was wondering why it needs to compile every time I start the program?

Since neither the application (darknet) nor the OpenCL implementation (VC4CL) caches the compiled binaries. I thought about that, but since VC4CL still in active development, it didn't make too much sense for me at least and would just introduces consistency issues.

I do not quite understand what you mean by Debug build. Do you mean VC4CL or darknet are compiled in debug mode? I checked the configuration and found that neither of them is built in debug mode.

BTW I was wondering have you tested the computation speed between VC4CL and CPU. For example, for matrix multiplication, how much is VC4CL faster than pure CPU?

VC4C(L), if not set otherwise explicitly (e.g. by specifying the -DCMAKE_BUILD_TYPE=Release CMake flag, they compile as debug builds without optimizations enabled.

BTW I was wondering have you tested the computation speed between VC4CL and CPU. For example, for matrix multiplication, how much is VC4CL faster than pure CPU?

No, not extensively, I do have very few comparisons where the results differ greatly (sometimes CPU is faster, sometimes VC4CL). I expect that it will highly depend on the code being executed (e.g. how easily it can be parallelized, how much memory it accesses, etc.).

Today I went deeper into the project but I got another problem. I prepared a code base at https://github.com/ziqi-zhang/darknet-vc4cl so that you will not encounter the missing file problem. To build this repo, you can directly run mkdir build && cd build && cmake .. && make. The command to run in the build dir is ./darknet cifar test ../cfg/cifar.cfg.

One problem is that I got FATAL ERROR: CL_INVALID_WORK_GROUP_SIZE and opencl gemm_kernel error: CL_INVALID_WORK_GROUP_SIZE. The function gemm_offset_gpu invokes opencl_kernel_local. When opencl_kernel_local invokes clEnqueueNDRangeKernel, it seems the returned value is an error.

I checked the passed arguments of clEnqueueNDRangeKernel. The globalOffset is 0, 0, globalItems is 32, 100352, localItems is 32, 1. Do you have any idea why these sizes are invalid?

Another problem I occasionally encountered is erroneous register. This error does not raise every time, but only once in a while. I captured some error log as follows:

FATAL ERROR: CL_UNKNOWN_ERROR
opencl_load: could not compile. error: CL_UNKNOWN_ERROR
code:
Compilation error:
        Label/Register Mapping: There are erroneous register-associations!
[W] Sat Feb  5 01:34:24 2022: Register conflict resolver has exceeded its maximum rounds, there might still be errors!
[E] Sat Feb  5 01:35:04 2022: Error assigning local to register: %group_id_y
[E] Sat Feb  5 01:35:04 2022:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0x761ee894]
[E] Sat Feb  5 01:35:04 2022:  (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::qpu_asm::GraphColoring::toRegisterMap() const+0x148 [0x7640cb10]

Do you have any idea how to avoid such error?

Besides, for the CL compilation time, I was wondering do you have any suggestion to walk around it? Compile the CL code is pretty time-consuming and I have to wait for minutes even if I made a small modification to the code...

localItems is 32, 1

There is your problem. VC4CL only supports a local size of up to 12, since the VC4 GPU has only 12 cores. Although OpenCL clients should check what the implementation supports, a lot of applications don't do that and assume some minimum value and thus don't work on VC4CL. Depending on how the kernel is written, it might work without specifying a fixed local size, so that VC4CL can decide on one...

Do you have any idea how to avoid such error?

Sadly no, not at the moment at least. The register allocation of VC4C is non-deterministic, so if a kernel has a certain complexity, then register allocation will succeed sometimes and fail other times...

Besides, for the CL compilation time, I was wondering do you have any suggestion to walk around it? Compile the CL code is pretty time-consuming and I have to wait for minutes even if I made a small modification to the code...

If darknet compiles all kernels from the same place, then maybe adding caching there would be the easiest solution. E.g. create a hash of kernel source code + options, check in some folder if a file with that hash exists. If not, compile and on success store with that hash in the folder.

Thanks! I reduced the local size and it worked. But I have another problem that the program gets stuck at some CL interface functions.

The first place is pull_network_output, which pulls array from opencl. And the program stuck at this clEnqueueReadBuffer.

After I commented out the pull_network_output function, there is another place to stuck. When the program is about to finish, the function clFinish stuck.

Do you have any idea how to solve these problems?

Looks like some event gets stuck/takes long and the successive events have to wait. Depending on how the kernel is executed, this could mean that the kernel execution hangs.

If you run the program with the VC4CL_DEBUG=events environment variable set and post the standard output, maybe I could spot the issue. Also, how long did you leave it running?

Thanks to your suggestion, I changed the code so that the kernel code is cached. Loading and building the cached program is much faster than compiling the source. Now I don't need to wait for a long time to debug the program.

I ran the program with VC4CL_DEBUG=events. When the program is stuck, the console output is

[VC4CL](VC4CL Queue Han): Executing event action: run kernel 0x0x920778 (im2col_gpu_kernel) with 2621 instructions, 12 parameters and 12, 1, 1 (3072, 1, 1) work-items

Then I wait for several minutes and it outputs another line:

[VC4CL](VC4CL Queue Han): Executing event action: run kernel 0x0x900008 (gemm_kernel) with 1314 instructions, 18 parameters and 8, 1, 1 (8, 131072, 1) work-items

Then I wait for 20+ minutes, it does not output any other things.

I also tried other codes. It seems that whenever I call clEnqueueReadBuffer, the program gets stuck. The console outputs with VC4CL_DEBUG=events are the same: first Executing event action: im2col_gpu_kernel, after waiting for several minutes I got Executing event action: gemm_kernel, then no output anymore.

im2col_gpu_kernel and gemm_kernel are compiled and built during program initialization and seem not relevant to clEnqueueReadBuffer. Do you have any idea?

So the first kernel seems to take very long for its 3072/12 = 256 work-groups. Unless a single work-item does a huge amount of calculations, they should probably finish in less than a minute.

The second kernel runs 131072 work-groups, which definitively takes some time. I don't know whether 20+ minutes is okay or not...

im2col_gpu_kernel and gemm_kernel are compiled and built during program initialization and seem not relevant to clEnqueueReadBuffer. Do you have any idea?

Without having looked at the darknet code in detail, I would assume it schedules the kernels for execution and then sets their events as dependencies for the clEnqueueReadBuffer, which then blocks until the kernels finish. This is presumably why this function seems to take so long (while in fact it waits for the kernels to finish).

To check whether the kernel executions finish successfully or time out (in which case the GPU gets into some weird state), you could also enable the execution debug flag, i.e. with VC4CL_DEBUG=events,execution. This might dump a lot of info to the standard output, but the important line would be after the kernel executes/hangs for a few minutes, there should be a "Execution: successful" or "Execution: failed" line which tells whether the kernel execution succeeded (just on a firmware level, not necessarily with correct data) or not (i.e. the execution took too long and was aborted). Also a few lines above you will see the calculated timeout before the kernel execution is considered hung and will be aborted.

I suspect that the code generated for these kernels is somehow either wrong and hangs the QPUs or is far too slow and thus the kernel executions time out. If the QPUs are hung, then everything after that will also just hang until they are reset...

I run the program with VC4CL_DEBUG=events,execution flag, but still did get no idea. As far as I observed. There is only Execution: successful, there is no Execution: failed. The program has several layers and the code for each layer is basically the same. I observed that the program successfully process the first two layers, and got stuck at later layers. I attached the two log files here log.log and log0.log.

However, when I run the code without VC4CL_DEBUG=events,execution flag, the program successfully proceeds to the last layer (which is the first time to call clEnqueueReadBuffer). I checked the return value of the CL invocation functions and they are all CL_SUCCESS. I was wondering is it because the computation on the VPU is asynchronous with the CPU program?

I ran the code again with a simple network with only one layer and record the log. As I expected the code also got stuck. Here is the log.log. Still I didn't see any Execution: failed but there are two Execution: successful.

So I had another quick look at the last log you provided (similar values can also be seen in the first log). Given that the duration in line 694 looks suspiciously close (~99.99%) to the timeout in line 692, it might be that the execution times out internally, but it is somehow still reported as success...

[VC4CL](VC4CL Queue Han): Running work-group 0, 0, 0 with a timeout of 256000000 us
[VC4CL](        darknet): Setting kernel-argument 17 to scalar 1024
[VC4CL](VC4CL Queue Han): Execution: successful after 255991535 us

If that is the case that most likely means that the VC4C compiler generates some wrong code which causes the QPUs to run inifinte/hang.

@doe300 hi, I also always encounter gnutls_handshake() failed: The TLS connection was non-properly terminated. error after typing cmake .. in VC4C.

pi@raspberrypi:~/opencl/VC4C2/build $ cmake ..
CMake Deprecation Warning at CMakeLists.txt:4 (cmake_policy):
  The OLD behavior for policy CMP0026 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- VC4CL standard library headers found: /home/pi/opencl/VC4C2/../VC4CLStdLib/include/
-- Using CMake 3.14+ FetchContent to include dependencies...
[ 11%] Performing update step for 'spirv-headers-populate'
fatal: unable to access 'https://github.com/KhronosGroup/SPIRV-Headers.git/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
CMake Error at /home/pi/opencl/VC4C2/build/_deps/spirv-headers-subbuild/spirv-headers-populate-prefix/tmp/spirv-headers-populate-gitupdate.cmake:55 (message):
  Failed to fetch repository
  'https://github.com/KhronosGroup/SPIRV-Headers.git'

make[2]: *** [CMakeFiles/spirv-headers-populate.dir/build.make:97: spirv-headers-populate-prefix/src/spirv-headers-populate-stamp/spirv-headers-populate-update] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/spirv-headers-populate.dir/all] Error 2
make: *** [Makefile:84: all] Error 2

CMake Error at /usr/share/cmake-3.16/Modules/FetchContent.cmake:915 (message):
  Build step for spirv-headers failed: 2
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FetchContent.cmake:1006 (__FetchContent_directPopulate)
  /usr/share/cmake-3.16/Modules/FetchContent.cmake:1047 (FetchContent_Populate)
  cmake/spirv-headers.cmake:9 (FetchContent_MakeAvailable)
  CMakeLists.txt:110 (include)

-- Configuring incomplete, errors occurred!
See also "/home/pi/opencl/VC4C2/build/CMakeFiles/CMakeOutput.log".

It seems that sometimes I can not fetch the dependency libraries. However, I can get the libraries by git clone https://github.com/KhronosGroup/SPIRV-Headers.git/.

So I'd like to manually clone the libraries. What changes should I make so that I can disable the fetch function in cmake and manually clone the libraries? I guess I should put the gloned libraries under build/_dep?

The only thing that I can think of right now is to download the git project as an archive (e.g. via GitHub website) and then replace the line https://github.com/doe300/VC4C/blob/master/cmake/spirv-headers.cmake#L8 with something like

    FetchContent_Declare(SPIRV-Headers URL <path/to/project/archive.zip>)

hi @doe300 , I have solved the cache problem. As for the hanging problem, how can I find the wrong code? Can I upload generated code here and can you help me to look at it?

As for the hanging problem, how can I find the wrong code? Can I upload generated code here and can you help me to look at it?

That is very tricky, especially for rather complex code as I would assume darknet produces. If you give me the generated binary code and the VC4CL debug logs, I can try to run it in the software emulator with some default parameter values and hope that the reason for the hang/long execution time becomes obvious. If the behaviour depends on the actual input buffers etc., then it becomes a lot more complicated...

doe300 / VC4CL

Encountering problems when running darknet on RPI3B+ #106