doe300 / VC4CL

OpenCL implementation running on the VideoCore IV GPU of the Raspberry Pi models
MIT License
728 stars 80 forks source link

Optimizer error: There should be no more function calls - convert_float_rtef () #24

Open fangq opened 6 years ago

fangq commented 6 years ago

The cl kernel I am trying to compile is mcxcl - the code download URL and test commands can be found in Issue #19.

There were two previous errors (precompilation error - due to the missing of CLANG_PATH, fixed in https://github.com/doe300/VC4C/issues/34, and an compilation error - due to the use of 64bit data structures, see https://github.com/doe300/VC4CL/issues/19#issuecomment-367266372).

To avoid using the 64bit data structures (ulong and double), one can append -J "-DUSE_LL5_RAND" flag in the command line. However, I am now encountering a new error, in the Optimization stage, error log is listed below:

root@raspberrypi:/home/pi/temp/mcxcl/example/benchmark# ./run_benchmark1.sh -n 1e5 -J "-DUSE_LL5_RAND"
...
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL  -DUSE_LL5_RAND
Kernel build log:
[W] Fri Feb 23 16:27:23 2018: Warnings in precompilation:
[W] Fri Feb 23 16:27:23 2018: <stdin>:1039:1: warning: null character ignored
<U+0000>
^
1 warning generated.

[E] Fri Feb 23 16:27:41 2018:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf4 [0xb662977c]
[E] Fri Feb 23 16:27:41 2018:  (2) /usr/local/lib/libVC4CC.so.1.2 : +0x520e78 [0xb685de78]
[E] Fri Feb 23 16:27:41 2018:  (3) /usr/local/lib/libVC4CC.so.1.2 : std::_Function_handler<vc4c::InstructionWalker (vc4c::Module const&, vc4c::Method&, vc4c::InstructionWalker, vc4c::Configuration const&), vc4c::InstructionWalker (*)(vc4c::Module const&, vc4c::Method&, vc4c::InstructionWalker, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::InstructionWalker&&, vc4c::Configuration const&)+0x7c [0xb6865048]
[E] Fri Feb 23 16:27:41 2018:  (4) /usr/local/lib/libVC4CC.so.1.2 : std::function<vc4c::InstructionWalker (vc4c::Module const&, vc4c::Method&, vc4c::InstructionWalker, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::InstructionWalker, vc4c::Configuration const&) const+0x94 [0xb6864624]
[E] Fri Feb 23 16:27:41 2018:  (5) /usr/local/lib/libVC4CC.so.1.2 : vc4c::optimizations::OptimizationStep::operator()(vc4c::Module const&, vc4c::Method&, vc4c::InstructionWalker, vc4c::Configuration const&) const+0x54 [0xb685dd58]
[E] Fri Feb 23 16:27:41 2018:  (6) /usr/local/lib/libVC4CC.so.1.2 : +0x521158 [0xb685e158]
[E] Fri Feb 23 16:27:41 2018:  (7) /usr/local/lib/libVC4CC.so.1.2 : std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)+0x60 [0xb68654a8]
[E] Fri Feb 23 16:27:41 2018:  (8) /usr/local/lib/libVC4CC.so.1.2 : std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const+0x78 [0xb68644cc]
[E] Fri Feb 23 16:27:41 2018:  (9) /usr/local/lib/libVC4CC.so.1.2 : vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const+0x34 [0xb685dbe0]
[E] Fri Feb 23 16:27:41 2018:  (10) /usr/local/lib/libVC4CC.so.1.2 : +0x5215a8 [0xb685e5a8]
[E] Fri Feb 23 16:27:41 2018:  (11) /usr/local/lib/libVC4CC.so.1.2 : +0x5218c4 [0xb685e8c4]
[E] Fri Feb 23 16:27:41 2018:  (12) /usr/local/lib/libVC4CC.so.1.2 : +0x5228ec [0xb685f8ec]
[E] Fri Feb 23 16:27:41 2018:  (13) /usr/local/lib/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x40 [0xb662f1b4]
[E] Fri Feb 23 16:27:41 2018:  (14) /usr/local/lib/libVC4CC.so.1.2 : threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const+0x58 [0xb662e3d8]
[E] Fri Feb 23 16:27:41 2018:  (15) /usr/local/lib/libVC4CC.so.1.2 : void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)+0x38 [0xb663617c]
[E] Fri Feb 23 16:27:41 2018:  (16) /usr/local/lib/libVC4CC.so.1.2 : std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()()+0x1c [0xb66360e8]
[E] Fri Feb 23 16:27:41 2018:  (17) /usr/local/lib/libVC4CC.so.1.2 : std::thread::_State_impl<std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()> >::_M_run()+0x20 [0xb6636054]
[E] Fri Feb 23 16:27:41 2018:  (18) /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 : +0x9c9dc [0xb6df99dc]
[E] Fri Feb 23 16:27:41 2018: Background worker threw error: Optimizer: There should be no more function-calls: float %call.i209.%call27.i = float convert_float_rtef(float %call.i209.%tmp.110)
[E] Fri Feb 23 16:27:43 2018: Compiler threw exception: Optimizer: There should be no more function-calls: float %call.i209.%call27.i = float convert_float_rtef(float %call.i209.%tmp.110)

MCX ERROR(2):Error: Failed to build program executable! in unit mcx_host.cpp:475
terminate called without an active exception
Aborted

it is not clear to me if this error was still caused by the accidental involvement of double precision data. the error was raised in the optimizer call to convert_float_rtef function.

@doe300, do you have any comment on what might be wrong? thanks

doe300 commented 6 years ago

Looks like I forgot to implement the OpenCL built-in finction convert_float_rte in the standard library.

Im off-line for the next couple of days, so the fix might take some time.

fangq commented 6 years ago

no problem. I look forward to testing this new function once it is added. thank you again for developing this wonderful library!

doe300 commented 6 years ago

https://github.com/doe300/VC4CLStdLib/commit/0cf109f6566b7b9bab915549a81a22b370011c34 adds these functions, can you re-check?

fangq commented 6 years ago

thanks @doe300. I recompiled vc4clstdlib and vc4c, reinstalled both packages, and reran my benchmark (./run_benchmark1.sh -n 1e5 -J "-DUSE_LL5_RAND"), the previously observed optimizer error no longer showed up.

Unfortunately, this is still not enough to get my kernel to run on the RPI. The program stalled after printing the line

Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL -DUSE_LL5_RAND

here is the source code related to this message:

https://github.com/fangq/mcxcl/blob/master/src/mcx_host.cpp#L457-L458

it looks to me that the clBuildProgram() call kept running and not returned, otherwise, it either prints an error message, or print the message "build program complete" on line#477.

From the output of the top command, I don't see clang anymore, the mcxcl thread is taking 100% of the CPU. It has been running for over 50 minutes, but still no new message printing. I am going to kill it.

can you suggest how to debug this and find out what is stalling the clBuildProgram from returning?

thanks again.

fangq commented 6 years ago

I left the code running overnight, this morning, I saw the below error message on the screen, it must have been terminated many hours into the compilation:

root@raspberrypi:/home/pi/temp/mcxcl/example/benchmark# ./run_benchmark1.sh -n 1e5 -J "-DUSE_LL5_RAND"
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
...
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ...    init complete : 1 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL  -DUSE_LL5_RAND
completed clBuildProgram
completed clGetProgramBuildInfo
Kernel build log:
[W] Fri Mar  2 00:30:27 2018: Warnings in precompilation:
[W] Fri Mar  2 00:30:27 2018: <stdin>:1039:1: warning: null character ignored
<U+0000>
^
1 warning generated.

[W] Fri Mar  2 01:22:50 2018: Register conflict resolver has exceeded its maximum rounds, there might still be errors!
[E] Fri Mar  2 01:22:50 2018: Error assigning local to register: %energylaunched.1606
[E] Fri Mar  2 01:22:50 2018:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0xb66ca484]
[E] Fri Mar  2 01:22:50 2018:  (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::qpu_asm::GraphColoring::toRegisterMap() const+0x150 [0xb6778e18]
[E] Fri Mar  2 01:22:50 2018:  (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::qpu_asm::CodeGenerator::generateInstructions(vc4c::Method&)+0x464 [0xb6765f4c]
[E] Fri Mar  2 01:22:50 2018:  (4) /usr/local/lib/libVC4CC.so.1.2 : +0x2edee8 [0xb66cbee8]
[E] Fri Mar  2 01:22:50 2018:  (5) /usr/local/lib/libVC4CC.so.1.2 : +0x2ee0c8 [0xb66cc0c8]
[E] Fri Mar  2 01:22:50 2018:  (6) /usr/local/lib/libVC4CC.so.1.2 : +0x2ef7f8 [0xb66cd7f8]
[E] Fri Mar  2 01:22:50 2018:  (7) /usr/local/lib/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x40 [0xb66d01b4]
[E] Fri Mar  2 01:22:50 2018:  (8) /usr/local/lib/libVC4CC.so.1.2 : threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const+0x58 [0xb66cf3d8]
[E] Fri Mar  2 01:22:50 2018:  (9) /usr/local/lib/libVC4CC.so.1.2 : void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)+0x38 [0xb66d717c]
[E] Fri Mar  2 01:22:50 2018:  (10) /usr/local/lib/libVC4CC.so.1.2 : std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()()+0x1c [0xb66d70e8]
[E] Fri Mar  2 01:22:50 2018:  (11) /usr/local/lib/libVC4CC.so.1.2 : std::thread::_State_impl<std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()> >::_M_run()+0x20 [0xb66d7054]
[E] Fri Mar  2 01:22:50 2018:  (12) /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 : +0x9c9dc [0xb6e9a9dc]
[E] Fri Mar  2 01:22:51 2018: Background worker threw error: Label/Register Mapping: There are erroneous register-associations!
[E] Fri Mar  2 01:22:53 2018: Compiler threw exception: Label/Register Mapping: There are erroneous register-associations!

MCX ERROR(2):Error: Failed to build program executable! in unit mcx_host.cpp:476
terminate called without an active exception
Aborted
doe300 commented 6 years ago

My assumption: The input you are trying to compile is very complex (e.g. has long kernels). The register-allocation is the slowest part of VC4C, especially, if it cannot find a suitable register-mapping for all locals (as in your case). In this case, it will apply some fixes and try again a few times, which of course prolong the execution time even more. Depending on the structure of the failing kernel, VC4C may not be able to find a valid register-mapping at all, since it doesn't support register-spilling yet.

fangq commented 6 years ago

@doe300, sorry for the delay in getting back to you.

MCXCL needs at least 53 registers based on our results on nvidia/amd/intel cpus/gpus. If vc4c does not support spilling, this can be an issue.

what is the maximum number of registers per kernel supported by a VideoCore IV GPU? if there is no such restriction, but has a total size limit for a block, I can perhaps reduce the block size.

also, is there a "wavefront/warp" size? in VideoCore IV? in nvidia hardware, I typically set my block size to 32 or 64; in AMD GPUs, I typically use 64 for block size.

doe300 commented 6 years ago

A single QPU has 64 + 3/4 registers available. Since kernels are not threaded, every kernel has the same number of available registers. But not all registers can be used in instructions together, also additional registers may be reserved by variables not specified in the kernel source-code. The VideoCore IV has 12 processors (which can execute independently and limit the work-group size to 12 work-elements, one per QPU) with 16 SIMD-elements each (execute the same code, and can be leveraged by using vector-types, e.g. intto int16).

I think the best/only way to solve this is to implement register-spilling, since this problem will occur on any more complex kernel.