ROCm 3.0 fails to tune OpenCL kernels with Leela Zero or KataGo

csuji commented 4 years ago

Configuration: Vega Frontier card and the following driver setup:

dkms status
amdgpu, 3.0-6, linux kernel 5.0.0-37-generic, x86_64

When trying to tune Leela Zero (this worked with 2.10):

./leelaz --tune-only some_downloaded_net.gz
...... (output deleted)
Detected 1 OpenCL platforms.
Platform version: OpenCL 2.1 AMD-APP (3052.0)
Platform profile: FULL_PROFILE
Platform name:    AMD Accelerated Parallel Processing
Platform vendor:  Advanced Micro Devices, Inc.
Device ID:     0
Device name:   gfx900
Device type:   GPU
Device vendor: Advanced Micro Devices, Inc.
Device driver: 3052.0 (HSA1.1,LC)
Device speed:  1600 MHz
Device cores:  64 CU
Device score:  1121
Selected platform: AMD Accelerated Parallel Processing
Selected device: gfx900
with OpenCL 2.1 capability.
Half precision compute support: Yes.
Tensor Core support: No.

Started OpenCL SGEMM tuner.
Will try 290 valid configurations.
Failed to compile: 290 kernels.
Failed to find a working configuration.
Check your OpenCL drivers.
Minimum error: 100.000000. Error bound: 0.000100
terminate called after throwing an instance of 'std::runtime_error'
  what():  Tuner failed to find working configuration.
Aborted (core dumped)

When trying to tune with KataGo (this did not work with 2.10 or any other 2.x version I tried):

./katago tuner -model g104-b20c256-s447913472-d241840887/model.txt.gz 
2020-01-07 18:55:01+0100: Loading model...
2020-01-07 18:55:06+0100: Querying system devices...
2020-01-07 18:55:06+0100: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3052.0))
2020-01-07 18:55:06+0100: Found 1 device(s) on platform 0 with type GPU or Accelerator
2020-01-07 18:55:06+0100: Found OpenCL Device 0: gfx900 (Advanced Micro Devices, Inc.)
2020-01-07 18:55:06+0100: Tuner starting...
2020-01-07 18:55:06+0100: Using OpenCL Device 0: gfx900 (Advanced Micro Devices, Inc.) OpenCL 2.0 
==============================================================================
Tuning device 0: gfx900
File does not alrady exist or unable to parse parameters in: /home/xxxxx/.katago/opencltuning/tune_gpugfx900_x19_y19_c256_mv5.txt
Starting fresh tuning, saving results to /home/xxxxx/.katago/opencltuning/tune_gpugfx900_x19_y19_c256_mv5.txt
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 183 different configs
Tuning 0/183 (reference) Calls/sec 1739.63 L2Error 0  transLocalSize0=1 transLocalSize1=1 transLocalSize2=1
...........(output deleted)
Tuning xGemmDirect for convolutions
Testing 56 different configs
allocate SGPR spill should have worked
UNREACHABLE executed at /data/jenkins-workspace/compute-rocm-rel-3.0/external/llvm-project/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:1047!
Aborted (core dumped)

To reproduce, check out https://github.com/leela-zero/leela-zero or https://github.com/lightvector/KataGo, follow the build instructions, download a trained neural network, and run the tuning command shown above.

seesturm commented 4 years ago

For Leela Zero: With a minor change in src/kernels/clblast/xgemm_part2.opencl

-INLINE_FUNC void StoreResults(__global memM* cgm, realM cpm[NWI*MWI/VWM], const int kSizeM) {
+INLINE_FUNC void StoreResults(__global memM* cgm, realM *cpm, const int kSizeM) {

at least "tuning" finishes. But when running the tests, a different OpenCL error is shown:

/tmp/comgr-f07b6e/input/CompileSource:471:27: error: passing 'real (*)[6]' to parameter of type 'real (*)[6]' changes address space of pointer
        __in_transform_eq(x, V, offset, CPpad);
                          ^
/tmp/comgr-f07b6e/input/CompileSource:366:29: note: passing argument to parameter 'x' here
void __in_transform_eq(real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA], __global net_t * restrict V, int offset, int CPpad) {
                            ^

csuji commented 4 years ago

Ok, thank you @seesturm! I saw that error too when trying to compile the dumped gemm kernel via /opt/rocm/bin/x64/clang. The author of Leela Zero took the code from https://github.com/CNugteren/CLBlast a few years ago when he started the project. CLBlast has since changed this function call because another OpenCL implementation had a problem with passing a private pointer to a function parameter that has no address space qualifier. I am not an OpenCL expert, but the standard states:

OpenCL implements the following disjoint address spaces: __global, __local, __constant and __private.

The address space qualifier may be used in variable declarations to specify the region of memory that is used to allocate the object. The C syntax for type qualifiers is extended in OpenCL to include an address space name as a valid type qualifier. If the type of an object is qualified by an address space name, the object is allocated in the specified address name; otherwise, the object is allocated in the generic address space.

The address space names without the __ prefix i.e. global, local, constant and private may be substituted for the corresponding address space names with the __ prefix.

The generic address space name for arguments to a function in a program, or local variables of a function is __private. All function arguments shall be in the __private address space.

__kernel function arguments declared to be a pointer of a type can point to one of the following address spaces only: __global, __local or __constant. A pointer to address space A can only be assigned to a pointer to the same address space A. Casting a pointer to address space A to a pointer to address space B is illegal.

So the default is private, right? (Question to the OpenCL experts.) Either nearly all OpenCL compilers (including NVIDIA's) accept this, or AMD's is now too strict. Anyway, I added a __private qualifier to the function arguments, and now Leela Zero compiles and runs flawlessly:

diff --git a/src/kernels/clblast/xgemm_part2.opencl b/src/kernels/clblast/xgemm_part2.opencl
index b9ff537..0309931 100644
--- a/src/kernels/clblast/xgemm_part2.opencl
+++ b/src/kernels/clblast/xgemm_part2.opencl
@@ -66,7 +66,7 @@ INLINE_FUNC realM MultiplyAddVector(realM cvec, const realM avec, const real bva
 // =================================================================================================

 // Merges the results in Cpm with the global array in Cgm.
-INLINE_FUNC void StoreResults(__global memM* cgm, realM cpm[NWI*MWI/VWM], const int kSizeM) {
+INLINE_FUNC void StoreResults(__global memM* cgm, __private realM cpm[NWI*MWI/VWM], const int kSizeM) {
   #pragma unroll
   for (int _ni = 0; _ni < NWI; _ni += 1) {
     #pragma unroll
diff --git a/src/kernels/convolve3.opencl b/src/kernels/convolve3.opencl
index c422f55..459157e 100644
--- a/src/kernels/convolve3.opencl
+++ b/src/kernels/convolve3.opencl
@@ -106,7 +106,7 @@ void multiply_at(
     *o3 = o.w;
 }

-void __in_transform_eq(real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA], __global net_t * restrict V, int offset, int CPpad) {
+void __in_transform_eq(__private real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA], __global net_t * restrict V, int offset, int CPpad) {

     const int W = BOARD_SIZE;
     const int H = BOARD_SIZE;

CLBlast has fixed this with a for loop over all elements of the matrix, but I am too lazy to backport the new CLBlast version, since a lot more has changed too and nobody knows what else would break.
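
For readers less familiar with C, the 'real (*)[6]' in the error message arises because a parameter declared as an array is adjusted to a pointer, and in OpenCL C that pointer then needs an address space. The sketch below is plain C11, not OpenCL C, so it has no address space qualifiers. The names param_is_row_pointer, store_one and demo_sum are hypothetical, and real is assumed to be float. It also illustrates the general idea of handing elements over by value inside a loop, which avoids passing a private pointer at all; it is not CLBlast's actual code.

```c
#include <assert.h>
#include <stddef.h>

#define WINOGRAD_ALPHA 6   /* matches the 'real (*)[6]' in the error message */
typedef float real;        /* assumption: single precision */

/* A parameter written as a 2-D array is adjusted by the compiler to a
   pointer to its first row. Returns 1 when x has that pointer type. */
static int param_is_row_pointer(real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA]) {
    return _Generic(x, real (*)[WINOGRAD_ALPHA]: 1, default: 0);
}

/* Hypothetical helper: receives one element by value, so the caller's
   local array is never passed as a pointer. */
static void store_one(real *out, size_t index, real value) {
    out[index] = value;
}

/* The caller keeps the array local and loops over its elements. */
static real demo_sum(void) {
    real tile[WINOGRAD_ALPHA][WINOGRAD_ALPHA];
    real out[WINOGRAD_ALPHA * WINOGRAD_ALPHA];

    for (size_t i = 0; i < WINOGRAD_ALPHA; i++)
        for (size_t j = 0; j < WINOGRAD_ALPHA; j++)
            tile[i][j] = 1.0f;

    /* Passing the array name passes a pointer to its first row. */
    assert(param_is_row_pointer(tile));

    for (size_t i = 0; i < WINOGRAD_ALPHA; i++)
        for (size_t j = 0; j < WINOGRAD_ALPHA; j++)
            store_one(out, i * WINOGRAD_ALPHA + j, tile[i][j]);

    real sum = 0.0f;
    for (size_t k = 0; k < WINOGRAD_ALPHA * WINOGRAD_ALPHA; k++)
        sum += out[k];
    return sum;
}
```

Called from a main function, demo_sum() returns 36.0f under these assumptions, since all 36 elements are set to 1.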

Regarding the KataGo seg fault, I did not do any debugging, but it seems to be a different problem.

csuji commented 4 years ago

One of the problematic kernels during tuning: lz_1_gfx900.cl.gz

csuji commented 4 years ago

KataGo uses a newer version of CLBlast as well. The core dump is reproducible when using CLBlast directly:

git clone https://github.com/CNugteren/CLBlast
cd CLBlast
mkdir build
cd build
cmake ..
make -j6
./clblast_tuner_xgemm_direct
(output deleted)
|   20 |    41 |   32    8   16    8   16    2    1    1    1    1 |   OK    587 ms |      0.15 ms |  228.5 |     results match |
|   21 |    41 |   32    8   16    8   16    2    1    2    1    1 |   OK    580 ms |      0.14 ms |  231.9 |     results match |
|   22 |    41 |   32    8   16    8   16    2    2    1    1    1 |   OK    582 ms |      0.16 ms |  214.8 |     results match |
|   23 |    41 |   32    8   16    8   16    2    2    2    1    1 |   OK    615 ms |      0.14 ms |  240.7 |     results match |
|   24 |    41 |   32    8   16    8   16    2    4    1    1    1 |   OK    575 ms |      0.14 ms |  244.0 |     results match |
|   25 |    41 |   32    8   16    8   16    2    4    2    1    1 |   OK    601 ms |      0.15 ms |  224.7 |     results match |
allocate SGPR spill should have worked
UNREACHABLE executed at /data/jenkins-workspace/compute-rocm-rel-3.0/external/llvm-project/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:1047!
Aborted (core dumped)

csuji commented 4 years ago

Update with ROCm 3.3: it still fails, but now at a later tuning step and with just a segmentation fault:

|   ID | total |                                             param |       compiles |         time | GFLOPS |            status |
x------x-------x---------------------------------------------------x----------------x--------------x--------x-------------------x
|  ref |     - |                                                 - |             OK |      0.10 ms |      - |      reference OK |
x------x-------x---------------------------------------------------x----------------x--------------x--------x-------------------x
|    1 |   108 |   64   16    8   16   32    2    2    2    0    0 |   OK   1443 ms |      0.37 ms |   89.5 |     results match |
|    2 |   108 |   64    8   16    8    8    2    2    2    1    1 |   OK   1404 ms |      0.30 ms |  112.7 |     results match |
|    3 |   108 |   16    8   16   16    8    2    1    1    0    0 |   OK    305 ms |      0.18 ms |  190.3 |     results match |
Segmentation fault (core dumped)

Is someone working on this?

nartmada commented 11 months ago

Hi @csuji, please check the latest ROCm documentation and ROCm 5.7.1 to see whether your issue has been resolved. If it has, please close the ticket. Thanks.

nartmada commented 11 months ago

Original ticket is more than a year old and the person that opened the ticket has not responded to the latest request. If this is still an issue, please file a new ticket and we will investigate. Thanks!

ROCm 3.0 fails to tune OpenCL kernels with Leela Zero or KataGo #995