Closed · csuji closed this issue 11 months ago
For Leela Zero: With a minor change in src/kernels/clblast/xgemm_part2.opencl
-INLINE_FUNC void StoreResults(__global memM* cgm, realM cpm[NWI*MWI/VWM], const int kSizeM) {
+INLINE_FUNC void StoreResults(__global memM* cgm, realM *cpm, const int kSizeM) {
at least "tuning" is finishing. But when running the tests some other OpenCL error is shown
/tmp/comgr-f07b6e/input/CompileSource:471:27: error: passing 'real (*)[6]' to parameter of type 'real (*)[6]' changes address space of pointer
__in_transform_eq(x, V, offset, CPpad);
^
/tmp/comgr-f07b6e/input/CompileSource:366:29: note: passing argument to parameter 'x' here
void __in_transform_eq(real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA], __global net_t * restrict V, int offset, int CPpad) {
^
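For what it's worth, the pattern the compiler objects to can be reduced to a tiny standalone kernel like the one below (made-up names, a sketch only; whether it triggers the exact same diagnostic depends on the compiler version): a private array is passed to a helper whose parameter has no explicit address space qualifier.
// Sketch: helper with an unqualified array parameter. Per the OpenCL C spec
// the parameter defaults to __private, but a strict compiler may treat the
// unqualified pointer differently and reject the implicit conversion.
void consume(float x[4][4]) {
    x[0][0] += 1.0f;
}
__kernel void demo(__global float *out) {
    float tmp[4][4];                 // allocated in __private memory
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            tmp[i][j] = 0.0f;
    consume(tmp);                    // same shape of call as __in_transform_eq(x, ...)
    out[get_global_id(0)] = tmp[0][0];
}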
Ok, thank you @seesturm! I saw that error too when trying to compile the dumped gemm kernel via /opt/rocm/bin/x64/clang. The author of Leela Zero took the code from https://github.com/CNugteren/CLBlast a few years ago when he started the project. Upstream has since changed this function call, because another OpenCL implementation also had problems with a private pointer being passed to a function parameter that carries no qualifier. I am not an OpenCL expert, but the standard states:
OpenCL implements the following disjoint address spaces: global, local, constant and private.
The address space qualifier may be used in variable declarations to specify the region of memory that is used to allocate the object. The C syntax for type qualifiers is extended in OpenCL to include an address space name as a valid type qualifier. If the type of an object is qualified by an address space name, the object is allocated in the specified address space; otherwise, the object is allocated in the generic address space.
The address space names without the prefix i.e. global, local, constant and private may be substituted for the corresponding address space names with the prefix.
The generic address space name for arguments to a function in a program, or local variables of a function is private. All function arguments shall be in the private address space.
kernel function arguments declared to be a pointer of a type can point to one of the following address spaces only: global, local or constant. A pointer to address space A can only be assigned to a pointer to the same address space A. Casting a pointer to address space A to a pointer to address space B is illegal.
So the default is private, right? (A question for the OpenCL experts.) Either nearly all OpenCL compilers (including NVIDIA's) are fine with this, or AMD's is now too strict. Anyway, I added a private qualifier to the function arguments and now Leela Zero compiles and runs flawlessly:
diff --git a/src/kernels/clblast/xgemm_part2.opencl b/src/kernels/clblast/xgemm_part2.opencl
index b9ff537..0309931 100644
--- a/src/kernels/clblast/xgemm_part2.opencl
+++ b/src/kernels/clblast/xgemm_part2.opencl
@@ -66,7 +66,7 @@ INLINE_FUNC realM MultiplyAddVector(realM cvec, const realM avec, const real bva
// =================================================================================================
// Merges the results in Cpm with the global array in Cgm.
-INLINE_FUNC void StoreResults(__global memM* cgm, realM cpm[NWI*MWI/VWM], const int kSizeM) {
+INLINE_FUNC void StoreResults(__global memM* cgm, __private realM cpm[NWI*MWI/VWM], const int kSizeM) {
#pragma unroll
for (int _ni = 0; _ni < NWI; _ni += 1) {
#pragma unroll
diff --git a/src/kernels/convolve3.opencl b/src/kernels/convolve3.opencl
index c422f55..459157e 100644
--- a/src/kernels/convolve3.opencl
+++ b/src/kernels/convolve3.opencl
@@ -106,7 +106,7 @@ void multiply_at(
*o3 = o.w;
}
-void __in_transform_eq(real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA], __global net_t * restrict V, int offset, int CPpad) {
+void __in_transform_eq(__private real x[WINOGRAD_ALPHA][WINOGRAD_ALPHA], __global net_t * restrict V, int offset, int CPpad) {
const int W = BOARD_SIZE;
const int H = BOARD_SIZE;
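In other words, if the quoted wording applies, the two declarations below should be equivalent, and the patch merely spells out the default explicitly (illustrative declarations, not code from the repository):
void f(float x[4][4]);              // qualifier omitted: defaults to __private
void f(__private float x[4][4]);    // the same declaration with the default written out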
CLBlast has since fixed this with a for loop over all elements of the matrix, but I am too lazy to backport the new CLBlast version, since a lot more has changed as well and nobody knows what else would break.
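For reference, the idea behind that upstream fix can be illustrated with a toy kernel (simplified types and sizes, not the actual CLBlast code): the caller loops over the private accumulator and passes one value at a time, so no pointer into __private memory crosses the call boundary.
// Toy illustration of the "loop over all elements" approach (not CLBlast code):
void store_one(__global float *dst, const float value, const int index) {
    dst[index] = value;
}
__kernel void demo_store(__global float *dst) {
    float acc[8];                    // private accumulator
    for (int i = 0; i < 8; ++i)
        acc[i] = (float)i;
    for (int i = 0; i < 8; ++i)      // caller iterates; only scalars are passed
        store_one(dst, acc[i], i);
}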
Regarding the KataGo segfault: I did not do any debugging, but it seems to be a different problem.
One of the problematic kernels during tuning: lz_1_gfx900.cl.gz
KataGo uses a newer version of CLBlast as well. The core dump is reproducible with plain CLBlast:
git clone https://github.com/CNugteren/CLBlast
cd CLBlast
mkdir build
cd build
cmake ..
make -j6
./clblast_tuner_xgemm_direct
(output deleted)
| 20 | 41 | 32 8 16 8 16 2 1 1 1 1 | OK 587 ms | 0.15 ms | 228.5 | results match |
| 21 | 41 | 32 8 16 8 16 2 1 2 1 1 | OK 580 ms | 0.14 ms | 231.9 | results match |
| 22 | 41 | 32 8 16 8 16 2 2 1 1 1 | OK 582 ms | 0.16 ms | 214.8 | results match |
| 23 | 41 | 32 8 16 8 16 2 2 2 1 1 | OK 615 ms | 0.14 ms | 240.7 | results match |
| 24 | 41 | 32 8 16 8 16 2 4 1 1 1 | OK 575 ms | 0.14 ms | 244.0 | results match |
| 25 | 41 | 32 8 16 8 16 2 4 2 1 1 | OK 601 ms | 0.15 ms | 224.7 | results match |
allocate SGPR spill should have worked
UNREACHABLE executed at /data/jenkins-workspace/compute-rocm-rel-3.0/external/llvm-project/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:1047!
Aborted (core dumped)
Update with ROCm 3.3: it still fails, but in a later tuning step and with just a segmentation fault:
| ID | total | param | compiles | time | GFLOPS | status |
x------x-------x---------------------------------------------------x----------------x--------------x--------x-------------------x
| ref | - | - | OK | 0.10 ms | - | reference OK |
x------x-------x---------------------------------------------------x----------------x--------------x--------x-------------------x
| 1 | 108 | 64 16 8 16 32 2 2 2 0 0 | OK 1443 ms | 0.37 ms | 89.5 | results match |
| 2 | 108 | 64 8 16 8 8 2 2 2 1 1 | OK 1404 ms | 0.30 ms | 112.7 | results match |
| 3 | 108 | 16 8 16 16 8 2 1 1 0 0 | OK 305 ms | 0.18 ms | 190.3 | results match |
Segmentation fault (core dumped)
Is someone working on this?
Hi @csuji, please check the latest ROCm documentation and ROCm 5.7.1 to see whether your issue has been resolved. If it has, please close the ticket. Thanks.
The original ticket is more than a year old and the person who opened it has not responded to the latest request. If this is still an issue, please file a new ticket and we will investigate. Thanks!
Configuration: Vega Frontier card and
When trying to tune Leela Zero (this worked with ROCm 2.10):
When trying to tune with KataGo (this did not work with 2.10 or any 2.x version I tried):
To reproduce, just check out https://github.com/leela-zero/leela-zero or https://github.com/lightvector/KataGo, follow the build instructions, download a trained neural network, and run.