ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License
3.71k stars 528 forks source link

Bug with hipcc -O0 on gfx1035 and gfx906 #3183

Open drtpotter opened 1 year ago

drtpotter commented 1 year ago

Hi there,

I understand that the gfx1035 architecture on an integrated Radeon 680M is not yet officially supported by AMD. However if it is of interest to future support then there is a runtime bug with HIP when compiling with hipcc and optimisation level -O0.

This matrix multiplication code works as expected with normal compilation.

hipcc mat_mult_bugreport.cpp -o a.out ./a.out Maximum error (infinity norm) is: 5.72205e-06

When compiling with hipcc and optimisation level -O0 it does not produce the expected result.

hipcc -O0 mat_mult_bugreport.cpp -o a.out ./a.out Maximum error (infinity norm) is: 3.39192e+38

Please find attached the code for this below.

mat_mult_bugreport.cpp.txt

jatinx commented 1 year ago

HIP does not support gfx1035 officially. That being said, the ISA should be similar to gfx1031 on which this runs fine.

Can you share the ROCm version and HIP version (hipcc --version), I can try to look at it.

drtpotter commented 1 year ago

Hi Jatin,

Sure, here is the requested info for the rocm version I am using.

HIP version: 5.4.22802-aaa1e3d8 AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.4.1 22465 d6f0fe8b22e3d8ce0f2cbd657ea14b16043018a5) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /opt/rocm-5.4.1/llvm/bin

I did not see the error on gfx90a with rocm 5.0.2, just on my laptop with gfx1035 and rocm 5.4.1 (the version above). The infinity norm should be in the range 10^(-6) - 10^(-5).

yxsamliu commented 1 year ago

can you compile the code with -save-temps and attach the dumped -gfx1035.bc and -gfx1035.s files? thanks.

drtpotter commented 1 year ago

Sure here are the temp files for when compilation with -O0 produces an error.

hipcc -O0 -save-temps mat_mult_bugreport.cpp -o a.out

The files for buggy compilation are here

mat_mult_bugreport-hip-amdgcn-amd-amdhsa-gfx1035.bc.txt

mat_mult_bugreport-hip-amdgcn-amd-amdhsa-gfx1035.s.txt

drtpotter commented 1 year ago

Here are the temp files for normal compilation where the code works fine.

hipcc -save-temps mat_mult_bugreport.cpp -o a.out

mat_mult_bugreport-hip-amdgcn-amd-amdhsa-gfx1035.bc.txt

mat_mult_bugreport-hip-amdgcn-amd-amdhsa-gfx1035.s.txt

yxsamliu commented 1 year ago

I saw the test passing on gfx1031 with llvm commit hash 51371ceeab92. The ISA for gfx1031 is almost the same as gfx1035, therefore it should pass on gfx1035 too. The most significant difference between the ISA that is passing and failing is saving and restore exec when restoring spilled sreg from vreg, which is likely due to https://reviews.llvm.org/D124196. This makes sense since a bug about sreg spilling could cause incorrect results.

Hopefully, the fix will be available in the next ROCm release.

drtpotter commented 1 year ago

Hi folks,

Just an update to this. I am still seeing the problem with ROCM 5.4.3. I now have access to an officially supported GFX906 architecture and I am seeing the same issue on GFX906 with ROCM 5.4.3. I also continue to see the problem with ROCM 5.4.3 on GFX1035.

hipcc mat_mult_bugreport.cpp -o a.out ./a.out

I get the expected result and the error is within machine precision.

Maximum error (infinity norm) is: 5.72205e-06

Now try with option -O0 and I get an erroneous result on gfx906.

hipcc -O0 mat_mult_bugreport.cpp -o a.out

./a.out Maximum error (infinity norm) is: 24.3876

On gfx1035 with ROCM 5.4.3 I still get the erroneus result ./a.out Maximum error (infinity norm) is: 3.25297e+38

goetzgaycken commented 1 year ago

I am experiencing a similar problem.

I am currently experimenting with rocm 5.4.2.50402-104~22.04 on Debian with a custom kernel 6.2.10 on a laptop with an AMD Rembrandt 6800u. (Initially, I was compiling for the wrong offload architecture, gfx1030 instead of gfx1035, unknowingly. Since then I was always compiling with -O0 and -g to understand the problem. But, then I ran into the following problem:

:3:hip_module.cpp           :469 : 15259417612 us: 25435: [tid:0x7fb6a9a2c880]  hipLaunchKernel ( 0x201050, {1,1,1}, {4,4,1}, 0x7ffdf8c44b40, 0, stream:<null> ) 
:3:devprogram.cpp           :2676: 15259417754 us: 25435: [tid:0x7fb6a9a2c880] Using Code Object V4.
:3:devprogram.cpp           :2979: 15259417988 us: 25435: [tid:0x7fb6a9a2c880] For Init/Fini: Kernel Name: _Z23matrix_transpose_kernelPfPKfj
:4:command.cpp              :349 : 15259417999 us: 25435: [tid:0x7fb6a9a2c880] Command (KernelExecution) enqueued: 0x25c0100
:3:rocvirtual.cpp           :703 : 15259418003 us: 25435: [tid:0x7fb6a9a2c880] Arg0:   = ptr:0x7fb5a5e01000 obj:[0x7fb5a5e01000-0x7fb5a5e01040]
:3:rocvirtual.cpp           :703 : 15259418004 us: 25435: [tid:0x7fb6a9a2c880] Arg1:   = ptr:0x7fb5a5e00000 obj:[0x7fb5a5e00000-0x7fb5a5e00040]
:3:rocvirtual.cpp           :778 : 15259418006 us: 25435: [tid:0x7fb6a9a2c880] Arg2:   = val:4
:3:rocvirtual.cpp           :2774: 15259418007 us: 25435: [tid:0x7fb6a9a2c880] ShaderName : _Z23matrix_transpose_kernelPfPKfj
:1:rocvirtual.cpp           :2822: 15259418009 us: 25435: [tid:0x7fb6a9a2c880] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp           :3137: 15259418010 us: 25435: [tid:0x7fb6a9a2c880] AQL dispatch failed!
:4:command.cpp              :179 : 15259418011 us: 25435: [tid:0x7fb6a9a2c880] Command 0x25c0100 complete
:3:hip_module.cpp           :470 : 15259418013 us: 25435: [tid:0x7fb6a9a2c880] hipLaunchKernel: Returned hipSuccess : 
:3:hip_error.cpp            :27  : 15259418016 us: 25435: [tid:0x7fb6a9a2c880]  hipGetLastError (  ) 
...
Validating transposed matrix.
Validation failed with 16 errors.

Finally I found this discussion, Otherwise I presumably would not have tried to switch to O1. When compiling warp_shuffle/main.hip with O1, the problem disappears:

:3:hip_module.cpp           :469 : 15220897062 us: 25374: [tid:0x7fbdaf05d880]  hipLaunchKernel ( 0x2010a0, {1,1,1}, {4,4,1}, 0x7ffe6421a5d0, 0, stream:<null> ) 
:3:devprogram.cpp           :2676: 15220897181 us: 25374: [tid:0x7fbdaf05d880] Using Code Object V4.
:3:devprogram.cpp           :2979: 15220897374 us: 25374: [tid:0x7fbdaf05d880] For Init/Fini: Kernel Name: _Z23matrix_transpose_kernelPfPKfj
:4:command.cpp              :349 : 15220897384 us: 25374: [tid:0x7fbdaf05d880] Command (KernelExecution) enqueued: 0xef9460
:3:rocvirtual.cpp           :703 : 15220897389 us: 25374: [tid:0x7fbdaf05d880] Arg0:   = ptr:0x7fbca7201000 obj:[0x7fbca7201000-0x7fbca7201040]
:3:rocvirtual.cpp           :703 : 15220897390 us: 25374: [tid:0x7fbdaf05d880] Arg1:   = ptr:0x7fbca7200000 obj:[0x7fbca7200000-0x7fbca7200040]
:3:rocvirtual.cpp           :778 : 15220897391 us: 25374: [tid:0x7fbdaf05d880] Arg2:   = val:4
:3:rocvirtual.cpp           :2774: 15220897393 us: 25374: [tid:0x7fbdaf05d880] ShaderName : _Z23matrix_transpose_kernelPfPKfj
:4:rocvirtual.cpp           :862 : 15220897397 us: 25374: [tid:0x7fbdaf05d880] HWq=0x7fbdaf1d0000, Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=3, grid=[4, 4, 1], workgroup=[4, 4, 1], private_seg_size=0, group_seg_size=0, kernel_obj=0x7fbdaefca540, kernarg_address=0x7fbca6200000, completion_signal=0x0
:3:hip_error.cpp            :27  : 15220897403 us: 25374: [tid:0x7fbdaf05d880]  hipGetLastError (  ) 
....
Validating transposed matrix.
Validation passed.
ppanchad-amd commented 5 months ago

@drtpotter Can you please test with latest ROCm 6.0.2 (HIP 6.0.32831) to see if issue still occurs? Thanks!

drtpotter commented 5 months ago

Hi! On GFX1035 I have given up trying to use the amdgpu-dkms driver because it messes around with my laptop display. This behaviour did not occur with ROCM 5.7. On GFX1035 I now just use HIP with the driver in the Linux kernel and that seems to work fine. With the open source kernel driver in kernel 6.5.0-25, GFX1035, and ROCM 6.0.2 I don't see the problem.

On GFX906 with ROCM 6.0.2 and the amdgpu-dkms driver the problem is still present.

hipcc -O0 mat_mult_bugreport.cpp -o a.out
./a.out
Maximum error (infinity norm) is: 24.3876
hipcc -O1 mat_mult_bugreport.cpp -o a.out
./a.out
Maximum error (infinity norm) is: 5.72205e-06

Feel free to incorporate this tool into any CI processes you have going!

d3v-null commented 3 months ago

Hi, I am also getting this error on my Radeon RX 7800 XT with rocm 6.1.2

proof of concept, opti.cu:

#include <hip/hip_runtime.h>

inline __device__ void do_nothing(double *arr) { return; }

__global__ void kernel()
{
    double arr[2045];
    do_nothing(arr);
}

int main()
{
    kernel<<<22, 33>>>();
    hipError_t _ = hipDeviceSynchronize();
    return 0;
}

when I compile with -OO I get a segfault, but not -O1

> hipcc -O0 --offload-arch=gfx1101 opti.cu -o opti && ./opti
Memory access fault by GPU node-2 (Agent handle: 0x1f48310) on address 0x7f9b9fe00000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
> hipcc -O1 --offload-arch=gfx1101 opti.cu -o opti && ./opti
# no segfault
hipcc --version
HIP version: 6.1.40093-bd86f1708
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.1.2/llvm/bin
Configuration file: /opt/rocm-6.1.2/lib/llvm/bin/clang++.cfg
cjatin commented 3 months ago

That sounds like a compiler bug. Will check out the output from the clang version you provided. Will update this when I have something.

cjatin commented 3 months ago

The team can not seem to reproduce the issue internally.

pelahi commented 3 months ago

I can reproduce the optimisation dependent error with

HIP version: 5.7.31921-1949b1621
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.3 23382 f3e174a1d286158c06e4cc8276366b1d4bc0c914)
Target: x86_64-unknown-linux-gnu
Thread model: posix

and the compilation flags of

for ((i=0;i<4;i++))
do 
  hipcc -O${i} --offload-arch=gfx90a opti.cu -o opti-${i}
  ./opt-${i}  1> log.${i}.txt 2> errors.${i}.txt
done

But the oddness is that it is dependent on the array declaration in the code, specifically for smaller sizes of an array there is no issue.

__global__ void kernel()
{
    double arr[2045]; //sizes smaller that 2045 seem to work just fine. 
    do_nothing(arr);
}

Do you have ideas as to what the issue might be?

Can you perhaps provide the symbols from your builds (the output from nm)? It might be useful.

Cheers, Pascal