CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.
Other
157 stars 26 forks source link

Missing "__atomic_fetch_add_8" support #698

Open Sarbojit2019 opened 7 months ago

Sarbojit2019 commented 7 months ago

While running the QuickSilver app with Levelzero as backed I see a crash. Looks like "__atomic_fetch_add_8" support is missing.

QuickSilver : https://github.com/oneapi-src/Velocity-Bench/tree/main/QuickSilver

Error message: CHIP warning [TID 1675333] [1700634335.412708246] : Missing definition for '__atomic_fetch_add_8' CHIP warning [TID 1675333] [1700634335.412844660] : SPIR-V Parser: MemberId 82 not found in type map CHIP warning [TID 1675333] [1700634335.412847436] : SPIR-V Parser: MemberId 82 not found in type map CHIP warning [TID 1675333] [1700634335.412849475] : SPIR-V Parser: MemberId 82 not found in type map CHIP warning [TID 1675333] [1700634335.412873560] : SPIR-V Parser: MemberId 192 not found in type map CHIP warning [TID 1675333] [1700634335.412879170] : SPIR-V Parser: MemberId 192 not found in type map CHIP warning [TID 1675333] [1700634335.412882545] : SPIR-V Parser: MemberId 194 not found in type map CHIP error [TID 1675333] [1700634336.321535904] : hipErrorTbd (ZE_RESULT_ERROR_INVALID_MODULE_UNLINKED ) in /home/sarbojit/src/chipStar/src/backend/Level0/CHIPBackendLevel0.cc:2482:compile

CHIP error [TID 1675333] [1700634336.321705150] : Caught Error: hipErrorTbd error: #1054 (hipErrorTbd Aborted (core dumped)

linehill commented 7 months ago

Could you trace where the __atomic_fetch_add_8 is coming from? The function you see is a compiler built-in and not a HIP language feature. It is possible that there is an user error involved or an unintended device code injection via a constexpr function (constexpr functions are implicitly __host__ __device__ in HIP/CUDA mode).

linehill commented 7 months ago

Attempted to reproduce the issue with:

#include <hip/hip_runtime.h>
__device__ int x = 0;
__global__ void k() { __atomic_fetch_add(&x, 1, __ATOMIC_RELAXED); }

But this gets blocked by an assertion:

$ ../install/bin/hipcc atomic-builtins.hip -c
atomic-builtins.hip:3:23: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
__global__ void k() { __atomic_fetch_add(&x, 1, __ATOMIC_RELAXED); }
                      ^
clang-16: /mnt/md0/linehill/ws-chip-spv-3/llvm-project/llvm/lib/IR/Instructions.cpp:3356: static llvm::CastInst* llvm::CastInst::Create(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, const llvm::Twine&, llvm::Instruction*): Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
...

There is a possibility that the QuickSilver app is compiled with clang without assertions enabled and the compiler produces ill-formed code.

Sarbojit2019 commented 7 months ago

@linehill How do I enable/disable assertion in compiler? I have built clang locally in release mode and using the same for compiling QuickSilver. Are you suspecting it is bad kernel which is creating the issue? As per the repo looks like same code works for AMD.

linehill commented 7 months ago

How do I enable/disable assertion in compiler?

The assertions are enabled with -DCMAKE_BUILD_TYPE=Debug.

Are you suspecting it is bad kernel which is creating the issue?

Not sure yet. There might be very unexpected way the __atomic_fetch_add_8 gets introduced into the device code. Tracing the source of it could help here.

linehill commented 7 months ago

@Sarbojit2019, any success tracing the origin of the __atomic_fetch_add_8? I made a sanity check patch that may help on the tracing. Pull the patch and rebuild chipStar in Debug mode to enable the sanity checker.

pvelesko commented 1 month ago

@Sarbojit2019 status?