CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

JIT Failures after updating to latest IGC #725

Closed pvelesko closed 9 months ago

pvelesko commented 9 months ago

I use the following script for building and installing the latest intel-compute-runtime: https://github.com/pvelesko/intel-compute-runtime-build

After installing it:

╰─$ ./samples/0_MatrixMultiply/MatrixMultiply
Device name Intel(R) Arc(TM) A380 Graphics
MatrixMultiply: ./lib/SPIRV/SPIRVToLLVMDbgTran.cpp:1076: llvm::DIFile* SPIRV::SPIRVToLLVMDbgTran::getFile(SPIRV::SPIRVId): Assertion `SourceArgs.size() == OperandCount && "Invalid number of operands"' failed.
CHIP error [TID 7664] [1702111633.094515022] : Program BUILD LOG for device #0:Intel(R) Arc(TM) A380 Graphics:
IGC: Internal Compiler Error: Abnormal termination

CHIP error [TID 7664] [1702111633.094554951] : hipErrorNotInitialized (CL_BUILD_PROGRAM_FAILURE ) in /home/pvelesko/space/chipStar/main/src/backend/OpenCL/CHIPBackendOpenCL.cc:739:compile

CHIP error [TID 7664] [1702111633.094599766] : Caught Error: hipErrorNotInitialized
HIP API error

Going back to the old IGC resolves the issue.

linehill commented 9 months ago

... Assertion `SourceArgs.size() == OperandCount && "Invalid number of operands"' failed.

This looks like an llvm-spirv bug that manifests in some of its branches. The buggy assertion can be found, for example, in the LLVM-16 branch, where it expects all OpExtInst ... DebugSource ... instructions to have two operands (OperandCount). That does not seem right with respect to the debug info spec, which states that DebugSource instructions take at least one operand (counted after the OpExtInst's instruction operand). The assertion and the operand count seem to be corrected in other branches such as llvm_release_150 and llvm_release_170.

It could be that the assertion is triggered because the SPIR-V was generated with an llvm-spirv version that produces OpExtInst … DebugSource instructions with a single operand. At least the latest llvm-spirv from the llvm_release_170 branch does this.
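To check which form a given module uses, one could disassemble it to text and count the operands that follow DebugSource. The Python sketch below is only an illustration: it assumes a textual disassembly in which each instruction sits on one line in the form `%id = OpExtInst %type %set DebugSource %operands...`, and the sample input, the function names, and the two check helpers are hypothetical. The "strict" check mimics the assertion that expects exactly two operands, and the "relaxed" check mimics the at-least-one-operand reading of the spec described above.

```python
import re
from typing import List

# Matches a disassembled line such as (assumed format):
#   %11 = OpExtInst %void %1 DebugSource %6 %7
# and captures everything after the DebugSource instruction name.
DEBUG_SOURCE_RE = re.compile(r"OpExtInst\s+\S+\s+\S+\s+DebugSource\b(.*)")


def debug_source_operand_counts(disassembly: str) -> List[int]:
    """Return the operand count of every DebugSource instruction found."""
    counts = []
    for line in disassembly.splitlines():
        line = line.split(";", 1)[0]  # drop trailing disassembler comments
        match = DEBUG_SOURCE_RE.search(line)
        if match:
            counts.append(len(match.group(1).split()))
    return counts


def passes_strict_check(count: int, expected: int = 2) -> bool:
    """Mimics the buggy assertion: exactly `expected` operands required."""
    return count == expected


def passes_relaxed_check(count: int, minimum: int = 1) -> bool:
    """Mimics the spec reading above: at least `minimum` operands required."""
    return count >= minimum


if __name__ == "__main__":
    # Hypothetical disassembly, written by hand for illustration only.
    sample = (
        '%1 = OpExtInstImport "OpenCL.DebugInfo.100"\n'
        "%10 = OpExtInst %void %1 DebugSource %5\n"
        "%11 = OpExtInst %void %1 DebugSource %6 %7\n"
    )
    for count in debug_source_operand_counts(sample):
        print(count, passes_strict_check(count), passes_relaxed_check(count))
```

With the hand-written sample, the single-operand instruction fails the strict check but passes the relaxed one, which is the mismatch suspected here.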

pvelesko commented 9 months ago

I built using LLVM-14 provided by apt. According to the intel-compute-runtime build instructions, LLVM-14 is the supported version, and the SPIRV-LLVM-Translator version should match it. Perhaps the revision from apt is just a bit behind. So overall, this is not related to IGC.

linehill commented 9 months ago

I found out by chance that the -gdwarf-4 option affects the debug info generated for the device. This option is enabled in debug builds of chipStar, and it appears in the compilation of the HIP samples.

On LLVM-17, -gdwarf-4 generates OpExtInst … DebugSource instructions with a single operand, which triggers the buggy assertion. On the other hand, -g generates OpExtInst … DebugSource instructions with two operands.

So we might dodge the assertion by not using the -gdwarf-# option in chipStar.
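As a rough illustration of that workaround, the Python sketch below rewrites a list of compiler flags so that any -gdwarf-N option becomes plain -g. It is not how chipStar's build system handles its flags; the function name and the example flag list are hypothetical, and it assumes that plain -g is an acceptable substitute for the debug build.

```python
import re
from typing import List

# Matches -gdwarf and versioned forms such as -gdwarf-4 or -gdwarf-5.
GDWARF_RE = re.compile(r"^-gdwarf(-\d+)?$")


def replace_gdwarf_flags(flags: List[str]) -> List[str]:
    """Replace every -gdwarf[-N] flag with -g, keeping at most one -g."""
    result = []
    for flag in flags:
        if GDWARF_RE.match(flag):
            flag = "-g"
        if flag == "-g" and "-g" in result:
            continue  # avoid emitting a duplicate -g
        result.append(flag)
    return result


if __name__ == "__main__":
    # Hypothetical flag list for illustration only.
    print(replace_gdwarf_flags(["-O0", "-gdwarf-4", "-c"]))
```

Whether dropping the explicit DWARF version is safe depends on why it was added in the first place, so this is a sketch of the idea rather than a recommended change.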

pvelesko commented 9 months ago

This option was originally introduced because there was some issue using gdb without it. Perhaps it's no longer necessary.

pvelesko commented 9 months ago

Ran into another issue in IGC: https://github.com/intel/intel-graphics-compiler/issues/310

But overall, this is resolved for us.