KhronosGroup / SPIRV-LLVM-Translator

A tool and a library for bi-directional translation between SPIR-V and LLVM IR
Other
464 stars 208 forks source link

Invalid OpenCL version and non-cpp source language lead to crash while generating device-dependent OpenCL binaries from a valid SPIR-V input #2515

Open VyacheslavLevytskyy opened 3 months ago

VyacheslavLevytskyy commented 3 months ago

At the moment, generation of device-dependent OpenCL program binary from a valid SPIR-V input crashes in two slightly different, but related cases:

  1. Let's consider a simple reproducer, referred further as ocl-ver-crash.ll:
    
    target datalayout = "e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-n8:16:32:64"
    target triple = "spir64-unknown-unknown"

define weak_odr dso_local spir_kernel void @foo() { entry: %myval.i = alloca float, align 4 %myval.ascast.i = addrspacecast ptr %myval.i to ptr addrspace(4) store float 0.000000e+00, ptr %myval.i, align 4 %call.i.i.i = call spir_func float @_Z21__spirv_AtomicFMaxEXT(ptr addrspace(4) %myval.ascast.i, i32 1, i32 896, float 1.230000e+02) ret void }

declare dso_local spir_func float @_Z21__spirv_AtomicFMaxEXT(ptr addrspace(4), i32, i32, float)


After creating SPIR-V by `llvm-as ocl-ver-crash.ll -o - | llvm-spirv --spirv-ext=+SPV_EXT_shader_atomic_float_min_max -o ocl-ver-crash.spv` and running AOT by `opencl-aot ocl-ver-crash.spv --device=cpu --cmd=build` we observe a crash of the build followed by this confusing diagnostics:

Failed to build device program CompilerException Failed to lookup symbol foo JIT session error: Symbols not found: [ _Z10atomic_maxPU3AS4Vff ] Failed to materialize symbols: { (main, { foo }) }


Let's look into LLVM IR that is produced from the problematic SPIRV (`llvm-spirv ocl-ver-crash.spv -r -o - | llvm-dis -o ocl-ver-crash-to-run.ll`):

; ModuleID = '' target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024" target triple = "spir64-unknown-unknown"

; Function Attrs: nounwind define spir_kernel void @foo() #0 !kernel_arg_addr_space !3 !kernel_arg_access_qual !3 !kernel_arg_type !3 !kernel_arg_type_qual !3 !kernel_arg_base_type !3 !spirv.ParameterDecorations !3 { entry: %myval.i = alloca float, align 4 %myval.ascast.i = addrspacecast ptr %myval.i to ptr addrspace(4) store float 0.000000e+00, ptr %myval.i, align 4 %call.i.i.i = call spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4) %myval.ascast.i, float 1.230000e+02) #0 ret void }

; Function Attrs: nounwind declare spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4), float) #0

attributes #0 = { nounwind }

!spirv.MemoryModel = !{!0} !opencl.enable.FP_CONTRACT = !{} !spirv.Source = !{!1} !opencl.spir.version = !{!2} !opencl.ocl.version = !{!1} !opencl.used.extensions = !{!3} !opencl.used.optional.core.features = !{!3} !spirv.Generator = !{!4}

!0 = !{i32 2, i32 2} !1 = !{i32 0, i32 0} !2 = !{i32 1, i32 2} !3 = !{} !4 = !{i16 6, i16 14}


There is a correct declaration and call to `@_Z10atomic_maxPU3AS4Vff`, however, generation of the device-dependent OpenCL program binary fails due to the wrong OpenCL version:

!opencl.ocl.version = !{!1} !1 = !{i32 0, i32 0}


**As a side note, this invalid version would not be a problem for AOT/JIT when using the atomic buildin with global address space, but for generic address space evidently there are requirements to the version.**

We can see from `ocl-ver-crash.spv ` contents why Translator creates a wrong OpenCL version:

... OpSource Unknown 0 ...


Such an input along with a logic of SPIRVToLLVM::transSourceLanguage():
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/e1f7ebee5ed8b9021236d1d5e2292bcb45df6988/lib/SPIRV/SPIRVReader.cpp#L4823
lead to 0.0 OpenCL version.

This is the first part of the issue. It has a clear path forward how to fix it, by not allowing producing invalid OpenCL version in SPIRVToLLVM::transSourceLanguage().

2. Let's set source language version (1.0) inside SPIRV file as the following and create a new SPIRV input `ocl-ver-crash-ver1.0.spv`:

OpSource OpenCL_C 100000


Translator generates a valid OpenCL 1.0 version this time, and refers to `OpenCL_C` source language (that is `3` in `!1 = !{i32 3, i32 100000}`):

... %call.i.i.i = call spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4) %myval.ascast.i, float 1.230000e+02) #0 ... declare spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4), float) #0 ... !spirv.Source = !{!1} !1 = !{i32 3, i32 100000} !opencl.ocl.version = !{!3} !3 = !{i32 1, i32 0}


Running `opencl-aot ocl-ver-crash-ver1.0.spv --device=cpu --cmd=build` we get the same diagnostic, as before, no changes to better:

CompilerException Failed to lookup symbol foo JIT session error: Symbols not found: [ _Z10atomic_maxPU3AS4Vff ] Failed to materialize symbols: { (main, { foo }) }


Let's get back to SPIRV input and set OpSource using CPP rather than C source language, and create a new SPIRV input `ocl-ver-crash-ver1.0CPP.spv`:

OpSource OpenCL_CPP 100000


Output LLVM IR after `llvm-spirv ocl-ver-crash-ver1.0CPP.spv -r -o - | llvm-dis -o ocl-ver-crash-ver1.0CPP.ll` is:

!spirv.Source = !{!1} !opencl.ocl.version = !{!3} !1 = !{i32 4, i32 100000} !3 = !{i32 1, i32 0}


The only change is a reference to CPP language (value `4`):

git diff ocl-ver-crash-ver1.0.ll ocl-ver-crash-ver1.0CPP.ll diff --git a/ocl-ver-crash-ver1.0.ll b/ocl-ver-crash-ver1.0CPP.ll index 458526c..c00c6e2 100644 --- a/ocl-ver-crash-ver1.0.ll +++ b/ocl-ver-crash-ver1.0CPP.ll

!0 = !{i32 2, i32 2} -!1 = !{i32 3, i32 100000} +!1 = !{i32 4, i32 100000} !2 = !{i32 1, i32 2}



Let's try the new version of SPIRV input (`ocl-ver-crash-ver1.0CPP.spv`). Now `opencl-aot ocl-ver-crash-ver1.0CPP.spv --device=cpu --cmd=build` executes successfully, the crash has gone.

This chain of input SPIRV transformation shows a dependency of the issue on C/C++ source language, independent from OpenCL version issue, suggests C on mangling vs. C++ mangling, and it also may (or may not) be related to https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/2513.

_Note, that there is no user-defined mangling neither in the input LLVM IR, where builtin is referred to as `@_Z21__spirv_AtomicFMaxEXT`, not in generated SPIR-V code, where it's transformed to `OpAtomicFMaxEXT`. This points out to the second issue in Translator that needs further investigation and maybe further discussion._

FYI @MrSidims @svenvh @asudarsa 
asudarsa commented 3 months ago

Hi @VyacheslavLevytskyy

Thanks for opening this issue. I think this documentation in SPIR-V spec needs to be considered in this discussion. https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpSource

It says: OpSource - Document what source language and text this module was translated from. This has no semantic impact and can safely be removed from a module.

Thanks

VyacheslavLevytskyy commented 3 months ago

Thank you @asudarsa , a good point! For me it sounds like an argument that a difference just in OpSource parameters (or absence of OpSource) can't be tolerated as a reason of crash of device-dependent binaries building. In other words, Translator has to account for the attached reproducer in a way that guarantees successful building of the reproducer with all mentioned in the description options for OpSource arguments: without a version (or without OpSource itself), with a version but C language instead of CPP, and with a version and CPP source language.

MrSidims commented 3 months ago

Thanks for the report. Lets go step by step.

  1. I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/80dfd864ae556369747074bbde8fc1d4c48547be ? So we have to either change this logic or live with it a while longer.
  2. "JIT session error: Symbols not found: [ _Z10atomic_maxPU3AS4Vff ]" so this error comes from some device compiler, right? So some device compiler relies on the metadata to (I assume) to define which builtin library to load and link. If we use another device compiler, would we see the same behavior or not? Either way it's also a good idea to talk with some device compiler developers and clarify if my theory is correct.
  3. While OpSource doesn't carry semantic information, I'm not so sure about the metadata, https://registry.khronos.org/SPIR/specs/spir_spec-2.0.pdf doesn't say it's explicitly, also it doesn't say explicitly if it's obligatory metadata to be generated by frontend or not (at least I can't find such wording). Lets assume the worst case scenario and that is obligatory metadata. If so, is the translator obliged to generate it using OpenCL version and if yes, what is the default value? I don't have a strong opinion about that.

While I was writing the comment Slava has posted one more, so:

can't be tolerated as a reason of crash of device-dependent binaries building

Not a crash, but unresolved symbol after supposed linking with builtin library, which is failure in device's compiler, and following that:

Translator has to account for the attached reproducer in a way that guarantees successful building of the reproducer with all mentioned in the description

How? We don't know the target device and with a further extend its compiler and this compiler's implementation details (dependency on opencl.version metadata is an implementation detail).

I agree, that if we generate opencl.version metadata, then it's value should be not 0.0, but something more valid. What I'm not sure about is that if we must to generate it and what would be the default value. Probably better examination of https://registry.khronos.org/SPIR/specs/spir_spec-2.0.pdf is required. @svenvh may be you have some ideas?

svenvh commented 3 months ago

I wonder if we should avoid relying on metadata at all, and instead tell the translator through a command line option what to do? See also #792.

MrSidims commented 3 months ago

we should avoid relying on metadata at all

Definitely! It should be indeed looking like OpenCL's -cl-std option. Yet what is unclear to me is that if in this very case SPIRVToOCL makes something funny, or is it just about device's compiler way to link builtin libraries? For instance _Z10atomic_maxPU3AS4Vff suggests, that it uses Generic pointer, which was introduced in OpenCL 2.0 and hence without the metadata it might so happen, that 1.0 library is linked, which wouldn't have AS4 mangling. Probably this option may also control this metadata generation (if it's obligatory, yet after reading spir spec I'm unsure).

asudarsa commented 3 months ago

I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/80dfd864ae556369747074bbde8fc1d4c48547be ? So we have to either change this logic or live with it a while longer.

One quick pointer here. https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/80dfd864ae556369747074bbde8fc1d4c48547be does not rely on OpSource. It relies on metadata provided in LLVM IR. I think we should avoid relying on OpSource during reverse translation. But we can rely on LLVM metadata for forward translation. If we stop relying on metadata during forward translation, we have to revert/fix the entire OCLToSPIRV flow which is predicated by checking if source is OpenCL C.

asudarsa commented 3 months ago

I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit 80dfd86 ? So we have to either change this logic or live with it a while longer.

One quick pointer here. 80dfd86 does not rely on OpSource. It relies on metadata provided in LLVM IR. I think we should avoid relying on OpSource during reverse translation. But we can rely on LLVM metadata for forward translation. If we stop relying on metadata during forward translation, we have to revert/fix the entire OCLToSPIRV flow which is predicated by checking if source is OpenCL C.

+1 for using -cl-std like option instead of metadata.

VyacheslavLevytskyy commented 3 months ago

I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit 80dfd86 ? So we have to either change this logic or live with it a while longer.

One quick pointer here. 80dfd86 does not rely on OpSource. It relies on metadata provided in LLVM IR. I think we should avoid relying on OpSource during reverse translation. But we can rely on LLVM metadata for forward translation. If we stop relying on metadata during forward translation, we have to revert/fix the entire OCLToSPIRV flow which is predicated by checking if source is OpenCL C.

+1 for using -cl-std like option instead of metadata.

Yes, I'd also vote for this option.

haonanya commented 1 week ago

opencl-aot ocl-ver-crash.spv --device=cpu --cmd=build doesn't offer openCL version, by default it's CL1.2. When translate spirv to llvm IR, llvm-spirv -r --spirv-target-env=CL1.2 ocl-ver-crash.spv is performed. c++filt _Z10atomic_maxPU3AS4Vff gets atomic_max(float volatile AS4*, float) which is not found in https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/opencl-c.h. opencl-aot ocl-ver-crash.spv --device=cpu --cmd=build --bo="-cl-std=CL2.0" doesn't crash. https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/test/extensions/EXT/SPV_EXT_shader_atomic_float_/AtomicFMaxEXT.ll#L8 has output for different spirv-target-env.