JuliaGPU / oneAPI.jl

Julia support for the oneAPI programming toolkit.
https://juliagpu.org/oneapi/

PVC support #406

Closed · maleadt closed this issue 7 months ago

maleadt commented 7 months ago

This issue is to keep track of oneAPI.jl support for PVC hardware.

Remaining issues:

maleadt commented 7 months ago

oneMKL issue:

terminate called after throwing an instance of 'sycl::_V1::exception'
what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Data Center GPU Max 1550'
in kernel: oneapi::mkl::blas::sgemm_itcopy
      From worker 16:
[85716] signal (6.-6): Aborted
in expression starting at none:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__verbose_terminate_handler at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
__terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
__cxa_throw at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_throw.cc:98
_ZN6oneapi3mkl3gpu13build_programEPiPN4sycl3_V15queueEPvS7_iPKcS9_mcS9_Pb at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpuL22mkl_gpu_get_kernel_extEPiPN4sycl3_V15queueEiPKcS8_mcS8_S8_S8_mPKvmbb at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu24mkl_gpu_get_spirv_kernelEPiPN4sycl3_V15queueEiPK22mkl_gpu_spirv_kernel_tPKcSB_ at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu40mkl_blas_gpu_sgemm_copybased_driver_syclEPiPN4sycl3_V15queueEPNS1_14blas_arg_usm_tEP20mkl_gpu_event_list_t at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu30mkl_blas_gpu_sgemm_driver_syclEPiPN4sycl3_V15queueEPNS1_14blas_arg_usm_tEP20mkl_gpu_event_list_t at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu19sgemm_sycl_internalEPN4sycl3_V15queueE10MKL_LAYOUT13MKL_TRANSPOSES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS0_4blas12compute_modeERKSt6vectorINS3_5eventESaISG_EElll at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu10sgemm_syclEPN4sycl3_V15queueE10MKL_LAYOUT13MKL_TRANSPOSES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS0_4blas12compute_modeERKSt6vectorINS3_5eventESaISG_EElll at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl4blas5sgemmERN4sycl3_V15queueE10MKL_LAYOUTNS0_9transposeES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS1_12compute_modeERKSt6vectorINS3_5eventESaISF_EE at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl4blas12column_major4gemmERN4sycl3_V15queueENS0_9transposeES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS1_12compute_modeERKSt6vectorINS4_5eventESaISF_EE at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
onemklSgemm at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so (unknown line)
onemklSgemm at /home/sdp/.julia/dev/oneAPI/lib/support/liboneapi_support.jl:435
unknown function (ip: 0x7f9830cb93ae)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
gemm! at /home/sdp/.julia/dev/oneAPI/lib/mkl/wrappers_blas.jl:1133
generic_matmatmul! at /home/sdp/.julia/dev/oneAPI/lib/mkl/linalg.jl:224
mul! at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/LinearAlgebra/src/matmul.jl:263 [inlined]
mul! at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/LinearAlgebra/src/matmul.jl:237
unknown function (ip: 0x7f9830cb8e10)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:768
#compare#10 at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite.jl:44
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:768
compare at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite.jl:38
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
macro expansion at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite/linalg.jl:290 [inlined]
macro expansion at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:669 [inlined]
macro expansion at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite/linalg.jl:290 [inlined]
macro expansion at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1669 [inlined]
#404 at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite/linalg.jl:278
unknown function (ip: 0x7f98a3988ec9)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#test_linalg_mul!_vector-matrix#403 at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite.jl:80
test_linalg_mul!_vector-matrix at /home/sdp/.julia/packages/GPUArrays/Hd5Sk/test/testsuite.jl:80
unknown function (ip: 0x7f98a3984e15)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#16 at /home/sdp/.julia/dev/oneAPI/test/runtests.jl:84
macro expansion at /home/sdp/.julia/dev/oneAPI/test/setup.jl:52 [inlined]
macro expansion at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
macro expansion at /home/sdp/.julia/dev/oneAPI/test/setup.jl:52 [inlined]
macro expansion at ./timing.jl:503 [inlined]
top-level scope at /home/sdp/.julia/dev/oneAPI/test/setup.jl:51
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:925
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
runtests at /home/sdp/.julia/dev/oneAPI/test/setup.jl:55
unknown function (ip: 0x7f98a3975889)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:812
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:768
#invokelatest#2 at ./essentials.jl:892
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:768
invokelatest at ./essentials.jl:889
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:768
#110 at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
run_work_thunk at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
#109 at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
unknown function (ip: 0x7f98a3973e12)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
start_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1238
Allocations: 80320476 (Pool: 80212250; Big: 108226); GC: 117

ref https://github.com/oneapi-src/oneMKL/issues/308
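An aside, not part of the original report: the mangled `_ZN6oneapi3mkl...` frames in traces like the one above can be demangled with `c++filt` (assumed installed; it ships with binutils), which makes them much easier to read:

```shell
# Demangle a single symbol from the trace above.
echo '_ZN6oneapi3mkl4blas5sgemmE' | c++filt
# prints: oneapi::mkl::blas::sgemm

# Or pipe a whole crash log through it:
# c++filt < crash.log
```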

amontoison commented 7 months ago

I have a similar issue in #411 with the routine getrs_batched!:

      From worker 2:    terminate called after throwing an instance of 'sycl::_V1::exception'
      From worker 2:      what():  OpenCL error -1
      From worker 2:
      From worker 2:    signal (6): Aborted
      From worker 2:    in expression starting at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/test/onemkl.jl:1
      From worker 2:    gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 2:    abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 2:    __verbose_terminate_handler at /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/vterminate.cc:95
      From worker 2:    __terminate at /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
      From worker 2:    terminate at /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
      From worker 2:    __cxa_throw at /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/eh_throw.cc:98
      From worker 2:    _ZN6oneapi3mkl3gpu20mkl_gpu_map_l0_to_clEPiP19_ze_device_handle_tPP13_cl_device_idPP11_cl_context at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4 (unknown line)
      From worker 2:    _ZN6oneapi3mkl3gpuL19build_cl_program_l0EPiP19_ze_device_handle_tiPKcS6_S6_PbbPPcPmPN4sycl3_V15queueE at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4 (unknown line)
      From worker 2:    _ZN6oneapi3mkl3gpu13build_programEPiPN4sycl3_V15queueEPvS7_iPKcS9_mcS9_Pb at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4 (unknown line)
      From worker 2:    _ZN6oneapi3mkl3gpuL22mkl_gpu_get_kernel_extEPiPN4sycl3_V15queueEiPKcS8_mcS8_S8_S8_mPKvmbb at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4 (unknown line)
      From worker 2:    _ZN6oneapi3mkl3gpu14get_OCL_kernelEPiPN4sycl3_V15queueEiPKcS8_S8_ at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4 (unknown line)
      From worker 2:    _ZN6oneapi3mkl6lapack8internal9set_rangeERN4sycl3_V15queueERKSt6vectorINS4_5eventESaIS8_EEPS8_lPllll at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_lapack.so.4 (unknown line)
      From worker 2:    _ZN6oneapi3mkl6lapack11getrs_batchERN4sycl3_V15queueEPNS0_9transposeEPlS8_PPfS8_PS8_SA_S8_lS8_S9_lRKSt6vectorINS3_5eventESaISD_EE at /root/.cache/julia-buildkite-plugin/depots/05c310bb-cd1d-4bbf-ad62-61f2372c55f0/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_lapack.so.4 (unknown line)
      From worker 2:    onemklSgetrs_batch at /workspace/srcdir/oneAPI.jl/deps/src/onemkl.cpp:3551
      From worker 2:    onemklSgetrs_batch at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/lib/support/liboneapi_support.jl:4568
      From worker 2:    unknown function (ip: 0x7ff3965016af)
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    getrs_batched! at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/lib/mkl/wrappers_lapack.jl:455
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
      From worker 2:    do_call at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:126
      From worker 2:    eval_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:215
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:467
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:522
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:522
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:522
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:522
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:522
      From worker 2:    eval_body at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:522
      From worker 2:    jl_interpret_toplevel_thunk at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:750
      From worker 2:    jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:906
      From worker 2:    jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:850
      From worker 2:    ijl_toplevel_eval_in at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:965
      From worker 2:    eval at ./boot.jl:368 [inlined]
      From worker 2:    include_string at ./loading.jl:1428
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    _include at ./loading.jl:1488
      From worker 2:    include at ./client.jl:476 [inlined]
      From worker 2:    #11 at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/test/runtests.jl:78 [inlined]
      From worker 2:    macro expansion at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/test/setup.jl:52 [inlined]
      From worker 2:    macro expansion at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Test/src/Test.jl:1363 [inlined]
      From worker 2:    macro expansion at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/test/setup.jl:52 [inlined]
      From worker 2:    macro expansion at ./timing.jl:463 [inlined]
      From worker 2:    top-level scope at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/test/setup.jl:51
      From worker 2:    jl_toplevel_eval_flex at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:897
      From worker 2:    ijl_toplevel_eval_in at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/toplevel.c:965
      From worker 2:    eval at ./boot.jl:368 [inlined]
      From worker 2:    runtests at /var/lib/buildkite-agent/builds/sagittarius-maleadt-net/julialang/oneapi-dot-jl/test/setup.jl:55
      From worker 2:    unknown function (ip: 0x7ff40d752746)
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
      From worker 2:    jl_f__call_latest at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/builtins.c:774
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
      From worker 2:    do_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/builtins.c:730
      From worker 2:    #invokelatest#2 at ./essentials.jl:729
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
      From worker 2:    do_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/builtins.c:730
      From worker 2:    invokelatest at ./essentials.jl:726
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
      From worker 2:    do_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/builtins.c:730
      From worker 2:    #110 at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285
      From worker 2:    run_work_thunk at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:70
      From worker 2:    macro expansion at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285 [inlined]
      From worker 2:    #109 at ./task.jl:484
      From worker 2:    unknown function (ip: 0x7ff40d7510ff)
      From worker 2:    _jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
      From worker 2:    jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
      From worker 2:    start_task at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/task.c:931
maleadt commented 7 months ago

A couple of breadcrumbs on the MKL PVC issue (also serving as documentation for myself).


@kballeda tried oneAPI 2024.1.0, but that gave the same error.


I tried building the support library with the local compiler on IDC:

withenv("PATH"=>"$(ENV["PATH"]):/opt/intel/oneapi/2024.0/bin/",
        "LD_LIBRARY_PATH"=>"/opt/intel/oneapi/2024.0/lib") do
    cmake() do cmake_path
    ninja() do ninja_path
        run(```$cmake_path -DCMAKE_CXX_COMPILER="icpx"
                           -DCMAKE_CXX_FLAGS="-fsycl -isystem /opt/intel/oneapi/2024.0/include -isystem $include_dir"
                           -DCMAKE_SHARED_LINKER_FLAGS="-L/opt/intel/oneapi/2024.0/lib"
                           -DCMAKE_INSTALL_RPATH="/opt/intel/oneapi/2024.0/lib"
                           -DCMAKE_INSTALL_PREFIX=$install_dir
                           -GNinja -S $(@__DIR__) -B $build_dir```)
        run(`$ninja_path -C $(build_dir) install`)
    end
    end
end

That resulted in the same error. So the issue is probably not with the oneAPI distribution on Conda.


I then wrote a C program that links against ze_loader and oneapi_support, doing everything that oneAPI.jl does:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "level_zero/ze_api.h"
#include "deps/src/onemkl.h"
#include "deps/src/sycl.h"

int main() {
    ze_result_t result;
    ze_driver_handle_t driver = 0;
    ze_device_handle_t device = 0;
    ze_context_handle_t context = 0;
    ze_command_queue_handle_t queue = 0;

    // Initialize oneAPI Level Zero
    result = zeInit(0);
    assert(result == ZE_RESULT_SUCCESS);

    // Initialize the driver
    uint32_t driver_count = 0;
    result = zeDriverGet(&driver_count, NULL);
    assert(result == ZE_RESULT_SUCCESS && driver_count > 0);
    result = zeDriverGet(&driver_count, &driver);
    assert(result == ZE_RESULT_SUCCESS);

    // Create a context
    ze_context_desc_t context_desc = {
        .stype = ZE_STRUCTURE_TYPE_CONTEXT_DESC,
        .pNext = NULL,
        .flags = 0
    };
    result = zeContextCreate(driver, &context_desc, &context);
    assert(result == ZE_RESULT_SUCCESS);

    // Get a device handle
    uint32_t device_count = 0;
    result = zeDeviceGet(driver, &device_count, NULL);
    assert(result == ZE_RESULT_SUCCESS);
    ze_device_handle_t* devices = (ze_device_handle_t*)malloc(device_count * sizeof(ze_device_handle_t));
    result = zeDeviceGet(driver, &device_count, devices);
    assert(result == ZE_RESULT_SUCCESS);
    device = devices[0];

    // Create a command queue
    ze_command_queue_desc_t queue_desc = {
        .stype = ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC,
        .pNext = NULL,
        .ordinal = 0,
        .mode = ZE_COMMAND_QUEUE_MODE_DEFAULT,
        .priority = ZE_COMMAND_QUEUE_PRIORITY_NORMAL,
        .flags = 0
    };
    result = zeCommandQueueCreate(context, device, &queue_desc, &queue);
    assert(result == ZE_RESULT_SUCCESS);

    // Allocate memory
    int m = 10, n = 10, k = 10;
    float *A, *B, *C;
    ze_device_mem_alloc_desc_t alloc_desc = {
        .stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
        .pNext = NULL,
        .flags = 0,
        .ordinal = 0
    };
    result = zeMemAllocDevice(context, &alloc_desc, m * k * sizeof(float), 1, device, (void**)&A);
    assert(result == ZE_RESULT_SUCCESS);
    result = zeMemAllocDevice(context, &alloc_desc, k * n * sizeof(float), 1, device, (void**)&B);
    assert(result == ZE_RESULT_SUCCESS);
    result = zeMemAllocDevice(context, &alloc_desc, m * n * sizeof(float), 1, device, (void**)&C);
    assert(result == ZE_RESULT_SUCCESS);

    // Create SYCL objects
    syclPlatform_t sycl_platform = 0;
    syclDevice_t sycl_device = 0;
    syclContext_t sycl_context = 0;
    syclQueue_t sycl_queue = 0;
    syclPlatformCreate(&sycl_platform, driver);
    syclDeviceCreate(&sycl_device, sycl_platform, device);
    syclContextCreate(&sycl_context, &sycl_device, 1, context, 1);
    syclQueueCreate(&sycl_queue, sycl_context, sycl_device, queue, 1);

    // Call MKL's SGEMM function
    float alpha = 1.0;
    float beta = 0.0;
    onemklSgemm(sycl_queue, ONEMKL_TRANSPOSE_NONTRANS, ONEMKL_TRANSPOSE_NONTRANS,
                m, n, k, alpha, A, m, B, k, beta, C, m);

    return 0;
}

Compile and execute:

gcc wip.c -g -o wip -L/home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib -lze_loader -loneapi_support && LD_LIBRARY_PATH=/home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib ./wip

Here, I'm linking against a locally built liboneapi_support (built using the local toolchain), but it's also possible to link against the one built on Yggdrasil (using the Conda toolchain):

gcc wip.c -g -o wip -L/home/sdp/.julia/artifacts/305749e6c396bed3bac62fdcf1d7f5f6d1accf79/lib -lze_loader -loneapi_support && LD_LIBRARY_PATH=/home/sdp/.julia/artifacts/305749e6c396bed3bac62fdcf1d7f5f6d1accf79/lib ./wip

Interestingly, this C program runs successfully with both versions of the support library. That not only confirms, again, that the toolchain isn't to blame; it also shows that the library as packaged on Yggdrasil seems to work correctly.
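One way to check which copies of the libraries actually get resolved at run time (a sketch on my part; `wip` is the reproducer binary built above) is glibc's dynamic-linker debug output:

```shell
# Ask the glibc dynamic linker to log how each library is resolved,
# then filter for the interesting events.
LD_DEBUG=libs ./wip 2>&1 | grep -E 'find library|trying file'
```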


One thing that's different here from regular Julia execution is that libze_loader (and the libze_intel_gpu back-end) is loaded from the system, not from our artifacts. That could suggest that PVC support is not enabled in our build of NEO/compute-runtime. However, the build logs of NEO v24.09.28717.12 say otherwise: https://buildkite.com/julialang/yggdrasil/builds/9234#018e89f1-94de-414c-ab82-5503091c4df6

[21:22:48] -- All supported platforms:  PVC MTL DG2 ARL TGLLP DG1 RKL ADLS ADLP ADLN ICLLP LKF EHL SKL KBL GLK CFL BXT BDW

Also, I tried to verify this hypothesis by running the C reproducer against our builds of NEO/ze_loader/IGC/gmmlib instead, setting LD_LIBRARY_PATH to the artifact dirs:

oneAPI.oneL0.oneAPI_Level_Zero_Loader_jll.artifact_dir
"/home/sdp/.julia/artifacts/07d2a0b1b466f4d6fab3f80843bd68cb0036c027"

oneAPI.oneL0.NEO_jll.artifact_dir
"/home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2"

oneAPI.oneL0.NEO_jll.libigc_jll.artifact_dir
"/home/sdp/.julia/artifacts/1fad9b4961d944e4422a7f63a4d2a65421e4e126"

oneAPI.oneL0.NEO_jll.gmmlib_jll.artifact_dir
"/home/sdp/.julia/artifacts/be9d1cd776269d571d16522f35ed5c6af4309a4b"

gcc wip.c -g -o wip -L/home/sdp/.julia/artifacts/305749e6c396bed3bac62fdcf1d7f5f6d1accf79/lib -lze_loader -loneapi_support && LD_LIBRARY_PATH=/home/sdp/.julia/artifacts/305749e6c396bed3bac62fdcf1d7f5f6d1accf79/lib:/home/sdp/.julia/artifacts/be9d1cd776269d571d16522f35ed5c6af4309a4b/lib:/home/sdp/.julia/artifacts/1fad9b4961d944e4422a7f63a4d2a65421e4e126/lib:/home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib:/home/sdp/.julia/artifacts/07d2a0b1b466f4d6fab3f80843bd68cb0036c027/lib ./wip

And this works just fine... A couple libraries, like libigdrcl.so and libigdml.so.1 are still loaded from the system for some reason, but it seems unlikely that this is related.

I'm running out of ideas at this point. What complicates all of this is the plugin systems involved (which I don't fully understand). For all I know, they could decide not to load PVC support and have MKL fall back to the (aborting) code path instead; without a way to reproduce that outside of Julia, this is going to be hard to debug.

maleadt commented 7 months ago

> A couple libraries, like libigdrcl.so and libigdml.so.1 are still loaded from the system for some reason, but it seems unlikely that this is related.

About that...

https://github.com/JuliaGPU/oneAPI.jl/blob/f20d471dd86009cde1fd9aff8340e190c39ede07/src/oneAPI.jl#L89

OCL_ICD_VENDORS=/home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/intel-opencl/libigdrcl.so ./wip
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Data Center GPU Max 1550'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)

Okay, that's enough for today. Now that I've uncovered the root issue, it should hopefully be easier to resolve this.

pengtu commented 7 months ago

> A couple libraries, like libigdrcl.so and libigdml.so.1 are still loaded from the system for some reason, but it seems unlikely that this is related.
>
> About that...
>
> https://github.com/JuliaGPU/oneAPI.jl/blob/f20d471dd86009cde1fd9aff8340e190c39ede07/src/oneAPI.jl#L89
>
> OCL_ICD_VENDORS=/home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/intel-opencl/libigdrcl.so ./wip
> terminate called after throwing an instance of 'sycl::_V1::exception'
>   what():  Level-Zero error:700000041879048196
> On device: 'Intel(R) Data Center GPU Max 1550'
> in kernel: oneapi::mkl::blas::sgemm_itcopy
> Aborted (core dumped)
>
> Okay, that's enough for today. Now that I've uncovered the root issue, it should hopefully be easier to resolve this.

I remember that we added the OCL_ICD_VENDORS environment variable to direct libOpenCL.so to load the driver from Julia's artifact repo instead of using the system's libigdrcl.so. It was to fix a problem where the system's libigdrcl.so was not compatible with what oneMKL was built with. It is strange that the libigdrcl.so from Conda is not compatible with oneMKL either. I am confused.

pengtu commented 7 months ago

In your LD_DEBUG=libs trace, the C++ execution uses the system's oneAPI 2024 libOpenCL.so, whereas the Julia run uses the one from the Conda artifact:

C++:

Running MKL gemm...
128030:  find library=libOpenCL.so [0]; searching
128030:   search path=/opt/intel/oneapi/mkl/2024.0/lib  (LD_LIBRARY_PATH)
128030:    trying file=/opt/intel/oneapi/mkl/2024.0/lib/libOpenCL.so
128030:   search path=/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/lib  (LD_LIBRARY_PATH)
128030:    trying file=/opt/intel/oneapi/mkl/2024.0/lib/libOpenCL.so
128030:    trying file=/opt/intel/oneapi/compiler/2024.0/lib/libOpenCL.so
128030:
128030:
128030:  calling fini: ./gemm [0]

Julia:

129138: find library=libOpenCL.so [0]; searching
129138:  search path=/home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib      (RUNPATH from file /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so)
129138:   trying file=/home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libOpenCL.so

If we remove the OCL_ICD_VENDORS setting from the Julia package, Julia will always pick the system's libOpenCL.so. This would require users to install a compatible OpenCL library on their systems. That is undesirable, but if it is the only way to make it work, we will just have to do it and warn users.
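For what it's worth, a quick way to check which libOpenCL.so actually gets used (my sketch, not from the trace above; `wip` and `<pid>` are placeholders):

```shell
# Link-time view: which libOpenCL.so would the loader resolve for the binary?
ldd ./wip | grep -i opencl

# Run-time view: which one did a live process (e.g. a Julia session) map?
grep -E 'libOpenCL|libigdrcl' /proc/<pid>/maps
```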

pengtu commented 7 months ago

Here is a link to how OCL_ICD_VENDORS affects the loading of the OpenCL library: https://manpages.ubuntu.com/manpages/trusty/en/man7/libOpenCL.so.7.html
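Concretely (my summary of that man page): with the ocl-icd loader, OCL_ICD_VENDORS can point either at a directory of `.icd` files, each containing the path of one driver library, or directly at a single driver, which is how oneAPI.jl uses it. A hypothetical sketch of the directory form:

```shell
# Hypothetical sketch: a vendors directory for the ocl-icd loader.
# Each .icd file holds the path of one OpenCL driver library.
mkdir -p /tmp/vendors
echo /path/to/libigdrcl.so > /tmp/vendors/intel.icd  # placeholder path
OCL_ICD_VENDORS=/tmp/vendors ./wip
```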

pengtu commented 7 months ago

Going back to our Level Zero driver build from Intel's compute-runtime: have we added the cmake option -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE? If not, that could be the problem.

pengtu commented 7 months ago

Going back to our Level Zero driver build from the Intel compute-runtime: have we added the cmake option -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE? If not, that could be the problem.

Never mind, the cmake build flag is set to TRUE according to the NEO build log:

[11:39:08]  ---> CMAKE_FLAGS+=(-DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE)

So the Level Zero driver should be good.

maleadt commented 7 months ago

It is strange that the libigdrcl.so in Conda is not compatible with oneMKL

libigdrcl comes from the compute runtime, so we build it on Yggdrasil.

In your LD_DEBUG=libs trace, the C++ execution is using the system's oneAPI 2024 libOpenCL.so

That doesn't matter though; the C++ reproducer above also fails when using the libOpenCL.so from Conda (which is the one we redistribute as part of the oneAPI support library package).

This is undesirable but if it is the only way to make it work, we will just have to do it and warn the users.

It isn't only undesirable, it also won't work, because the problem seems to lie with libigdrcl.so, which comes from the compute-runtime. And I'm not willing to ask users to install the entire oneAPI stack (including compute-runtime, IGC, etc.); seamless installability has been an important feature of oneAPI.jl.

maleadt commented 7 months ago

FWIW, here's the easier way to reproduce the issue by compiling and executing the C++ MWE from above using all our libraries:

julia --project -e '
    using oneAPI
    run(`gcc wip.c -g -o wip -L$(oneAPI.oneL0.oneAPI_Level_Zero_Loader_jll.artifact_dir)/lib -L$(oneAPI.Support.oneAPI_Support_jll.artifact_dir)/lib -lze_loader -loneapi_support`)
    withenv("LD_LIBRARY_PATH" => "$(oneAPI.oneL0.oneAPI_Level_Zero_Loader_jll.artifact_dir)/lib:$(oneAPI.Support.oneAPI_Support_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.libigc_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.gmmlib_jll.artifact_dir)/lib",
            "OCL_ICD_VENDORS" => "$(oneAPI.oneL0.NEO_jll.libigdrcl)") do
        run(`./wip`)
    end'

Adding LD_DEBUG=libs to the withenv reveals that all libraries are loaded from Julia artifacts, with the exception of libigdml.so.


Given that the PVC platform somehow isn't loaded/supported/whatever, I figured that this may be visible using clinfo too. And indeed:

$ julia --project -e '
    using oneAPI
    withenv("LD_LIBRARY_PATH" => "$(oneAPI.Support.oneAPI_Support_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.libigc_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.gmmlib_jll.artifact_dir)/lib",
            "OCL_ICD_VENDORS" => "$(oneAPI.oneL0.NEO_jll.libigdrcl)") do
        run(`clinfo`)
    end'
Number of platforms                               0

Crucially though, when removing the oneAPI_Support_jll artifact from that list (which, as far as clinfo is concerned, should only provide libOpenCL.so), the platform is detected:

$ julia --project -e '
    using oneAPI
    withenv("LD_LIBRARY_PATH" => "$(oneAPI.oneL0.NEO_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.libigc_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.gmmlib_jll.artifact_dir)/lib",
            "OCL_ICD_VENDORS" => "$(oneAPI.oneL0.NEO_jll.libigdrcl)") do
        run(`clinfo`)
    end'
Number of platforms                               1

So it may be an issue with the Conda-provided libOpenCL.so after all? Or oneAPI_Support_jll is somehow breaking PVC OpenCL in another way.

Here's the LD_DEBUG=libs for both: https://gist.github.com/maleadt/16ca4481e887e91a9beb5fb74a56f5f7

maleadt commented 7 months ago

One of the strange things about the Conda libOpenCL we ship is that it has way more dependencies than I would expect:

$ libtree /home/sdp/.julia/artifacts/305749e6c396bed3bac62fdcf1d7f5f6d1accf79/lib/libOpenCL.so
libOpenCL.so.1
├── libintlc.so.5 [rpath]
├── libirng.so [rpath]
│   └── libintlc.so.5 [rpath]
├── libsvml.so [rpath]
│   └── libintlc.so.5 [rpath]
└── libimf.so [rpath]
    └── libintlc.so.5 [rpath]

vs. the system one, having none:

$ libtree /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
libOpenCL.so.1

@pengtu Do you know why the OpenCL ICD loader we pull from the intel-opencl-rt package on Conda is so complex? I'm going to try and see if using a generic ICD loader works better.

maleadt commented 7 months ago

Summary of current status:

@pengtu Can you figure out what the deal is here? As far as I can tell the generic ICD loader works fine wrt. both PVC support and the SYCL Plugin Interface, and it seems like oneMKL is doing strange things with libOpenCL.so that are hard/impossible to debug.