Closed csantosb closed 7 months ago
Make sure you have the necessary permissions to access the GPU hardware in /dev/dri*. You didn't mention which kernel version you are using; is it sufficiently recent (see the README, I think you need at least 6.2)? You can also try running the process under strace to see if anything goes wrong.
Apart from those suggestions though, there's not much we can do, the API being as opaque as it is. If you don't figure it out, it may be best to open an issue on https://github.com/intel/compute-runtime.
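As a rough illustration of the two checks suggested above, the following sketch compares the running kernel against the 6.2 minimum recalled from the README and reports whether the current user can read and write the render nodes. The helper names `version_at_least` and `check_render_nodes` are made up for this example, the `renderD*` node naming is an assumption about a typical Linux DRM setup, and `sort -V` assumes GNU coreutils (or a compatible `sort`).

```shell
#!/bin/sh
# Hypothetical helpers, not part of oneAPI.jl or any Intel tooling.

# version_at_least A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_render_nodes DIR: report read/write access to each render node in DIR.
check_render_nodes() {
    dir="${1:-/dev/dri}"
    for node in "$dir"/renderD*; do
        # If the glob matched nothing, "$node" is the literal pattern.
        [ -e "$node" ] || { echo "no render nodes found in $dir"; return 1; }
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "$node: accessible"
        else
            echo "$node: NOT accessible"
        fi
    done
}

# Strip the distribution suffix, e.g. 6.8.1-arch1-1 becomes 6.8.1.
kernel="$(uname -r | cut -d- -f1)"
if version_at_least "$kernel" "6.2"; then
    echo "kernel $kernel looks recent enough"
else
    echo "kernel $kernel may be too old"
fi
check_render_nodes /dev/dri || echo "check the permissions on /dev/dri"
```

On many distributions, access to the render nodes is granted through group membership (often `render` or `video`), so a "NOT accessible" line usually points at a missing group rather than at the driver.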
I'm using an up-to-date Arch Linux with kernel 6.8.1 and the official Julia binaries.
I'll try your suggestions, thanks. Not sure how to explain the issue upstream, though.
After fixing /dev permissions, please post a strace. Maybe we can see what's up in there.
And maybe also a run with LD_DEBUG=libs.
On Wed, 20 Mar 2024 at 08:39, Tim Besard wrote:
After fixing /dev permissions, please post a strace. Maybe we can see what's up in there.
And maybe also a run with LD_DEBUG=libs.
Here we go:
https://git.sr.ht/~csantosb/traces/tree/69d47a88865754482c216b466a758343b216779b
contains output of
strace -o trace.txt /tmp/julia-1.10.2/bin/julia -e "using oneAPI"
and
export LD_DEBUG=libs; /tmp/julia-1.10.2/bin/julia -e "using oneAPI" 2> trace2.txt
Thanks for your help
It looks like you have some Level Zero things installed globally:
42930: find library=libze_tracing_layer.so.1 [0]; searching
42930: search cache=/etc/ld.so.cache
42930: trying file=/usr/lib/libze_tracing_layer.so.1
42930:
42930:
42930: calling init: /usr/lib/libze_tracing_layer.so.1
openat(AT_FDCWD, "/usr/lib/libze_tracing_layer.so.1", O_RDONLY|O_CLOEXEC) = 20
On my system, it's after that (normally failed) discovery that /dev/dri is scanned:
openat(AT_FDCWD, "/usr/lib/libze_tracing_layer.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
munmap(0x7ff195395000, 34618) = 0
openat(AT_FDCWD, "/dev/dri/by-path", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 17
That doesn't happen in your strace, so I'd guess that the mixing of our libze with your system libze_validation makes the whole thing bail out early.
Could you try removing those system libraries, if only temporarily? Generally, the LD_DEBUG=libs output shouldn't be loading any system libraries (except for core ones like libc, libm, libpthread, etc).
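One way to spot such mixing is to filter the LD_DEBUG=libs output for Level Zero libraries initialised from outside Julia's artifact directory. The sketch below assumes the `calling init: <path>` line format seen in the excerpt above and assumes artifacts live under `~/.julia/artifacts`; the function name `list_foreign_libze` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: list libze_* libraries that were initialised from
# somewhere other than a Julia artifact directory.
# Usage: list_foreign_libze FILE   (FILE holds LD_DEBUG=libs output)
list_foreign_libze() {
    grep 'calling init:' "$1" \
        | grep 'libze_' \
        | grep -v '/\.julia/artifacts/' \
        | awk '{ print $NF }' \
        | sort -u
}

# Example: inspect the trace captured earlier, if it is present.
if [ -f trace2.txt ]; then
    list_foreign_libze trace2.txt
fi
```

An empty result suggests that only the bundled Level Zero libraries were loaded; any path under /usr/lib is a candidate for temporary removal.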
Good point, thanks!
I no longer have any oneAPI-related package on my system:
sudo pacman -Fy libze_intel_vpu.so.1
sudo pacman -Fy libze_tracinglayer.so.1
sudo pacman -Fx libze
:: Synchronizing package databases...
 core is up to date
 extra is up to date
:: Synchronizing package databases...
 core is up to date
 extra is up to date
extra/level-zero-loader 1.15.1-1
    usr/lib/libze_tracing_layer.so.1
extra/intel-compute-runtime 23.48.27912.11-1
    usr/lib/libze_intel_gpu.so
    usr/lib/libze_intel_gpu.so.1
    usr/lib/libze_intel_gpu.so.1.3.27912
extra/intel-oneapi-basekit 2024.0.0.49564-2
    opt/intel/oneapi/2024.0/lib/libze_trace_collector.so
    opt/intel/oneapi/compiler/2024.0/lib/libze_trace_collector.so
extra/level-zero-headers 1.15.1-1
    usr/lib/pkgconfig/libze_loader.pc
extra/level-zero-loader 1.15.1-1
    usr/lib/libze_loader.so
    usr/lib/libze_loader.so.1
    usr/lib/libze_loader.so.1.15.1
    usr/lib/libze_tracing_layer.so
    usr/lib/libze_tracing_layer.so.1
    usr/lib/libze_tracing_layer.so.1.15.1
    usr/lib/libze_validation_layer.so
    usr/lib/libze_validation_layer.so.1
    usr/lib/libze_validation_layer.so.1.15.1
sudo updatedb
locate libze_intel_vpu.so.1
locate libze_tracing_layer.so.1
/home/csantos/.julia/artifacts/521996985d539cc752bbc959f2fd92df020356dc/lib/libze_tracing_layer.so.1
/home/csantos/.julia/artifacts/521996985d539cc752bbc959f2fd92df020356dc/lib/libze_tracing_layer.so.1.16.1
So my system is clean, and this is what I obtain, which is closer to what you get
https://git.sr.ht/~csantosb/traces/tree/6d99b628d22feb97557dff084bf0bcd16ce914cc
OK great, despite the error being the same we do actually see libze scanning /dev now, indicating that the tracing layer mismatch was problematic in the first place.
openat(AT_FDCWD, "/dev/dri/by-path", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 20
fstat(20, {st_mode=S_IFDIR|0755, st_size=80, ...}) = 0
getdents64(20, 0x365f5f0 /* 4 entries */, 32768) = 144
getdents64(20, 0x365f5f0 /* 0 entries */, 32768) = 0
close(20) = 0
openat(AT_FDCWD, "/dev/dri/by-path/pci-0000:00:02.0-render", O_RDWR) = 20
ioctl(20, DRM_IOCTL_VERSION, 0x7fff8e242e80) = 0
ioctl(20, DRM_IOCTL_I915_GETPARAM, 0x7fff8e242ff0) = 0
ioctl(20, DRM_IOCTL_I915_GETPARAM, 0x7fff8e242ff0) = 0
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:00:02.0/drm", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 21
fstat(21, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents64(21, 0x365f5f0 /* 5 entries */, 32768) = 144
getdents64(21, 0x365f5f0 /* 0 entries */, 32768) = 0
close(21) = 0
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:00:02.0/drm/card1/prelim_uapi_version", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242eb0) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242eb0) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242f80) = 0
ioctl(20, DRM_IOCTL_I915_GETPARAM, 0x7fff8e243020) = 0
ioctl(20, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, 0x7fff8e243060) = -1 EINVAL (Invalid argument)
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242f90) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242f90) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242f30) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242f30) = 0
futex(0x366a858, FUTEX_WAKE_PRIVATE, 2147483647) = 0
ioctl(20, DRM_IOCTL_I915_GEM_VM_CREATE, 0x7fff8e242ff0) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242710) = 0
ioctl(20, DRM_IOCTL_I915_QUERY, 0x7fff8e242710) = 0
ioctl(20, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, 0x7fff8e242840) = 0
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:00:02.0/drm", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 21
fstat(21, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents64(21, 0x365f5f0 /* 5 entries */, 32768) = 144
getdents64(21, 0x365f5f0 /* 0 entries */, 32768) = 0
close(21) = 0
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:00:02.0/drm/card1/gt_max_freq_mhz", O_RDONLY) = 21
read(21, "1300\n", 8191) = 5
close(21) = 0
ioctl(20, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, 0x7fff8e242830) = 0
ioctl(20, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, 0x7fff8e242850) = 0
ioctl(20, DRM_IOCTL_I915_GETPARAM, 0x7fff8e2427e0) = 0
readlink("/proc/self/exe", "/tmp/julia-1.10.2/bin/julia", 511) = 27
futex(0x36a3840, FUTEX_WAKE_PRIVATE, 2147483647) = 0
ioctl(20, DRM_IOCTL_I915_GEM_VM_DESTROY, 0x7fff8e2431e0) = 0
close(20) = 0
I don't see anything stand out here. There's a DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM returning EINVAL, but more queries are made after that, so it doesn't seem fatal.
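To scan a long strace log for candidates like this without reading every line, a small filter can list only the calls that returned an error. This is a sketch assuming the plain `strace -o` line format shown above (no PID prefix, as produced without `-f`); the function name `failed_syscalls` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print "syscall ERRNO" for every call in a strace log
# that returned -1, in the order the calls appear.
# Usage: failed_syscalls FILE
failed_syscalls() {
    grep -E '= -1 E[A-Z0-9]+' "$1" \
        | sed -E 's/^([a-z_0-9]+)\(.*= -1 (E[A-Z0-9]+).*/\1 \2/'
}

# Example: summarise the trace captured earlier, if it is present.
if [ -f trace.txt ]; then
    failed_syscalls trace.txt | sort | uniq -c | sort -rn
fi
```

Many failures in such a log are harmless probes (ENOENT while searching library paths, for instance), so the list only narrows down where to look; it does not identify the culprit by itself.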
Maybe also try running with ZE_ENABLE_LOADER_DEBUG_TRACE=1, according to https://github.com/oneapi-src/level-zero?tab=readme-ov-file#debug-trace.
❯ ZE_ENABLE_LOADER_DEBUG_TRACE=1 jl --project examples/vadd.jl
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_gpu.so.1
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_vpu.so.1
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:Load Library of libze_tracing_layer.so.1 failed with libze_tracing_layer.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:check_drivers(flags=0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED))
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_SUCCESS
Here is my output:
ZE_ENABLE_LOADER_DEBUG_TRACE=1 /tmp/julia-1.10.2/bin/julia -e "using oneAPI"
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_gpu.so.1
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_vpu.so.1
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:Load Library of libze_tracing_layer.so.1 failed with libze_tracing_layer.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:check_drivers(flags=0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED))
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
┌ Error: Failed to initialize oneAPI
│ exception =
│ ZeError: driver is not initialized (code 2013265921, ZE_RESULT_ERROR_UNINITIALIZED)
│ Stacktrace:
│ [1] throw_api_error(res::oneAPI.oneL0._ze_result_t)
│ @ oneAPI.oneL0 ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/libze.jl:8
│ [2] check
│ @ ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/libze.jl:19 [inlined]
│ [3] zeInit
│ @ ~/.julia/packages/oneAPI/2gxUb/lib/utils/call.jl:24 [inlined]
│ [4] init()
│ @ oneAPI.oneL0 ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/oneL0.jl:100
│ [5] run_module_init(mod::Module, i::Int64)
│ @ Base ./loading.jl:1134
│ [6] register_restored_modules(sv::Core.SimpleVector, pkg::Base.PkgId, path::String)
│ @ Base ./loading.jl:1122
│ [7] _include_from_serialized(pkg::Base.PkgId, path::String, ocachepath::String, depmods::Vector{Any})
│ @ Base ./loading.jl:1067
│ [8] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt128)
│ @ Base ./loading.jl:1581
│ [9] _require(pkg::Base.PkgId, env::String)
│ @ Base ./loading.jl:1938
│ [10] __require_prelocked(uuidkey::Base.PkgId, env::String)
│ @ Base ./loading.jl:1812
│ [11] #invoke_in_world#3
│ @ ./essentials.jl:926 [inlined]
│ [12] invoke_in_world
│ @ ./essentials.jl:923 [inlined]
│ [13] _require_prelocked(uuidkey::Base.PkgId, env::String)
│ @ Base ./loading.jl:1803
│ [14] macro expansion
│ @ ./loading.jl:1790 [inlined]
│ [15] macro expansion
│ @ ./lock.jl:267 [inlined]
│ [16] __require(into::Module, mod::Symbol)
│ @ Base ./loading.jl:1753
│ [17] #invoke_in_world#3
│ @ ./essentials.jl:926 [inlined]
│ [18] invoke_in_world
│ @ ./essentials.jl:923 [inlined]
│ [19] require(into::Module, mod::Symbol)
│ @ Base ./loading.jl:1746
│ [20] eval
│ @ ./boot.jl:385 [inlined]
│ [21] exec_options(opts::Base.JLOptions)
│ @ Base ./client.jl:291
│ [22] _start()
│ @ Base ./client.jl:552
└ @ oneAPI.oneL0 ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/oneL0.jl:103
Maybe the two lines:
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
provide a hint about where to look for a possible solution.
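When comparing a working and a failing machine, it can help to reduce the loader trace to just the per-driver zeInit result. The sketch below assumes the `init driver ... returning ...` line format seen in the traces in this thread; the function name `zeinit_results` and the file name `loader.txt` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print "driver result" for each zeInit line in a
# captured ZE_ENABLE_LOADER_DEBUG_TRACE=1 log.
# Usage: zeinit_results FILE
zeinit_results() {
    sed -n -E 's/^ZE_LOADER_DEBUG_TRACE:init driver ([^ ]+) zeInit\(.*\) returning (ZE_RESULT_[A-Z_]+)$/\1 \2/p' "$1"
}

# Example (capturing both output streams, since which one the loader
# writes to is not confirmed here):
#   ZE_ENABLE_LOADER_DEBUG_TRACE=1 julia -e "using oneAPI" > loader.txt 2>&1
#   zeinit_results loader.txt
```

A result other than ZE_RESULT_SUCCESS for libze_intel_gpu.so.1 points at the driver library itself rather than at the loader or at the Julia package.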
It does look like the issue is with the compute-runtime, providing libze_intel_gpu. Could you try running with NEOReadDebugKeys=1 PrintDebugMessages=1 PrintXeLogs=1? I'm not too familiar with compute-runtime's inner workings though; maybe @kballeda could suggest what else to try here. If not, I think we'll have to consider filing an issue upstream.
export NEOReadDebugKeys=1; export PrintDebugMessages=1; export PrintXeLogs=1; export ZE_ENABLE_LOADER_DEBUG_TRACE=1; /tmp/julia-1.10.2/bin/julia -e "using oneAPI"
gives
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_gpu.so.1
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_vpu.so.1
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:Load Library of libze_tracing_layer.so.1 failed with libze_tracing_layer.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:check_drivers(flags=0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED))
INFO: System Info query failed!
WARNING: Failed to request OCL Turbo Boost
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
┌ Error: Failed to initialize oneAPI
│ exception =
│ ZeError: driver is not initialized (code 2013265921, ZE_RESULT_ERROR_UNINITIALIZED)
│ Stacktrace:
│ [1] throw_api_error(res::oneAPI.oneL0._ze_result_t)
│ @ oneAPI.oneL0 ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/libze.jl:8
│ [2] check
│ @ ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/libze.jl:19 [inlined]
│ [3] zeInit
│ @ ~/.julia/packages/oneAPI/2gxUb/lib/utils/call.jl:24 [inlined]
│ [4] init()
│ @ oneAPI.oneL0 ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/oneL0.jl:100
│ [5] run_module_init(mod::Module, i::Int64)
│ @ Base ./loading.jl:1134
│ [6] register_restored_modules(sv::Core.SimpleVector, pkg::Base.PkgId, path::String)
│ @ Base ./loading.jl:1122
│ [7] _include_from_serialized(pkg::Base.PkgId, path::String, ocachepath::String, depmods::Vector{Any})
│ @ Base ./loading.jl:1067
│ [8] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt128)
│ @ Base ./loading.jl:1581
│ [9] _require(pkg::Base.PkgId, env::String)
│ @ Base ./loading.jl:1938
│ [10] __require_prelocked(uuidkey::Base.PkgId, env::String)
│ @ Base ./loading.jl:1812
│ [11] #invoke_in_world#3
│ @ ./essentials.jl:926 [inlined]
│ [12] invoke_in_world
│ @ ./essentials.jl:923 [inlined]
│ [13] _require_prelocked(uuidkey::Base.PkgId, env::String)
│ @ Base ./loading.jl:1803
│ [14] macro expansion
│ @ ./loading.jl:1790 [inlined]
│ [15] macro expansion
│ @ ./lock.jl:267 [inlined]
│ [16] __require(into::Module, mod::Symbol)
│ @ Base ./loading.jl:1753
│ [17] #invoke_in_world#3
│ @ ./essentials.jl:926 [inlined]
│ [18] invoke_in_world
│ @ ./essentials.jl:923 [inlined]
│ [19] require(into::Module, mod::Symbol)
│ @ Base ./loading.jl:1746
│ [20] eval
│ @ ./boot.jl:385 [inlined]
│ [21] exec_options(opts::Base.JLOptions)
│ @ Base ./client.jl:291
│ [22] _start()
│ @ Base ./client.jl:552
└ @ oneAPI.oneL0 ~/.julia/packages/oneAPI/2gxUb/lib/level-zero/oneL0.jl:103
Problem fixed for me after a system update.
Thanks a lot for your help !
When I try to run
using oneAPI
(oneAPI v1.4.0), I get the following message.
My
versioninfo()
is
and the output of my
hwinfo --display
gives
One more:
inxi -Fzm
gives me
Any idea?
Thanks