Open ogrisel opened 2 years ago
For information, upon installing dpctl in that env (from the intel channel) I cannot get any device:
$ python -c "import dpctl; print(dpctl.get_devices())"
[]
I also tried to install from the wheel using pip in another empty Python env on the same machine and also get a segfault.
I cannot reproduce the problem on my machine:
$ conda create -y -n dpnp -c intel dpnp
$ conda activate dpnp
$ python -c "import dpnp"
Running on: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
DPCtrl SYCL queue used
SYCL kernels link time: 1.67e-07 (sec.)
Math backend version: Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications
$ python -c "import dpctl; print(dpctl.get_devices())"
[<dpctl.SyclDevice [backend_type.opencl, device_type.cpu, 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz] at 0x7fc182823c30>]
I can reproduce the same problem on a machine with an 11th Gen Intel CPU.
EDIT: I installed the oneAPI basekit from https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/apt.html and if I use the embedded Python env activated by running . /opt/intel/oneapi/setvars.sh then the above import works fine without any segfault.
This confirms that there is a packaging issue for dpnp or its dependencies on the intel channel.
@ogrisel Package dpnp, as well as dpctl and numpy-dppy, rely on SYCL RT to discover devices. This attempts to initialize plug-ins, which will perform discovery of drivers.
SYCL guarantees that host device is always available even if all plug-ins failed. Host device is disabled by default in DPC++ runtime for performance reasons (the salient assumption here is that CPU driver is always available).
You can enable host device by setting env. variable SYCL_ENABLE_HOST_DEVICE=1. Plug-in discovery can be made verbose by setting SYCL_PI_TRACE=1.
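A minimal Python sketch of the two settings above, assuming dpctl is installed in the active environment. The helper names discovery_env and run_discovery are hypothetical; only the two environment variables and the dpctl.get_devices() call come from this thread. Discovery runs in a child process so that the variables are in place before the SYCL runtime is loaded.

```python
import os
import subprocess
import sys


def discovery_env(base=None, host_device=True, trace=True):
    """Return a copy of the environment with the SYCL variables set."""
    env = dict(os.environ if base is None else base)
    if host_device:
        # Re-enable the host device, which the DPC++ runtime disables by default.
        env["SYCL_ENABLE_HOST_DEVICE"] = "1"
    if trace:
        # Make plug-in discovery verbose.
        env["SYCL_PI_TRACE"] = "1"
    return env


def run_discovery(env):
    """List devices in a child interpreter started with the given environment."""
    code = "import dpctl; print(dpctl.get_devices())"
    return subprocess.run(
        [sys.executable, "-c", code], env=env, capture_output=True, text=True
    )
```

Calling run_discovery(discovery_env()) returns a CompletedProcess whose stdout holds the device list and whose stderr holds whatever trace output the runtime emits.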
The CPU driver is loaded by the OpenCL plug-in for SYCL. To succeed, it needs to open ${CONDA_PREFIX}/lib/libintelocl.so and libtask_executor.so.2021.13.11.0, installed by the intel-opencl-rt conda package, and the tbb libraries (tbb.so.12 and tbbmalloc.so.2), installed by the tbb conda package.
It may be insightful to inspect the output of LD_DEBUG=libs python -c "import dpctl; print(dpctl.get_devices())" to identify which library fails to load. All these libraries should be loaded from ${CONDA_PREFIX}/lib.
Loading these from elsewhere may be the problem.
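The check described above can be sketched in Python. This is an illustration only: the function name libs_outside_prefix is hypothetical, and it assumes the glibc loader reports each loaded object on a line containing "calling init: <path>", which may differ on other loaders or versions.

```python
import re

# Assumption: under LD_DEBUG=libs, glibc prints "calling init: /path/to/lib.so"
# for each shared object it initializes.
_INIT_RE = re.compile(r"calling init:\s*(\S+)")


def libs_outside_prefix(ld_debug_text, prefix_lib, names):
    """Return loaded library paths matching `names` that are not under `prefix_lib`."""
    prefix = prefix_lib.rstrip("/") + "/"
    suspects = []
    for path in _INIT_RE.findall(ld_debug_text):
        basename = path.rsplit("/", 1)[-1]
        if any(name in basename for name in names) and not path.startswith(prefix):
            suspects.append(path)
    return suspects
```

To use it, capture the stderr of the LD_DEBUG=libs command and pass it together with os.path.join(os.environ["CONDA_PREFIX"], "lib") and a list such as ["libintelocl", "libtask_executor", "tbb"]. An empty result suggests the libraries of interest came from the conda environment.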
P.S. You do not need to activate the whole of oneAPI to work with DPC++. I only activate the compiler, like so: source /opt/intel/oneapi/compiler/latest/env/vars.sh.
Activation of the compiler, or of oneAPI, likely modifies LD_LIBRARY_PATH, causing the plug-ins to find the correct library and unblocking the device discovery.
Thanks @oleksandr-pavlyk. Setting SYCL_ENABLE_HOST_DEVICE=1 fixes the segfault:
(dpnp) ogrisel@1337book:~$ SYCL_ENABLE_HOST_DEVICE=1 python -c "import dpnp"
Running on: SYCL host device
DPCtrl SYCL queue used
SYCL kernels link time: 1.89e-07 (sec.)
Math backend version: Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications
(dpnp) ogrisel@1337book:~$ SYCL_ENABLE_HOST_DEVICE=1 python -c "import dpctl; print(dpctl.get_devices())"
[<dpctl.SyclDevice [backend_type.host, device_type.host, SYCL host device] at 0x7faf927d4230>]
Still, users should not have to set environment variables to avoid a segfault in a Python program importing a module with the default packaging.
If this problem cannot be fixed by default in libsycl.so.5, maybe the dpnp module could at least make sure that SYCL_ENABLE_HOST_DEVICE is set to avoid the segfault?
Or at least raise a RuntimeError with a helpful error message that asks the user to set the SYCL_ENABLE_HOST_DEVICE variable themselves.
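A sketch of what such a guard could look like, written as plain Python. This is not dpnp's actual code: both function names are hypothetical, and the guard would only help if it ran before the SYCL runtime is loaded. It also cannot intercept a crash that happens inside native code.

```python
import os


def enable_host_device_fallback(environ=os.environ):
    """Set SYCL_ENABLE_HOST_DEVICE=1 unless the user already chose a value."""
    environ.setdefault("SYCL_ENABLE_HOST_DEVICE", "1")
    return environ["SYCL_ENABLE_HOST_DEVICE"]


def require_devices(devices):
    """Raise a catchable error when device discovery returned nothing."""
    if not devices:
        raise RuntimeError(
            "No SYCL device found. Check the driver installation, or set "
            "SYCL_ENABLE_HOST_DEVICE=1 to fall back to the host device."
        )
    return devices
```

Using setdefault keeps an explicit user choice (for example SYCL_ENABLE_HOST_DEVICE=0) intact, so the fallback only applies when nothing was configured.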
Also, this laptop has an Iris graphics card:
$ lspci | grep Iris
00:02.0 VGA compatible controller: Intel Corporation TigerLake-LP GT2 [Iris Xe Graphics] (rev 01)
Why doesn't it show up in the list of SYCL devices returned by dpctl?
You need to install drivers on your machine. Here are instructions for 20.04: https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-focal.html
Perhaps you could try adapting it by changing 'focal' to 'impish'.
Thanks, that did it.
I used the focal repo because there is no repo for impish.
Actually the error message before the segfault points to:
which among many other things has a link to the drivers installation page.
But it would be nice to have a more tailored error message (and raised as a catchable RuntimeError instead of a segfault):
For information, here is the output of LD_DEBUG="libs" python -c "import dpnp" on another Intel machine without the drivers (and probably no GPU either):
@ogrisel Thank you for sharing the output. It seems to indicate that the OpenCL loader does not know where to find the CPU driver. Could you please try the following:
$ OCL_ICD_FILENAMES=libintelocl_emu.so:libalteracl.so:libintelocl.so python -c "import dpctl; print(dpctl.select_cpu_device())"
My output:
$ OCL_ICD_FILENAMES=libintelocl_emu.so:libalteracl.so:libintelocl.so python -c "import dpctl; print(dpctl.select_cpu_device())"
<dpctl.SyclDevice [backend_type.opencl, device_type.cpu, Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz] at 0x7f35f8ccddb0>
Indeed:
(dpnp) ogrisel@drago4:~$ OCL_ICD_FILENAMES=libintelocl_emu.so:libalteracl.so:libintelocl.so python -c "import dpctl; print(dpctl.select_cpu_device())"
<dpctl.SyclDevice [backend_type.opencl, device_type.cpu, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz] at 0x7fe74da727f0>
while without it:
(dpnp) ogrisel@drago4:~$ python -c "import dpctl; print(dpctl.select_cpu_device())"
No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (CL_DEVICE_NOT_FOUND)
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "dpctl/_sycl_device_factory.pyx", line 323, in dpctl._sycl_device_factory.select_cpu_device
File "dpctl/_sycl_device_factory.pyx", line 338, in dpctl._sycl_device_factory.select_cpu_device
ValueError: Device unavailable.
After an upgrade to Ubuntu 22.04, I can no longer list the GPU device with dpctl.
I checked that all the packages listed in https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-focal.html are still installed and that my user belongs to the render unix group:
$ sudo apt-get install \
intel-opencl-icd \
intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
intel-opencl-icd is already the newest version (22.14.22890-1).
libmfx1 is already the newest version (22.3.0-1).
intel-media-va-driver-non-free is already the newest version (22.3.1+ds1-1).
intel-level-zero-gpu is already the newest version (1.2.21786+i643~u20.04).
level-zero is already the newest version (1.6.2+i643~u20.04).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Is there any diagnostic tool I can run to see if the driver is properly loaded? I tried to look through the dmesg log and I did not see anything special, but I am not sure what to look for.
Try checking lspci -nnk | grep i915. My output:
$ lspci -nnk | grep i915
Kernel driver in use: i915
Kernel modules: i915
I get the same output as you do with the lspci command.
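The two checks discussed so far (membership in the render group and the i915 kernel driver being in use) can be scripted. The following is a sketch for Linux only; the function names are hypothetical, and driver_bound expects the text printed by lspci -nnk rather than running the command itself.

```python
import grp
import os


def user_in_group(group_name):
    """Return True if the current process has `group_name` among its groups."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return False  # the group does not exist on this system
    return gid in os.getgroups() or gid == os.getegid()


def driver_bound(lspci_text, driver="i915"):
    """Return True if the lspci -nnk text reports `driver` as the kernel driver in use."""
    wanted = f"Kernel driver in use: {driver}"
    return any(line.strip() == wanted for line in lspci_text.splitlines())
```

Note that user_in_group("render") reflects the groups of the running process, so a membership added after login only shows up in a new session.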
I got another system update and reboot in the meantime, and here are the new versions of the APT packages:
intel-opencl-icd is already the newest version (22.14.22890-1).
libmfx1 is already the newest version (22.3.0-1).
intel-media-va-driver-non-free is already the newest version (22.3.1+ds1-1).
intel-level-zero-gpu is already the newest version (1.3.22597+i699~u20.04).
level-zero is already the newest version (1.7.9+i699~u20.04).
I also updated the versions of the packages in the conda env:
# packages in environment at /home/ogrisel/mambaforge/envs/dpnp:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
black 22.3.0 pyhd8ed1ab_0 conda-forge
bzip2 1.0.8 hb9a14ef_9 intel
ca-certificates 2022.3.18 h06a4308_0 intel
certifi 2021.10.8 py39hf3d152e_2 conda-forge
click 8.1.2 py39hf3d152e_0 conda-forge
dataclasses 0.8 pyhc8e2a94_3 conda-forge
dpcpp-cpp-rt 2022.1.0 intel_3768 intel
dpcpp_cpp_rt 2022.0.1 intel_3633 intel
dpctl 0.12.0 py39h6461980_0 intel
dpnp 0.10.0 py39hafff6e5_0 intel
filprofiler 2022.01.1 py39h2551b06_0 conda-forge
icc_rt 2022.1.0 intel_3768 intel
intel-cmplr-lib-rt 2022.1.0 intel_3768 intel
intel-cmplr-lic-rt 2022.1.0 intel_3768 intel
intel-opencl-rt 2022.1.0 intel_3768 intel
intel-openmp 2022.1.0 intel_3768 intel
intelpython 2022.0.0 0 intel
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 11.2.0 h1d223b6_15 conda-forge
libgomp 11.2.0 h1d223b6_15 conda-forge
libstdcxx-ng 9.3.0 hdf63c60_101 intel
mkl 2022.1.0 intel_223 intel
mkl-dpcpp 2022.1.0 intel_223 intel
mkl-service 2.4.0 py39h4119f30_10 intel
mkl_fft 1.3.1 py39h8344fd8_7 intel
mkl_random 1.2.2 py39h4ac99d2_7 intel
mkl_umath 0.1.1 py39h03fa629_17 intel
mypy_extensions 0.4.3 py39hf3d152e_5 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
numpy 1.21.2 py39hec4e512_7 intel
numpy-base 1.21.2 py39h40791c5_7 intel
openssl 3.0.3 h166bdaf_0 conda-forge
pathspec 0.9.0 pyhd8ed1ab_0 conda-forge
pip 21.2.4 py39h06a4308_0 intel
platformdirs 2.5.1 pyhd8ed1ab_0 conda-forge
python 3.9.7 hf930737_3_cpython conda-forge
python_abi 3.9 2_cp39 conda-forge
readline 8.1 h46c0cb4_0 conda-forge
setuptools 58.0.4 py39h06a4308_0 intel
six 1.16.0 pyhd3eb1b0_0 intel
sqlite 3.36.0 hb9a14ef_3 intel
tbb 2021.5.0 intel_707 intel
tbb4py 2021.5.0 py39_intel_707 intel
threadpoolctl 3.1.0 pyh8a188c0_0 conda-forge
tk 8.6.11 h27826a3_1 conda-forge
tomli 2.0.1 pyhd8ed1ab_0 conda-forge
typed-ast 1.5.3 py39hb9d737c_0 conda-forge
typing_extensions 4.2.0 pyha770c72_1 conda-forge
tzdata 2022a h191b570_0 conda-forge
wheel 0.37.0 pyhd3eb1b0_1 intel
xz 5.2.5 h74280d8_2 intel
zlib 1.2.11.1 h1e99aa7_5 intel
and now I no longer get the segfault when importing dpnp in this env. However, I now get a segfault when listing the devices with dpctl:
$ python -c "import dpctl; print(dpctl.select_cpu_device())"
Abort was called at 39 line in file:
/opt/src/l0_gpu_driver/shared/source/gmm_helper/client_context/gmm_client_context.cpp
Aborted (core dumped)
$ python -c "import dpctl; print(dpctl.get_devices())"
Abort was called at 39 line in file:
/opt/src/l0_gpu_driver/shared/source/gmm_helper/client_context/gmm_client_context.cpp
Aborted (core dumped)
Here is a gdb backtrace for the last command:
Olivier, you need to upgrade level-zero from 1.7.9 to 1.7.15. Please get the Debian files from https://github.com/oneapi-src/level-zero/releases/tag/v1.7.15 and install them using sudo dpkg -i level-zero_1.7.15+u18.04_amd64.deb level-zero-devel_1.7.15+u18.04_amd64.deb.
We experienced a similar issue and the upgrade resolved it.
Thanks, however that does not seem to work in my case:
(dpnp) ogrisel@1337book:~$ dpkg -l level-zero-devel
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================-============-============-=================================
ii level-zero-devel 1.7.15 amd64 oneAPI Level Zero
(dpnp) ogrisel@1337book:~$ dpkg -l level-zero
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-============-============-=================================
ii level-zero 1.7.15 amd64 oneAPI Level Zero
(dpnp) ogrisel@1337book:~$ python -c "import dpctl; print(dpctl.get_devices())"
Abort was called at 39 line in file:
/opt/src/l0_gpu_driver/shared/source/gmm_helper/client_context/gmm_client_context.cpp
Aborted (core dumped)
@ogrisel Please make sure that the level-zero implementation for Intel GPU is also updated:
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================-============-============-========================================================
ii intel-level-zero-gpu 1.3.23063 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
You can get the compute runtime debian packages from https://github.com/intel/compute-runtime
Indeed the version installed with apt is older:
intel-level-zero-gpu is already the newest version (1.3.22597+i699~u20.04).
I manually wget'ed the deb files as described in:
https://github.com/intel/compute-runtime/releases/tag/22.18.23063
and now it works for me, thanks!
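Spotting this kind of mismatch can be automated with a small comparison helper. The sketch below is a simplification, not a full implementation of Debian version ordering: it only compares the leading dotted numeric part and ignores suffixes such as +i699~u20.04. The function names are hypothetical.

```python
import re


def upstream_version(deb_version):
    """Extract the leading dotted numeric part as a tuple of integers."""
    match = re.match(r"\d+(?:\.\d+)*", deb_version)
    if match is None:
        raise ValueError(f"no numeric version in {deb_version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_older(installed, required):
    """Return True if `installed` is numerically lower than `required`."""
    return upstream_version(installed) < upstream_version(required)
```

With the versions quoted in this thread, is_older("1.3.22597+i699~u20.04", "1.3.23063") and is_older("1.7.9+i699~u20.04", "1.7.15") are both true. Comparing integer tuples matters for the second pair, since a plain string comparison would rank 1.7.9 above 1.7.15.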
I think this issue should still stay open as long as a bad driver install causes an import of dpnp or a call to dpctl.get_devices() to kill the Python program with a segfault, though.
Prior to installing the latest intel-level-zero-gpu, I could only get the segfault from dpctl.get_devices(). Shall I open an issue in the dpctl repo and close this one?
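Until the crash is fixed upstream, a caller can protect its own process by running device discovery in a child interpreter first. This is a workaround sketch, not something dpnp or dpctl provides; the probe function is hypothetical and relies on the POSIX convention that subprocess reports a negative return code when the child is killed by a signal.

```python
import signal
import subprocess
import sys


def probe(code, timeout=60):
    """Run `code` in a child interpreter and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        return "killed by " + signal.Signals(-proc.returncode).name
    if proc.returncode != 0:
        return "failed with exit code %d" % proc.returncode
    return "ok"
```

For example, probe("import dpctl; dpctl.get_devices()") would report a signal name such as SIGABRT or SIGSEGV for the crashes described in this thread, while the parent process keeps running and can raise an ordinary exception.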
@ogrisel, there are conda packages from conda-forge. conda install intel-compute-runtime
@isuruf is intel-compute-runtime supposed to include GPU support (similarly to the APT package named intel-level-zero-gpu)?
I uninstalled intel-level-zero-gpu and level-zero from APT and created a new conda env with dpnp and dpctl from the intel channel and intel-compute-runtime from conda-forge, but I cannot use the GPU:
$ python -c "import dpctl; print(dpctl.select_gpu_device())"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "dpctl/_sycl_device_factory.pyx", line 363, in dpctl._sycl_device_factory.select_gpu_device
File "dpctl/_sycl_device_factory.pyx", line 378, in dpctl._sycl_device_factory.select_gpu_device
ValueError: Device unavailable.
I checked that my GPU driver was still active with:
$ lspci -nnk | grep i915
Kernel driver in use: i915
Kernel modules: i915
and the dmesg section on i915 looks good:
[ 1.959042] i915 0000:00:02.0: [drm] VT-d active for gfx access
[ 1.959045] checking generic (4000000000 7e9000) vs hw (603e000000 1000000)
[ 1.959047] checking generic (4000000000 7e9000) vs hw (4000000000 10000000)
[ 1.959048] fb0: switching to i915 from EFI VGA
[ 1.959113] Console: switching to colour dummy device 80x25
[ 1.959152] i915 0000:00:02.0: vgaarb: deactivate vga console
[ 1.959618] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 1.960380] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[ 2.077012] usb 3-2: New USB device found, idVendor=0408, idProduct=5349, bcdDevice= 0.06
[ 2.077015] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 2.077016] usb 3-2: Product: HP HD Camera
[ 2.077017] usb 3-2: Manufacturer: Quanta
[ 2.077018] usb 3-2: SerialNumber: 01.00.00
[ 2.089216] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[ 2.091680] ACPI: video: Video Device [GFX0] (multi-head: yes rom: no post: no)
[ 2.092194] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input13
[ 2.107503] fbcon: i915drmfb (fb0) is primary device
[ 2.121448] Console: switching to colour frame buffer device 240x67
[ 2.149639] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[ 2.220360] usb 3-7: new full-speed USB device number 3 using xhci_hcd
At least, now dpctl raises an exception instead of crashing Python.
@ogrisel, there are conda packages from conda-forge. conda install intel-compute-runtime
For reference, it seems that this package is lagging by several releases:
https://github.com/conda-forge/intel-compute-runtime-feedstock/pulls
But for the record I think it would be great to be able to install those runtime libraries via conda-forge. That would definitely help with adoption of SYCL-optimized libraries on Intel hardware in the Python ecosystem.
Steps to reproduce:
This machine is an Intel machine:
So I think the hardware requirements should be fulfilled and the software requirements should automatically be installed by the conda dependency solver and the intel channel.
Here is the conda env:
Here is a gdb backtrace of the same import statement: