I haven't seen this kind of error before when CUDA is around. Is CUDA correctly being brought into the docker container (can you run nvidia-smi in the container)?
cc @charlesbluca in case you have thoughts here
Nothing stands out to me here - looking at nvmlDeviceGetName, it returns the value of a ctypes.c_char_Array_*, which from my limited understanding of ctypes should always be bytes (cc @rjzamora in case this isn't always the case).
I'd be interested in seeing what the return value of this query is with a minimal reproducer:
from pynvml import *
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetName(h)
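As a quick check, printing the type of that return value should show which behaviour you're getting; a minimal sketch, assuming NVML is available and at least one GPU is visible:
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetName

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
name = nvmlDeviceGetName(h)
# Older pynvml builds return bytes here; the nvidia-ml-py versions discussed in this
# thread return str, which is what breaks the unconditional .decode() in distributed.
print(type(name), name)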
Also, I noticed that pynvml wasn't listed in your environment; are you able to import it and check the version?
I got hit by this yesterday whilst preparing to teach one of my Higher Performance Python courses. It looked like the latest Dask was at fault until I realised I had a strange dependency issue via a second library (scalene), which installs nvidia-ml-py, which in turn provides a pynvml module.
Reproducible example:
# use conda to install a fresh environment, Python 3.9, current dask but _no_ scalene
$ ipython
In [1]: import dask.distributed # version 2022.03.0
In [2]: dask.distributed.Client() # runs with success
Out[2]: <Client: 'tcp://127.0.0.1:33537' processes=4 threads=8, memory=31.12 GiB>
# Now install `pynvml` to get the bug...
$ pip install scalene  # sidenote: can be replaced with just `pip install nvidia_ml_py` to generate the same bug, see below
...
Successfully installed commonmark-0.9.1 nvidia-ml-py-11.515.0 rich-12.0.1 scalene-1.5.5
$ ipython
In [1]: import dask.distributed
In [2]: dask.distributed.Client()
...
File ~/miniconda3/envs/course3/lib/python3.9/site-packages/distributed/diagnostics/nvml.py:123, in _get_name(h)
121 def _get_name(h):
122 try:
--> 123 return pynvml.nvmlDeviceGetName(h).decode()
124 except pynvml.NVMLError_NotSupported:
125 return None
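For illustration, a version-tolerant variant of that helper could decode only when the bindings hand back bytes; a minimal sketch, not the actual change later made in distributed:
import pynvml

def get_device_name(handle):
    # Older pynvml returns bytes from nvmlDeviceGetName, newer nvidia-ml-py returns str;
    # only call .decode() when we actually received bytes.
    try:
        name = pynvml.nvmlDeviceGetName(handle)
    except pynvml.NVMLError_NotSupported:
        return None
    return name.decode() if isinstance(name, bytes) else name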
https://github.com/plasma-umass/scalene is a relatively new combined CPU+memory profiler (Mac/Linux only), and a more recent addition has been to profile GPUs as well as CPUs. https://github.com/plasma-umass/scalene/issues/378 notes that pynvml is not optional there but could be made so (and I'd agree).
Whilst the bug is not directly in Dask, the fact that Dask uses pynvml if it is installed (somehow; I've not dug into how) and then fails feels brittle. I don't know how else one would end up with pynvml installed. I can confirm that if I don't install scalene and only run $ pip install nvidia_ml_py, then the above bug is easily reproduced.
I'm working on Linux (Mint 20.3) via conda, with a fresh Python 3.9 installation of the standard data science tools (Dask, pandas, NumPy, etc.).
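Since both the pynvml and nvidia-ml-py distributions install a module named pynvml, checking the installed distributions (rather than the module) shows which one an environment actually has; a small sketch using only the standard library (Python 3.8+):
import importlib.metadata

# Both distributions provide a `pynvml` module, so query package metadata instead.
for dist in ("pynvml", "nvidia-ml-py"):
    try:
        print(dist, importlib.metadata.version(dist))
    except importlib.metadata.PackageNotFoundError:
        print(dist, "not installed")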
+1 I had the exact same problem as @ianozsvald
The Dask + Scalene issue has been fixed by replacing the nvidia-ml-py dependency with pynvml (https://github.com/plasma-umass/scalene/issues/378).
I'm currently running into this with the dask-cuda nightly. The environment was created with:
mamba create --name new_env python=3.10
conda activate new_env
mamba install -c conda-forge -c nvidia -c rapidsai-nightly dask-cuda=23.04* cuml=23.04*
2023-02-17 17:18:32,713 - distributed.deploy.spec - WARNING - Cluster closed without starting up
Traceback (most recent call last):
File "/home/cnolet/miniconda3/envs/cuml_2304_021623/lib/python3.10/site-packages/distributed/deploy/spec.py", line 319, in _start
self.scheduler = cls(**self.scheduler_spec.get("options", {}))
File "/home/cnolet/miniconda3/envs/cuml_2304_021623/lib/python3.10/site-packages/distributed/scheduler.py", line 3662, in __init__
ServerNode.__init__(
File "/home/cnolet/miniconda3/envs/cuml_2304_021623/lib/python3.10/site-packages/distributed/core.py", line 348, in __init__
self.monitor = SystemMonitor()
File "/home/cnolet/miniconda3/envs/cuml_2304_021623/lib/python3.10/site-packages/distributed/system_monitor.py", line 96, in __init__
gpu_extra = nvml.one_time()
File "/home/cnolet/miniconda3/envs/cuml_2304_021623/lib/python3.10/site-packages/distributed/diagnostics/nvml.py", line 336, in one_time
"name": _get_name(h),
File "/home/cnolet/miniconda3/envs/cuml_2304_021623/lib/python3.10/site-packages/distributed/diagnostics/nvml.py", line 319, in _get_name
return pynvml.nvmlDeviceGetName(h).decode()
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
# packages in environment at /home/cnolet/miniconda3/envs/cuml_2304_021623:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
appdirs 1.4.4 pyh9f0ad1d_0 conda-forge
arrow-cpp 10.0.1 ha770c72_8_cpu conda-forge
aws-c-auth 0.6.23 h7c1ec98_1 conda-forge
aws-c-cal 0.5.20 ha1c5a7c_4 conda-forge
aws-c-common 0.8.9 h0b41bf4_0 conda-forge
aws-c-compression 0.2.16 h1afc718_1 conda-forge
aws-c-event-stream 0.2.18 h6620826_2 conda-forge
aws-c-http 0.7.3 h33879ea_1 conda-forge
aws-c-io 0.13.14 hf82dcb6_3 conda-forge
aws-c-mqtt 0.8.6 hdd1a3fa_1 conda-forge
aws-c-s3 0.2.3 h5f5417b_3 conda-forge
aws-c-sdkutils 0.1.7 h1afc718_1 conda-forge
aws-checksums 0.1.14 h1afc718_1 conda-forge
aws-crt-cpp 0.18.16 hf9eb7b6_13 conda-forge
aws-sdk-cpp 1.10.57 h063c87b_2 conda-forge
bokeh 2.4.3 pyhd8ed1ab_3 conda-forge
brotlipy 0.7.0 py310h5764c6d_1005 conda-forge
bzip2 1.0.8 h7b6447c_0
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2023.01.10 h06a4308_0
cachetools 5.3.0 pyhd8ed1ab_0 conda-forge
certifi 2022.12.7 py310h06a4308_0
cffi 1.15.1 py310h255011f_3 conda-forge
charset-normalizer 2.1.1 pyhd8ed1ab_0 conda-forge
click 8.1.3 unix_pyhd8ed1ab_2 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
cryptography 39.0.1 py310h34c0648_0 conda-forge
cubinlinker 0.2.0 py310hf09951c_1 rapidsai-nightly
cuda-profiler-api 11.8.86 0 nvidia
cuda-python 11.8.1 py310h01a121a_2 conda-forge
cudatoolkit 11.8.0 h37601d7_11 conda-forge
cudf 23.04.00a cuda_11_py310_230216_g4e32bfe3aa_99 rapidsai-nightly
cuml 23.04.00a cuda11_py310_230216_g3bab4d1f7_71 rapidsai-nightly
cupy 11.5.0 py310h9216885_0 conda-forge
cytoolz 0.12.0 py310h5764c6d_1 conda-forge
dask 2023.2.0 pyhd8ed1ab_0 conda-forge
dask-core 2023.2.0 pyhd8ed1ab_0 conda-forge
dask-cuda 23.04.00a py310_230215_g8134e6b_25 rapidsai-nightly
dask-cudf 23.04.00a cuda_11_py310_230216_g4e32bfe3aa_99 rapidsai-nightly
distributed 2023.2.0 pyhd8ed1ab_0 conda-forge
dlpack 0.5 h9c3ff4c_0 conda-forge
faiss-proc 1.0.0 cuda conda-forge
fastavro 1.7.1 py310h1fa729e_0 conda-forge
fastrlock 0.8 py310hd8f1fbe_3 conda-forge
fmt 9.1.0 h924138e_0 conda-forge
freetype 2.12.1 hca18f0e_1 conda-forge
fsspec 2023.1.0 pyhd8ed1ab_0 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
glog 0.6.0 h6f12383_0 conda-forge
heapdict 1.0.1 py_0 conda-forge
idna 3.4 pyhd8ed1ab_0 conda-forge
jinja2 3.1.2 pyhd8ed1ab_1 conda-forge
joblib 1.2.0 pyhd8ed1ab_0 conda-forge
jpeg 9e h0b41bf4_3 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
krb5 1.20.1 h81ceb04_0 conda-forge
lcms2 2.14 hfd0df8a_1 conda-forge
ld_impl_linux-64 2.38 h1181459_1
lerc 4.0.0 h27087fc_0 conda-forge
libabseil 20220623.0 cxx17_h05df665_6 conda-forge
libarrow 10.0.1 h2c3b227_8_cpu conda-forge
libblas 3.9.0 16_linux64_openblas conda-forge
libbrotlicommon 1.0.9 h166bdaf_8 conda-forge
libbrotlidec 1.0.9 h166bdaf_8 conda-forge
libbrotlienc 1.0.9 h166bdaf_8 conda-forge
libcblas 3.9.0 16_linux64_openblas conda-forge
libcrc32c 1.1.2 h9c3ff4c_0 conda-forge
libcublas 11.11.3.6 0 nvidia
libcublas-dev 11.11.3.6 0 nvidia
libcudf 23.04.00a cuda11_230216_g4e32bfe3aa_99 rapidsai-nightly
libcufft 10.9.0.58 0 nvidia
libcuml 23.04.00a cuda11_230216_g3bab4d1f7_71 rapidsai-nightly
libcumlprims 23.04.00a cuda11_230208_gc3bf2c8_4 rapidsai-nightly
libcurand 10.3.0.86 0 nvidia
libcurand-dev 10.3.0.86 0 nvidia
libcurl 7.88.0 hdc1c0ab_0 conda-forge
libcusolver 11.4.1.48 0 nvidia
libcusolver-dev 11.4.1.48 0 nvidia
libcusparse 11.7.5.86 0 nvidia
libcusparse-dev 11.7.5.86 0 nvidia
libdeflate 1.17 h0b41bf4_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h28343ad_4 conda-forge
libfaiss 1.7.2 cuda112hb18a002_3_cuda conda-forge
libffi 3.4.2 h6a678d5_6
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran-ng 12.2.0 h69a702a_19 conda-forge
libgfortran5 12.2.0 h337968e_19 conda-forge
libgomp 12.2.0 h65d4601_19 conda-forge
libgoogle-cloud 2.7.0 h21dfe5b_1 conda-forge
libgrpc 1.51.1 h4fad500_1 conda-forge
liblapack 3.9.0 16_linux64_openblas conda-forge
libllvm11 11.1.0 he0ac6c6_5 conda-forge
libnghttp2 1.51.0 hff17c54_0 conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libopenblas 0.3.21 pthreads_h78a6416_3 conda-forge
libpng 1.6.39 h753d276_0 conda-forge
libprotobuf 3.21.12 h3eb15da_0 conda-forge
libraft-distance 23.04.00a cuda11_230216_ge14ec63a_64 rapidsai-nightly
libraft-headers 23.04.00a cuda11_230216_ge14ec63a_64 rapidsai-nightly
libraft-nn 23.04.00a cuda11_230216_ge14ec63a_64 rapidsai-nightly
librmm 23.04.00a cuda11_230216_g82e184fe_16 rapidsai-nightly
libsqlite 3.40.0 h753d276_0 conda-forge
libssh2 1.10.0 hf14f497_3 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libthrift 0.16.0 he500d00_2 conda-forge
libtiff 4.5.0 h6adf6a1_2 conda-forge
libutf8proc 2.8.0 h166bdaf_0 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libwebp-base 1.2.4 h166bdaf_0 conda-forge
libxcb 1.13 h7f98852_1004 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
llvmlite 0.39.1 py310h58363a5_1 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
lz4 4.3.2 py310h0cfdcf0_0 conda-forge
lz4-c 1.9.4 hcb278e6_0 conda-forge
markupsafe 2.1.2 py310h1fa729e_0 conda-forge
msgpack-python 1.0.4 py310hbf28c38_1 conda-forge
nccl 2.14.3.1 h0800d71_0 conda-forge
ncurses 6.4 h6a678d5_0
numba 0.56.4 py310ha5257ce_0 conda-forge
numpy 1.23.5 py310h53a5b5f_0 conda-forge
nvtx 0.2.3 py310h5764c6d_2 conda-forge
openjpeg 2.5.0 hfec8fc6_2 conda-forge
openssl 3.0.8 h0b41bf4_0 conda-forge
orc 1.8.2 hfdbbad2_2 conda-forge
packaging 23.0 pyhd8ed1ab_0 conda-forge
pandas 1.5.3 py310h9b08913_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
partd 1.3.0 pyhd8ed1ab_0 conda-forge
pillow 9.4.0 py310h023d228_1 conda-forge
pip 22.3.1 py310h06a4308_0
pooch 1.6.0 pyhd8ed1ab_0 conda-forge
protobuf 4.21.12 py310heca2aa9_0 conda-forge
psutil 5.9.4 py310h5764c6d_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptxcompiler 0.7.0 py310h01a121a_3 conda-forge
pyarrow 10.0.1 py310h633f555_8_cpu conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pylibraft 23.04.00a cuda11_py310_230216_ge14ec63a_64 rapidsai-nightly
pynvml 11.5.0 pyhd8ed1ab_0 conda-forge
pyopenssl 23.0.0 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.10.9 he550d4f_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.10 3_cp310 conda-forge
pytz 2022.7.1 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 py310h5764c6d_5 conda-forge
raft-dask 23.04.00a cuda11_py310_230216_ge14ec63a_64 rapidsai-nightly
re2 2023.02.01 hcb278e6_0 conda-forge
readline 8.2 h5eee18b_0
requests 2.28.2 pyhd8ed1ab_0 conda-forge
rmm 23.04.00a cuda11_py310_230216_g82e184fe_16 rapidsai-nightly
s2n 1.3.35 h3358134_0 conda-forge
scipy 1.10.0 py310h8deb116_2 conda-forge
setuptools 65.6.3 py310h06a4308_0
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.9 hbd366e4_2 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
spdlog 1.11.0 h9b3ece8_1 conda-forge
sqlite 3.40.1 h5082296_0
tblib 1.7.0 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h1ccaba5_0
toolz 0.12.0 pyhd8ed1ab_0 conda-forge
tornado 6.2 py310h5764c6d_1 conda-forge
treelite 3.1.0 py310h168469b_0 conda-forge
treelite-runtime 3.1.0 pypi_0 pypi
typing_extensions 4.4.0 pyha770c72_0 conda-forge
tzdata 2022g h04d1e81_0
ucx 1.13.1 h538f049_1 conda-forge
ucx-proc 1.0.0 gpu conda-forge
ucx-py 0.31.00a230203 py310_g3806c64_4 rapidsai-nightly
urllib3 1.26.14 pyhd8ed1ab_0 conda-forge
wheel 0.37.1 pyhd3eb1b0_0
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xz 5.2.10 h5eee18b_1
yaml 0.2.5 h7f98852_2 conda-forge
zict 2.2.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 h166bdaf_4 conda-forge
zstd 1.5.2 h3eb15da_6 conda-forge
I get this error when trying to start the cluster with LocalCUDACluster.
Following up on my previous reply, it looks like downgrading to pynvml 11.4.1 works. I'll do that for now.
This was fixed in https://github.com/dask/distributed/pull/7544, but I think for that you need a distributed nightly as well. So: mamba install -c conda-forge -c nvidia -c rapidsai-nightly -c dask/label/dev dask-cuda=23.04*. I am not sufficiently au fait with conda to know how to specify this directly as a dep in dask-cuda.
Thanks for following up here @cjnolet @wence-. Going to close this issue via https://github.com/dask/distributed/pull/7544 -- let me know if it should be re-opened
What happened:
When running "dask-scheduler" from inside a docker container, I get the following error:
Environment: