charlesbluca opened 1 year ago
cc @milesgranger for visibility
Came to this from the thread in the PR. As mentioned in the comment there, I haven't seen this specifically, and I thought the "no python frame" errors were mostly taken care of, so maybe this is entirely separate. Since I have a good dev environment for gilknocker and CUDA/GPU locally, I think I'm in a decent position to try to debug this; will put it on the todo list. :)
@charlesbluca I'm having a little trouble reproducing it. Is there a specific conda environment file I could use? Currently I have the following (and have verified that GIL monitoring is enabled):
ucx 1.12.0+gd367332 cuda11.2_0 rapidsai
ucx-proc 1.0.0 gpu rapidsai
ucx-py 0.25.00 py39_gd367332_0 rapidsai
gilknocker 0.4.1 pypi_0 pypi
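For reference, a quick way to double-check that the monitoring is actually switched on and that gilknocker is importable is something like the sketch below; the config key is the one named in this issue.

```python
import dask
import distributed  # importing distributed registers its config defaults

# Should print True when GIL contention monitoring is enabled.
print(dask.config.get("distributed.admin.system-monitor.gil.enabled"))

# gilknocker also needs to be importable for the monitor to engage.
try:
    import gilknocker  # noqa: F401
    print("gilknocker is importable")
except ImportError:
    print("gilknocker is not installed")
```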
That's good to know. I recreated the failures that were happening with the GPU CI environment, so that would be using the latest nightly ucx-py:
ucx 1.14.0 h8c404fb_1 conda-forge
ucx-proc 1.0.0 gpu rapidsai
ucx-py 0.32.00a py310_230428_g7861326_5 rapidsai-nightly
gilknocker 0.4.1 py310hcb5633a_2 conda-forge
I will try out older versions of ucx-py to see if this issue is exclusive to the newest batch of nightlies; in the meantime, this is the environment file I used (a patched version of the 3.10 CI environment):
name: dask-distributed
channels:
- rapidsai
- rapidsai-nightly
- conda-forge
- nvidia
- defaults
dependencies:
- python=3.10
- packaging
- pip
- asyncssh
- bokeh
- click
- cloudpickle
- coverage
- dask # overridden by git tip below
- filesystem-spec # overridden by git tip below
- gilknocker>=0.4.0
- h5py
- ipykernel <6.22.0 # https://github.com/dask/distributed/issues/7688
- ipywidgets <8.0.5 # https://github.com/dask/distributed/issues/7688
- jinja2 >=2.10.3
- locket >=1.0
- msgpack-python
- netcdf4
- paramiko
- pre-commit
- prometheus_client
- psutil
- pyarrow>=7
- pytest
- pytest-cov
- pytest-faulthandler
- pytest-repeat
- pytest-rerunfailures
- pytest-timeout
- requests
- s3fs # overridden by git tip below
- scikit-learn
- scipy
- sortedcollections
- tblib
- toolz
- tornado >=6.2
- zict # overridden by git tip below
- zstandard >=0.9.0
- pip:
  - git+https://github.com/dask/dask
  - git+https://github.com/dask/s3fs
  - git+https://github.com/dask/zict
  - git+https://github.com/fsspec/filesystem_spec
  - keras
# RAPIDS dependencies
- cudatoolkit=11.8
- cudf=23.06
- numpy>=1.20.1
- cupy
- pynvml
- ucx-proc=*=gpu
- ucx-py=0.32
EDIT:
Interestingly, I'm able to reproduce this in a more minimal environment with ucx-py 0.25:
name: distributed-ucx
channels:
- rapidsai
- conda-forge
dependencies:
- distributed
- ucx-py=0.25
- ucx-proc=*=gpu
- pynvml
- gilknocker
Could it be possible that the teardown issues are hardware-dependent?
I quickly tried both environment files you provided, and neither produced any trouble for me.
Could it be possible that the teardown issues are hardware-dependent?
Maybe? I'm on Fedora 38 x86_64, NVIDIA driver 530, CUDA 11.8, if that helps. :man_shrugging: Sorry I couldn't be of more help here.
Describe the issue: When shutting down a UCX cluster with GIL contention monitoring enabled (i.e. gilknocker is installed and distributed.admin.system-monitor.gil.enabled=true), we get some worker errors of the form:

Minimal Complete Verifiable Example:
In an environment with UCX-Py and gilknocker:
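A rough sketch of the kind of reproducer involved, assuming dask_cuda's LocalCUDACluster is used to stand up the UCX-backed cluster (not necessarily the exact script from the report):

```python
import dask
from distributed import Client
from dask_cuda import LocalCUDACluster  # assumes dask_cuda and a CUDA GPU are available

# Enable GIL contention monitoring (requires gilknocker to be installed).
dask.config.set({"distributed.admin.system-monitor.gil.enabled": True})

if __name__ == "__main__":
    # Start a UCX-backed cluster, then tear it down; the worker errors
    # described above show up during shutdown.
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
    client.close()
    cluster.close()
```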
Anything else we need to know?: These worker errors seem to be the cause of, or at least directly related to, errors that cropped up in the UCX tests running on GPU; I've worked around the failures for now by manually disabling GIL contention monitoring for the GPU tests:
https://github.com/dask/distributed/blob/4bd2ba7ec4e4a4669c13dc1e62820191664783f4/continuous_integration/gpuci/build.sh#L26-L27
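In Python terms the override roughly amounts to the following (a sketch; the actual build.sh lines are at the link above):

```python
import dask

# Switch GIL contention monitoring off so gilknocker is never engaged
# during the GPU test runs.
dask.config.set({"distributed.admin.system-monitor.gil.enabled": False})
```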
But ideally it would be nice to get to the root of this issue and remove this override.
Environment: