dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

UCX cluster shutdown errors when GIL contention monitoring enabled #7815

Open charlesbluca opened 1 year ago

charlesbluca commented 1 year ago

Describe the issue: When shutting down a UCX cluster with GIL contention monitoring enabled (i.e. gilknocker is installed and distributed.admin.system-monitor.gil.enabled=true), we get some worker errors of the form:

2023-05-01 12:46:05,381 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/nfs/charlesb/dev/distributed/main/distributed/comm/ucx.py", line 349, in read
    await self.ep.recv(msg)
  File "/datasets/charlesb/mambaforge/envs/distributed-gpuci-py310/lib/python3.10/site-packages/ucp/core.py", line 725, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp._libs.exceptions.UCXCanceled: <[Recv #183] ep: 0x7fd3f40730c0, tag: 0x42adb052da7b4792, nbytes: 16, type: <class 'numpy.ndarray'>>: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nfs/charlesb/dev/distributed/main/distributed/worker.py", line 1237, in heartbeat
    response = await retry_operation(
  File "/home/nfs/charlesb/dev/distributed/main/distributed/utils_comm.py", line 434, in retry_operation
    return await retry(
  File "/home/nfs/charlesb/dev/distributed/main/distributed/utils_comm.py", line 413, in retry
    return await coro()
  File "/home/nfs/charlesb/dev/distributed/main/distributed/core.py", line 1269, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/nfs/charlesb/dev/distributed/main/distributed/core.py", line 1028, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/nfs/charlesb/dev/distributed/main/distributed/utils.py", line 754, in wrapper
    return await func(*args, **kwargs)
  File "/home/nfs/charlesb/dev/distributed/main/distributed/comm/ucx.py", line 367, in read
    raise CommClosedError(
distributed.comm.core.CommClosedError: Connection closed by writer.
Inner exception: UCXCanceled("<[Recv #183] ep: 0x7fd3f40730c0, tag: 0x42adb052da7b4792, nbytes: 16, type: <class 'numpy.ndarray'>>: ")

Minimal Complete Verifiable Example:

In an environment with UCX-Py and gilknocker:

from distributed import LocalCluster

cluster = LocalCluster(protocol="ucx")
cluster.close()
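
For clarity, here is a slightly expanded version of the repro that sets the relevant config key explicitly via dask.config before starting the cluster. This is just a sketch using the distributed.admin.system-monitor.gil.enabled key mentioned above; setting it explicitly should be redundant when gilknocker is installed, since the setting appears to default to on.

import dask
from distributed import LocalCluster

# Explicitly enable GIL contention monitoring; with gilknocker installed this
# should already be the default, so this line only makes the repro explicit.
dask.config.set({"distributed.admin.system-monitor.gil.enabled": True})

# Starting and immediately closing a UCX cluster is enough to surface the
# CommClosedError/UCXCanceled worker errors during shutdown in my environment.
cluster = LocalCluster(protocol="ucx")
cluster.close()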

Anything else we need to know?: These worker errors seem to be the cause of (or at least directly related to) the errors that cropped up in the UCX tests running on GPU; I've worked around the failing tests for now by manually disabling GIL contention monitoring for the GPU tests:

https://github.com/dask/distributed/blob/4bd2ba7ec4e4a4669c13dc1e62820191664783f4/continuous_integration/gpuci/build.sh#L26-L27

Ideally, though, it would be nice to get to the root of this issue and remove the override.
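
For reference, the override amounts to turning that same config key off before the workers start. The CI script does this through an environment variable (see the linked lines); a rough Python equivalent would be something like the following sketch:

import dask

# Sketch of the workaround: disable GIL contention monitoring so that UCX
# cluster shutdown no longer runs into the heartbeat errors shown above.
dask.config.set({"distributed.admin.system-monitor.gil.enabled": False})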

Environment:

jrbourbeau commented 1 year ago

cc @milesgranger for visibility

milesgranger commented 1 year ago

Came to this after the thread in the PR. As mentioned in the comment there, I haven't seen this specifically, and I thought the "no python frame" errors were mostly taken care of, so maybe this is entirely separate. Since I have a good dev environment for gilknocker and a CUDA/GPU setup locally, I think I'm in a decent position to try to debug this; I'll put it on the to-do list. :)

milesgranger commented 1 year ago

@charlesbluca I'm having a little trouble reproducing it. Is there a specific conda environment file I could use? This is what I'm currently using (and I've verified that GIL monitoring is enabled):

ucx                       1.12.0+gd367332      cuda11.2_0    rapidsai
ucx-proc                  1.0.0                       gpu    rapidsai
ucx-py                    0.25.00         py39_gd367332_0    rapidsai
gilknocker                0.4.1                    pypi_0    pypi

charlesbluca commented 1 year ago

That's good to know. I recreated the failures that were happening with the GPU CI environment, so that would be using the latest nightly ucx-py:

ucx                       1.14.0               h8c404fb_1    conda-forge
ucx-proc                  1.0.0                       gpu    rapidsai
ucx-py                    0.32.00a        py310_230428_g7861326_5    rapidsai-nightly
gilknocker                0.4.1           py310hcb5633a_2    conda-forge

I will try out older versions of ucx-py to see if this issue is exclusive to the newest batch of nightlies; in the meantime, this is the environment file I used (a patched version of the 3.10 CI environment):

name: dask-distributed
channels:
  - rapidsai
  - rapidsai-nightly
  - conda-forge
  - nvidia
  - defaults
dependencies:
  - python=3.10
  - packaging
  - pip
  - asyncssh
  - bokeh
  - click
  - cloudpickle
  - coverage
  - dask  # overridden by git tip below
  - filesystem-spec  # overridden by git tip below
  - gilknocker>=0.4.0
  - h5py
  - ipykernel <6.22.0  # https://github.com/dask/distributed/issues/7688
  - ipywidgets <8.0.5  # https://github.com/dask/distributed/issues/7688
  - jinja2 >=2.10.3
  - locket >=1.0
  - msgpack-python
  - netcdf4
  - paramiko
  - pre-commit
  - prometheus_client
  - psutil
  - pyarrow>=7
  - pytest
  - pytest-cov
  - pytest-faulthandler
  - pytest-repeat
  - pytest-rerunfailures
  - pytest-timeout
  - requests
  - s3fs  # overridden by git tip below
  - scikit-learn
  - scipy
  - sortedcollections
  - tblib
  - toolz
  - tornado >=6.2
  - zict  # overridden by git tip below
  - zstandard >=0.9.0
  - pip:
      - git+https://github.com/dask/dask
      - git+https://github.com/dask/s3fs
      - git+https://github.com/dask/zict
      - git+https://github.com/fsspec/filesystem_spec
      - keras
  # RAPIDS dependencies
  - cudatoolkit=11.8
  - cudf=23.06
  - numpy>=1.20.1
  - cupy
  - pynvml
  - ucx-proc=*=gpu
  - ucx-py=0.32

EDIT:

Interestingly, I'm able to reproduce this in a more minimal environment with ucx-py 0.25:

name: distributed-ucx
channels:
  - rapidsai
  - conda-forge
dependencies:
  - distributed
  - ucx-py=0.25
  - ucx-proc=*=gpu
  - pynvml
  - gilknocker

Could it be that the teardown issues are hardware-dependent?

milesgranger commented 1 year ago

I quickly tried both environment files you provided, and neither produced any trouble for me.

Could it be that the teardown issues are hardware-dependent?

Maybe? I'm on Fedora 38 x86_64 with NVIDIA driver 530 and CUDA 11.8, if that helps. :man_shrugging: Sorry I couldn't be of more help here.