dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

PyNVML import slowdown? #6247

Open mrocklin opened 2 years ago

mrocklin commented 2 years ago

I was looking through a flaky test report and saw this:

--------------------------- Subprocess stdout/stderr---------------------------
Traceback (most recent call last):
  File "/Users/runner/miniconda3/envs/dask-distributed/bin/dask-worker", line 33, in <module>
    sys.exit(load_entry_point('distributed', 'console_scripts', 'dask-worker')())
  File "/Users/runner/miniconda3/envs/dask-distributed/bin/dask-worker", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/runner/work/distributed/distributed/distributed/__init__.py", line 10, in <module>
    from distributed.actor import Actor, ActorFuture, BaseActorFuture
  File "/Users/runner/work/distributed/distributed/distributed/actor.py", line 14, in <module>
    from distributed.client import Future
  File "/Users/runner/work/distributed/distributed/distributed/client.py", line 54, in <module>
    from distributed import cluster_dump, preloading
  File "/Users/runner/work/distributed/distributed/distributed/preloading.py", line 19, in <module>
    from distributed.core import Server
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 29, in <module>
    from distributed.comm import (
  File "/Users/runner/work/distributed/distributed/distributed/comm/__init__.py", line 46, in <module>
    _register_transports()
  File "/Users/runner/work/distributed/distributed/distributed/comm/__init__.py", line 41, in _register_transports
    from distributed.comm import ucx
  File "/Users/runner/work/distributed/distributed/distributed/comm/ucx.py", line 28, in <module>
    from distributed.diagnostics.nvml import has_cuda_context
  File "/Users/runner/work/distributed/distributed/distributed/diagnostics/nvml.py", line 7, in <module>
    import pynvml
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/pynvml/__init__.py", line 1, in <module>
    from .nvml import *
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/pynvml/nvml.py", line 33, in <module>
    from ctypes.util import find_library
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 914, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1407, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1376, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1345, in _path_importer_cache
KeyboardInterrupt

This came from a test in which the worker never came up, which happens in our test suite sometimes for various reasons. NVML is unusual enough that I thought I'd raise this in case it is a real issue and a possible cause of some of the flakiness. cc @quasiben @jakirkham @pentschev
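A single KeyboardInterrupt traceback only shows where the process happened to be when it was interrupted, so one way to check whether the import itself is slow is to time it in a fresh interpreter. The sketch below is only an illustration, not something from the test suite: the helper name import_time_report is hypothetical, and it relies on CPython's -X importtime option (available since Python 3.7).

```python
import subprocess
import sys


def import_time_report(module):
    """Import `module` in a fresh interpreter and return the -X importtime log.

    Hypothetical helper for illustration; not part of distributed.
    """
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    # CPython writes the per-module timing table to stderr.
    return result.stderr


if __name__ == "__main__":
    # ctypes.util is the module being imported in the last frames above.
    print(import_time_report("ctypes.util"))
```

Running the same helper with "pynvml" or "distributed" (where installed) would show how much of the total import time each nested module accounts for.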

pentschev commented 2 years ago

Interestingly, the last import in the call stack is from ctypes.util import find_library, a standard-library import that PyNVML simply happens to use. Has this happened elsewhere, or is this the first time? At first glance it looks like a coincidence that the interrupt landed in PyNVML, but if this is a repeating pattern then there may be an issue there. PyNVML shouldn't be doing anything special at import time. AFAIK even loading the shared libraries is delayed until nvmlInit() is called, which is incidentally where we disable NVML diagnostics if CUDA isn't available: https://github.com/dask/distributed/blob/70e1fca41e341f937320a00de3bf70ff8d45d1c7/distributed/diagnostics/nvml.py#L35-L42
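The deferred-initialisation idea described above can be sketched roughly as follows. This is a minimal illustration and not the code at the linked lines: the class LazyNVML and its attributes are hypothetical names, and the only PyNVML calls assumed are nvmlInit() and nvmlDeviceGetCount().

```python
import importlib


class LazyNVML:
    """Defer importing and initialising NVML bindings until first use.

    Hypothetical wrapper for illustration; not the distributed implementation.
    """

    def __init__(self, module_name="pynvml"):
        self.module_name = module_name
        self._module = None    # set after a successful nvmlInit()
        self.disabled = False  # set if the import or nvmlInit() fails

    def _ensure_init(self):
        if self._module is None and not self.disabled:
            try:
                module = importlib.import_module(self.module_name)
                module.nvmlInit()  # shared library is loaded here, not at import
            except Exception:
                # Missing package, missing driver, or no usable GPU:
                # turn diagnostics off instead of failing the caller.
                self.disabled = True
            else:
                self._module = module
        return self._module

    def device_count(self):
        nvml = self._ensure_init()
        return 0 if nvml is None else nvml.nvmlDeviceGetCount()
```

With this shape, constructing LazyNVML() costs nothing at import time, and a machine without the package or the driver simply reports zero devices on the first call to device_count().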