Open mrocklin opened 2 years ago
Interestingly, the last call in the call stack is `from ctypes.util import find_library`, which is just used by PyNVML. Has this happened elsewhere, or is this the first time? At first glance it seems to be just a coincidence that it's happening in PyNVML, but if this is a repeating pattern then there may be some issue there. PyNVML shouldn't be doing anything special at import time; AFAIK even loading shared libraries is delayed until `nvmlInit()` is called, which is incidentally where we disable NVML diagnostics if CUDA isn't available: https://github.com/dask/distributed/blob/70e1fca41e341f937320a00de3bf70ff8d45d1c7/distributed/diagnostics/nvml.py#L35-L42
I was looking through a flaky test report and saw this:
This was in a test for a worker where the worker never came up. This happens in our test suite sometimes for various reasons. NVML is weird/suspicious enough that I thought I'd raise this to see whether it could be an issue at all, and maybe a cause of some flakiness. cc @quasiben @jakirkham @pentschev