dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

`test_gpu_monitoring_recent` flaky #5287

Open jrbourbeau opened 3 years ago

jrbourbeau commented 3 years ago

We observed distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_recent fail in a gpuCI build over in https://github.com/dask/distributed/pull/5242

cc @charlesbluca who has experience with this part of the codebase

12:46:45 =================================== FAILURES ===================================
12:46:45 __________________________ test_gpu_monitoring_recent __________________________
12:46:45 
12:46:45 s = <Scheduler: "tcp://127.0.0.1:38311" workers: 0 cores: 0, tasks: 0>
12:46:45 a = <Worker: 'tcp://127.0.0.1:34401', 0, Status.closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
12:46:45 b = <Worker: 'tcp://127.0.0.1:41111', 1, Status.closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>
12:46:45 
12:46:45     @gen_cluster()
12:46:45     async def test_gpu_monitoring_recent(s, a, b):
12:46:45         if nvml.device_get_count() < 1:
12:46:45             pytest.skip("No GPUs available")
12:46:45 
12:46:45         h = nvml._pynvml_handles()
12:46:45         res = await s.get_worker_monitor_info(recent=True)
12:46:45 
12:46:45 >       assert (
12:46:45             res[a.address]["range_query"]["gpu_utilization"]
12:46:45             == pynvml.nvmlDeviceGetUtilizationRates(h).gpu
12:46:45         )
12:46:45 E       assert 2 == 0
12:46:45 E         +2
12:46:45 E         -0
12:46:45 
12:46:45 distributed/diagnostics/tests/test_nvml.py:97: AssertionError
charlesbluca commented 3 years ago

I can see why this test would fail occasionally, as (unless I'm missing something) we can't perfectly guarantee that the results of pynvml.nvmlDeviceGetUtilizationRates(h).gpu will match up with the gpu_utilization we get from the monitor at a different point in time.

If we're comfortable with relaxing this test (and potentially also test_gpu_metrics), we could instead check that the results of the monitor query are non-null, i.e. that data is being collected; this is the check that's being done by test_gpu_monitoring_range_query, which seems less failure-prone. Otherwise, we could mark this test as flaky.
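For illustration, a minimal sketch of what such a relaxed check might look like; the helper names (nvml.device_get_count, s.get_worker_monitor_info) and the result layout are taken from the failing test above, while the non-null assertion is just one way of expressing "data is being collected", not the actual test_gpu_monitoring_range_query code:

import pytest

from distributed.diagnostics import nvml
from distributed.utils_test import gen_cluster


@gen_cluster()
async def test_gpu_monitoring_recent_relaxed(s, a, b):  # hypothetical name
    if nvml.device_get_count() < 1:
        pytest.skip("No GPUs available")

    res = await s.get_worker_monitor_info(recent=True)

    # Only check that GPU utilization data is being collected at all, rather
    # than comparing against a fresh pynvml reading taken at a different
    # point in time.
    for w in (a, b):
        assert res[w.address]["range_query"]["gpu_utilization"] is not None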

jakirkham commented 3 years ago

cc @rjzamora (in case you have ideas here; no worries if not)

jrbourbeau commented 3 years ago

we can't perfectly guarantee that the results of pynvml.nvmlDeviceGetUtilizationRates(h).gpu will match up with the gpu_utilization we get from the monitor at a different point in time

In other tests where we check against system metrics, we stop the system monitor's periodic callback and then manually call monitor.update() so we have full control over when metrics are gathered (e.g. see the linked test below). I haven't looked deeply into this test or pynvml, so this approach may or may not be relevant here, but I thought it was worth mentioning.

https://github.com/dask/distributed/blob/9c92a61ef69df212e56f8bccb380cb872286ab19/distributed/dashboard/tests/test_scheduler_bokeh.py#L493-L512
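To make that concrete, here's a rough sketch of how the same idea could be applied to this test; the "monitor" key in periodic_callbacks and the assumption that a single manual monitor.update() refreshes the GPU metric are guesses on my part, not something I've verified against the linked test:

import pynvml
import pytest

from distributed.diagnostics import nvml
from distributed.utils_test import gen_cluster


@gen_cluster()
async def test_gpu_monitoring_recent_controlled(s, a, b):  # hypothetical name
    if nvml.device_get_count() < 1:
        pytest.skip("No GPUs available")

    # Stop the background system-monitor callbacks so metrics are only
    # gathered when we explicitly trigger them.
    for w in (a, b):
        w.periodic_callbacks["monitor"].stop()

    h = nvml._pynvml_handles()

    # Take the monitor sample and the pynvml reading as close together as
    # possible, so the two values have a better chance of matching.
    a.monitor.update()
    expected = pynvml.nvmlDeviceGetUtilizationRates(h).gpu

    res = await s.get_worker_monitor_info(recent=True)
    assert res[a.address]["range_query"]["gpu_utilization"] == expected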

charlesbluca commented 3 years ago

That might be useful here. I imagine that if the monitor updates around the same time we grab the relevant metric from pynvml, there's a better chance they'll match up, though I'm not sure that would necessarily prevent failures here.

Shot in the dark, but I've observed that most of these GPU metrics tend to be a little noisy on the first 1-2 monitor updates before stabilizing, so maybe sleeping for 1 second would resolve this?

pentschev commented 2 years ago

I've just observed this in https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/854/CUDA_VER=11.2,LINUX_VER=ubuntu18.04,PYTHON_VER=3.8,RAPIDS_VER=21.12/console, as well as a couple of times locally.

Shot in the dark, but I've observed that most of these GPU metrics tend to be a little noisy on the first 1-2 monitor updates before stabilizing, so maybe sleeping for 1 second would resolve this?

Before looking at this thread, I was thinking the same. But perhaps marking it flaky achieves the same result in what seems like a cleaner way? E.g., we can set up retries and a delay, as in https://github.com/dask/distributed/blob/1721d62073b695dd318d869c7a138d3cc05e8ae1/distributed/comm/tests/test_ucx_config.py#L83-L85.
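For reference, with pytest-rerunfailures (which provides the flaky marker used in the linked UCX test) that would look something like the following; the reruns/reruns_delay values here are placeholders, not a recommendation:

import pytest

from distributed.utils_test import gen_cluster


# Rerun the test a few times, waiting between attempts, before reporting it
# as failed. Values are placeholders.
@pytest.mark.flaky(reruns=10, reruns_delay=5)
@gen_cluster()
async def test_gpu_monitoring_recent(s, a, b):
    ...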

pentschev commented 2 years ago

I opened https://github.com/dask/distributed/pull/5540 to try to address this by marking the test as flaky, as per my previous comment.