[Open] jrbourbeau opened this issue 3 years ago
I can see why this test would fail occasionally, as (unless I'm missing something) we can't perfectly guarantee that the results of pynvml.nvmlDeviceGetUtilizationRates(h).gpu will match up with the gpu_utilization we get from the monitor at a different point in time.

If we're comfortable with relaxing this test (and potentially also test_gpu_metrics), we could instead check that the results of the monitor query are non-null, i.e. that data is being collected; this is the check being done by test_gpu_monitoring_range_query, which seems less failure-prone. Otherwise, we could mark this test as flaky.
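For illustration, a minimal sketch of what the relaxed check could look like (the helper name, the monitor.recent() call, and the metric key names are assumptions based on this discussion, not the actual test code):

```python
def assert_gpu_metrics_collected(monitor):
    # Relaxed check: only assert that GPU metrics are being collected,
    # rather than comparing against a separate pynvml query taken at a
    # slightly different point in time.
    metrics = monitor.recent()
    assert metrics.get("gpu_utilization") is not None
    assert metrics.get("gpu_memory_used") is not None
```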
cc @rjzamora (in case you have ideas here; no worries if not)
we can't perfectly guarantee that the results of pynvml.nvmlDeviceGetUtilizationRates(h).gpu will match up with the gpu_utilization we get from the monitor at a different point in time
In other tests where we make checks against system metrics, we stop the system monitor periodic callback and then manually call monitor.update() to make sure we have full control over when metrics are gathered (e.g. see the linked test below). I've not looked deeply into this test or pynvml, so this approach may or may not be relevant here, but I thought it was worth mentioning.
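Roughly, that pattern could look like the following hedged sketch (the periodic_callbacks["monitor"] key, the single-worker gen_cluster setup, and the tolerance are assumptions, not the real test):

```python
import pytest

pynvml = pytest.importorskip("pynvml")

from distributed.utils_test import gen_cluster


@gen_cluster(nthreads=[("127.0.0.1", 1)])
async def test_gpu_utilization_manual_update(s, a):
    # Stop the background sampling so we control exactly when metrics
    # are gathered.
    a.periodic_callbacks["monitor"].stop()

    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Take the monitor sample and the pynvml reading back-to-back so
    # they are more likely to agree.
    a.monitor.update()
    expected = pynvml.nvmlDeviceGetUtilizationRates(h).gpu

    # Utilization can still drift between the two calls, so compare
    # with a tolerance rather than exact equality.
    assert abs(a.monitor.recent()["gpu_utilization"] - expected) <= 10
```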
That might be useful here. I imagine that if we have the monitor update around the same time we grab the relevant metric from pynvml, there's a better chance that they'll match up, though I'm not sure that would necessarily prevent failure here.
Shot in the dark, but I've observed that most of these GPU metrics tend to be a little noisy on the first 1-2 monitor updates before stabilizing, so maybe sleeping for 1 second would resolve this?
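A small sketch of that idea (the helper, the one-second delay, and the sample count are purely illustrative assumptions):

```python
import asyncio


async def let_gpu_metrics_settle(monitor, delay=1.0, samples=2):
    # Take a couple of throwaway samples before asserting on GPU
    # metrics, since the first 1-2 readings tend to be noisy.
    for _ in range(samples):
        await asyncio.sleep(delay)
        monitor.update()
```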
I've just observed this in https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/854/CUDA_VER=11.2,LINUX_VER=ubuntu18.04,PYTHON_VER=3.8,RAPIDS_VER=21.12/console, as well as a couple of times locally.
Shot in the dark, but I've observed that most of these GPU metrics tend to be a little noisy on the first 1-2 monitor updates before stabilizing, so maybe sleeping for 1 second would resolve this?
Before looking at this thread, I was thinking the same. But perhaps marking it flaky achieves the same result in a seemingly cleaner way? E.g., we can set up retries and a delay, as in https://github.com/dask/distributed/blob/1721d62073b695dd318d869c7a138d3cc05e8ae1/distributed/comm/tests/test_ucx_config.py#L83-L85.
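For reference, the linked test uses roughly this pattern (the exact rerun count and delay here are assumptions; pytest.mark.flaky with these arguments comes from the pytest-rerunfailures plugin):

```python
import pytest


@pytest.mark.flaky(reruns=10, reruns_delay=5)
def test_gpu_monitoring_recent():
    ...  # original test body unchanged; reruns absorb occasional mismatches
```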
I opened https://github.com/dask/distributed/pull/5540 to try and address this test by marking it as flaky, as per my previous comment.
We observed distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_recent fail in this gpuCI build over in https://github.com/dask/distributed/pull/5242.

cc @charlesbluca, who has experience with this part of the codebase.