Closed B3nihana closed 10 months ago
Did you install the dependencies as per the instructions?
Specifically
# You may also need to install dependencies since those aren't packaged into the wheel
sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml
I did follow the instructions, but I just checked again on the machines with this issue and running the command you pulled out results in:
Requirement already satisfied: grpcio in /opt/datadog-agent/embedded/lib/python3.9/site-packages (1.59.2)
Requirement already satisfied: pynvml in /opt/datadog-agent/embedded/lib/python3.9/site-packages (11.5.0)
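For anyone comparing machines: pip's "Requirement already satisfied" only confirms that some version is present. A minimal sketch for printing the installed versions from Python itself, assuming it is run with the agent's embedded interpreter (presumably a `python3` next to the `pip3` used above; that path is an assumption). The helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this interpreter's environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # grpcio and pynvml are the two dependencies named in the install step.
    for name, version in installed_versions(["grpcio", "pynvml"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it with the system Python instead of the agent's embedded one would report on a different environment, so the result is only meaningful for the interpreter that actually runs the check.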
Can you try running the check manually with debug log output? Like this:
agent check --log-level DEBUG nvml
It should be possible to see a traceback towards the end of the output that might help us figure out what the actual error is.
Here is the output. nvml-debug.txt
Interestingly, the error is now different (I've updated the Datadog agent to 1.7.50 and pip3 to 23.3.2, but made no other changes).
Error: module 'pynvml' has no attribute 'nvmlDeviceGetComputeRunningProcesses_v2'
Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/base.py", line 1235, in run
    self.check(instance)
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/nvml/nvml.py", line 103, in check
    self.gather(instance)
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/nvml/nvml.py", line 116, in gather
    self.gather_gpu(handle, tags)
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/nvml/nvml.py", line 175, in gather_gpu
    compute_running_processes = NvmlCheck.N.nvmlDeviceGetComputeRunningProcesses_v2(handle)
AttributeError: module 'pynvml' has no attribute 'nvmlDeviceGetComputeRunningProcesses_v2'
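The traceback means the check calls a function that the installed `pynvml` module does not expose. A small sketch for confirming that outside the agent, assuming it is run with the same embedded Python that runs the check. The helper name `missing_attributes` is hypothetical; the function name being probed is taken from the error above.

```python
import importlib


def missing_attributes(module_name, required):
    """Return the names in `required` that the imported module does not define."""
    module = importlib.import_module(module_name)
    return [name for name in required if not hasattr(module, name)]


if __name__ == "__main__":
    # The attribute the nvml check failed on, per the AttributeError above.
    required = ["nvmlDeviceGetComputeRunningProcesses_v2"]
    try:
        missing = missing_attributes("pynvml", required)
    except ImportError:
        print("pynvml is not importable from this interpreter")
    else:
        print("missing:", missing if missing else "none")
```

A non-empty "missing" list points at a mismatch between the integration version and the installed `pynvml`, rather than at a missing dependency.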
Ah, I was checking the Datadog NVML Integration Release Notes which still show 1.0.8 as the latest version.
After updating to 1.0.9, the check status comes back as OK. There is no data in the dashboard yet, but I'm happy to close the issue and re-open it if it still doesn't work.
Thanks for the help!
Describe the results you expected: Metrics on the GPUs
Additional information you deem important (e.g. issue happens only occasionally): The integration worked previously, and changing the version has helped in the past, but not recently. The issue also reproduces across several machines.