DataDog / integrations-extras

Community developed integrations and plugins for the Datadog Agent.
BSD 3-Clause "New" or "Revised" License

Unable to start NVML integration #2382

Open Julia-elsammak opened 6 months ago

Julia-elsammak commented 6 months ago

Output of the info page

When installing the NVML integration, I get the following error:

Loading Errors

nvml
----
  Core Check Loader:
    Check nvml not found in Catalog

  JMX Check Loader:
    check is not a jmx check, or unable to determine if it's so

  Python Check Loader:
    unable to import module 'nvml': No module named 'nvml'

Looking at the debug logs

2024-05-11 18:18:54 CST | CORE | DEBUG | (pkg/collector/python/loader.go:158 in Load) | Unable to load python module - datadog_checks.nvml: unable to import module 'datadog_checks.nvml': Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/nvml/__init__.py", line 5, in <module>
    from .nvml import NvmlCheck
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/nvml/nvml.py", line 16, in <module>
    from .api_pb2 import ListPodResourcesRequest
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/nvml/api_pb2.py", line 25, in <module>
    _LISTPODRESOURCESREQUEST = _descriptor.Descriptor(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/google/protobuf/descriptor.py", line 296, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
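To confirm that an environment falls into the case the error message describes, the installed protobuf version can be compared against the 3.20.x line the message mentions. This is a minimal sketch, not part of the integration: `rejects_legacy_stubs` is a hypothetical helper name, and the 3.20 cutoff is taken only from the error text above. To inspect the Agent's environment it would have to be run with the Agent's embedded Python interpreter.

```python
from importlib import metadata


def rejects_legacy_stubs(protobuf_version: str) -> bool:
    """Return True if this protobuf release is newer than the 3.20.x line.

    Assumption: per the error message, 3.20.x or lower still accepts
    _pb2.py files generated by an old protoc; anything newer does not.
    """
    major, minor = (int(part) for part in protobuf_version.split(".")[:2])
    return (major, minor) > (3, 20)


if __name__ == "__main__":
    try:
        installed = metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        print("protobuf is not installed in this interpreter")
    else:
        print(installed, "rejects legacy stubs:", rejects_legacy_stubs(installed))
```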

To fix this issue:

basilnsage commented 6 months ago

Tagging @cep21 and @cswatt who have worked on this before, if you'd be so kind as to have a look please.

cep21 commented 6 months ago

All of those fixes seem reasonable. As Datadog is officially supporting the NVIDIA DCGM Exporter now, I've deprecated the nvml plugin internally. It may be best to mark it as deprecated here as well. Someone could also modify the plugin to refuse to install on newer Datadog Agent versions, but I won't have time to contribute this.

tmart-ops commented 4 months ago

datadog-agent updates have broken this integration for me as well. I've been able to use the DCGM exporter, but it requires running the DCGM exporter container, which is less than ideal on a machine that doesn't run Docker.

maxgio92 commented 2 weeks ago

While it's not an optimal workaround, I've made the check work using the pure-Python Protobuf implementation:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python agent check nvml
...

  Running Checks
  ==============

    nvml (1.0.9)
    ------------
      Instance ID: nvml:b6f35e1900952b0b [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/nvml.yaml
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms
      Last Execution Date : 2024-11-11 12:28:56 UTC (1731328136000)
      Last Successful Execution Date : 2024-11-11 12:28:56 UTC (1731328136000)

  Metadata
  ========
    config.hash: nvml:b6f35e1900952b0b
    config.provider: file
Check has run only once, if some metrics are missing you can try again with --check-rate to see any other metric if available.
This check type has 1 instances. If you're looking for a different check instance, try filtering on a specific one using the --instance-filter flag or set --discovery-min-instances to a higher value

I guess this means the setting would need to be applied at the agent level, for all checks. I'm not aware of a way to use the non-C++ implementation for this check only.
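For a one-off run like the one above, the same environment variable can be set from a small wrapper script. This is a sketch only: the helper names are hypothetical, and it affects just the subprocess it launches, not the running Agent service.

```python
import os
import subprocess


def pure_python_protobuf_env(base=None):
    """Copy an environment and force protobuf's pure-Python implementation."""
    env = dict(os.environ if base is None else base)
    env["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return env


def run_with_pure_python_protobuf(command):
    """Run a command with the override set; return the CompletedProcess."""
    return subprocess.run(command, env=pure_python_protobuf_env(), check=False)


# Example call, equivalent to the shell command in the comment above:
#   run_with_pure_python_protobuf(["agent", "check", "nvml"])
```

Because the variable is read by the process that imports protobuf, every check loaded in that process uses the slower pure-Python parser, which matches the limitation described above.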

maxgio92 commented 2 weeks ago

To solve the issue at the root, I think we can release a new patch version of nvml that regenerates the Python protobuf code, with something like:

$ protoc --python_out=nvml/datadog_checks/nvml nvml/datadog_checks/nvml/api.proto
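Before regenerating, it may be worth checking that the local protoc meets the `>= 3.19.0` requirement from the error message. The sketch below parses the `libprotoc X.Y.Z` line printed by `protoc --version`; the function names are hypothetical and the threshold comes only from the traceback in this issue.

```python
import re

# Minimum protoc version named in the protobuf error message.
MIN_PROTOC = (3, 19, 0)


def parse_protoc_version(output: str):
    """Extract a (major, minor, patch) tuple from `protoc --version` output."""
    match = re.search(r"libprotoc\s+(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError(f"unrecognized protoc version output: {output!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad "25.1" to (25, 1, 0)
    return tuple(parts)


def protoc_is_new_enough(output: str, minimum=MIN_PROTOC) -> bool:
    """Return True if the reported protoc version is at least `minimum`."""
    return parse_protoc_version(output) >= minimum


# Example, with the output captured beforehand:
#   out = subprocess.run(["protoc", "--version"], capture_output=True, text=True).stdout
#   protoc_is_new_enough(out)
```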

maxgio92 commented 2 weeks ago

FYI, I've opened #2535, tested against Datadog Agent v7.59.0.