DataDog / integrations-extras

Community developed integrations and plugins for the Datadog Agent.
BSD 3-Clause "New" or "Revised" License

NVML Integration - "unable to import module 'nvml': No module named 'nvml'" #2154

Closed B3nihana closed 10 months ago

B3nihana commented 1 year ago

Output of the info page

Getting the status from the agent.

===============
Agent (v7.48.1)
===============

  Status date: 2023-10-26 08:55:52.427 UTC (1698310552427)
  Agent start: 2023-10-26 08:49:57.155 UTC (1698310197155)
  Pid: 35405
  Go Version: go1.20.8
  Python Version: 3.9.18
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -946µs
    System time: 2023-10-26 08:55:52.427 UTC (1698310552427)

  Host Info
  =========
    bootTime: 2023-01-20 11:59:21 UTC (1674215961000)
    hostId: c433effb-460b-4595-888d-9924e84245b0
    kernelArch: x86_64
    kernelVersion: 4.15.0-197-generic
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 18.04
    procs: 1511
    uptime: 6692h50m49s
    virtualizationRole: host
    virtualizationSystem: kvm

  Hostnames
  =========
    hostname: _removed_
    socket-fqdn: _removed_
    socket-hostname: _removed_
    hostname provider: os
    unused hostname providers:
      'hostname' configuration/environment: hostname is empty
      'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled
      aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames
      azure: azure_hostname_style is set to 'os'
      container: the agent is not containerized
      fargate: agent is not runnning on Fargate
      fqdn: 'hostname_fqdn' configuration is not enabled
      gce: unable to retrieve hostname from GCE: GCE metadata API error: Get "http://169.254.169.254/computeMetadata/v1/instance/hostname": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  Metadata
  ========
    agent_version: 7.48.1
    config_apm_dd_url:
    config_dd_url:
    config_logs_dd_url:
    config_logs_socks5_proxy_address:
    config_no_proxy: [169.254.169.254 100.100.100.200]
    config_process_dd_url:
    config_proxy_http:
    config_proxy_https:
    config_site:
    feature_apm_enabled: true
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_cws_network_enabled: true
    feature_cws_remote_config_enabled: false
    feature_cws_security_profiles_enabled: false
    feature_dynamic_instrumentation_enabled: false
    feature_fips_enabled: false
    feature_imdsv2_enabled: false
    feature_logs_enabled: false
    feature_networks_enabled: false
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_oom_kill_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: true
    feature_process_language_detection_enabled: false
    feature_processes_container_enabled: true
    feature_remote_configuration_enabled: true
    feature_tcp_queue_length_enabled: false
    feature_usm_enabled: false
    feature_usm_go_tls_enabled: false
    feature_usm_http2_enabled: false
    feature_usm_http_by_status_code_enabled: false
    feature_usm_istio_enabled: false
    feature_usm_java_tls_enabled: false
    feature_usm_kafka_enabled: false
    flavor: agent
    hostname_source: os
    install_method_installer_version: install_script-1.12.0
    install_method_tool: install_script
    install_method_tool_version: install_script_agent7
    system_probe_core_enabled: true
    system_probe_gateway_lookup_enabled: true
    system_probe_kernel_headers_download_enabled: false
    system_probe_max_connections_per_message: 600
    system_probe_prebuilt_fallback_enabled: true
    system_probe_protocol_classification_enabled: true
    system_probe_root_namespace_enabled: true
    system_probe_runtime_compilation_enabled: false
    system_probe_telemetry_enabled: true
    system_probe_track_tcp_4_connections: true
    system_probe_track_tcp_6_connections: true
    system_probe_track_udp_4_connections: true
    system_probe_track_udp_6_connections: true

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 24
      Metric Samples: Last Run: 9, Total: 209
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-10-26 08:55:44 UTC (1698310544000)
      Last Successful Execution Date : 2023-10-26 08:55:44 UTC (1698310544000)

    disk (5.0.0)
    ------------
      Instance ID: disk:67cc0574430a16ba [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 24
      Metric Samples: Last Run: 347, Total: 8,328
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 42ms
      Last Execution Date : 2023-10-26 08:55:51 UTC (1698310551000)
      Last Successful Execution Date : 2023-10-26 08:55:51 UTC (1698310551000)

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 23
      Metric Samples: Last Run: 5, Total: 115
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-10-26 08:55:43 UTC (1698310543000)
      Last Successful Execution Date : 2023-10-26 08:55:43 UTC (1698310543000)

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 24
      Metric Samples: Last Run: 275, Total: 6,411
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms
      Last Execution Date : 2023-10-26 08:55:50 UTC (1698310550000)
      Last Successful Execution Date : 2023-10-26 08:55:50 UTC (1698310550000)

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 23
      Metric Samples: Last Run: 6, Total: 138
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-10-26 08:55:42 UTC (1698310542000)
      Last Successful Execution Date : 2023-10-26 08:55:42 UTC (1698310542000)

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 24
      Metric Samples: Last Run: 20, Total: 480
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-10-26 08:55:49 UTC (1698310549000)
      Last Successful Execution Date : 2023-10-26 08:55:49 UTC (1698310549000)

    network (3.0.0)
    ---------------
      Instance ID: network:4b0649b7e11f0772 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 23
      Metric Samples: Last Run: 162, Total: 3,726
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 6ms
      Last Execution Date : 2023-10-26 08:55:41 UTC (1698310541000)
      Last Successful Execution Date : 2023-10-26 08:55:41 UTC (1698310541000)

    ntp
    ---
      Instance ID: ntp:3c427a42a70bbf8 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 1, Total: 1
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 26ms
      Last Execution Date : 2023-10-26 08:50:02 UTC (1698310202000)
      Last Successful Execution Date : 2023-10-26 08:50:02 UTC (1698310202000)

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 24
      Metric Samples: Last Run: 1, Total: 24
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-10-26 08:55:48 UTC (1698310548000)
      Last Successful Execution Date : 2023-10-26 08:55:48 UTC (1698310548000)

  Loading Errors
  ==============
    nvml
    ----
      Core Check Loader:
        Check nvml not found in Catalog

      JMX Check Loader:
        check is not a jmx check, or unable to determine if it's so

      Python Check Loader:
        unable to import module 'nvml': No module named 'nvml'

========
JMXFetch
========

  Information
  ==================
  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    ClusterRole: 0
    ClusterRoleBinding: 0
    CronJob: 0
    CustomResource: 0
    CustomResourceDefinition: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    HighPriorityQueueFull: 0
    HorizontalPodAutoscaler: 0
    Ingress: 0
    Job: 0
    Namespace: 0
    Node: 0
    OrchestratorManifest: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 0
    RoleBinding: 0
    Service: 0
    ServiceAccount: 0
    StatefulSet: 0
    VerticalPodAutoscaler: 0

  Transaction Successes
  =====================
    Total number: 52
    Successes By Endpoint:
      check_run_v1: 23
      intake: 5
      metadata_v1: 1
      series_v2: 23

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

  API Keys status
  ===============
    API key ending with 8028c: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.eu - API Key ending with:
      - 8028c

==========
Logs Agent
==========

  Logs Agent is not running

=============
Process Agent
=============

  Version: 7.48.1
  Status date: 2023-10-26 08:55:52.433 UTC (1698310552433)
  Process Agent Start: 2023-10-26 08:49:57.237 UTC (1698310197237)
  Pid: 35406
  Go Version: go1.20.8
  Build arch: amd64
  Log Level: info
  Enabled Checks: [process rtprocess]
  Allocated Memory: 37,287,728 bytes
  Hostname: _removed_
  System Probe Process Module Status: Not running
  Process Language Detection Enabled: False

  =================
  Process Endpoints
  =================
    https://process.datadoghq.eu - API Key ending with:
        - 8028c

  =========
  Collector
  =========
    Last collection time: 2023-10-26 08:55:42
    Docker socket: /var/run/docker.sock
    Number of processes: 767
    Number of containers: 0
    Process Queue length: 0
    RTProcess Queue length: 0
    Connections Queue length: 0
    Event Queue length: 0
    Pod Queue length: 0
    Process Bytes enqueued: 0
    RTProcess Bytes enqueued: 0
    Connections Bytes enqueued: 0
    Event Bytes enqueued: 0
    Pod Bytes enqueued: 0
    Drop Check Payloads: []

=========
APM Agent
=========
  Status: Running
  Pid: 35408
  Uptime: 355 seconds
  Mem alloc: 13,490,440 bytes
  Hostname: _removed_
  Receiver: localhost:8126
  Endpoints:
    https://trace.agent.datadoghq.eu

  Receiver (previous minute)
  ==========================
    No traces received in the previous minute.

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

==========
Aggregator
==========
  Checks Metric Sample: 19,834
  Dogstatsd Metric Sample: 3,731
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 23
  Series Flushed: 18,365
  Service Check: 1
  Service Checks Flushed: 24

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 3,730
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 334,935
  Udp Packet Reading Errors: 0
  Udp Packets: 2,222
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
  Unterminated Metric Errors: 0

====================
Remote Configuration
====================

    Organization enabled: False
    API Key: Not authorized, add the Remote Configuration Read permission to enable it for this agent.
    Last error: None

====
OTLP
====

  Status: Not enabled
  Collector status: Not running

Additional environment details (Operating System, Cloud provider, etc):

Steps to reproduce the issue:

  1. Install Ubuntu DD Agent
  2. Install NVML integration
  3. Check the host in Datadog

Describe the results you received:

Datadog’s nvml integration is reporting: Instance #initialization[ERROR]: {"Core Check Loader":"Check nvml not found in Catalog","JMX Check Loader":"check is not a jmx check, or unable to determine if it's so","Python Check Loader":"unable to import module 'nvml': No module named 'nvml'"}
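The initialization error above is a JSON object with one entry per check loader. As a minimal sketch (the helper name `loader_errors` is hypothetical, not part of the Agent), it can be split into its parts to see which loader reported what. Since the nvml check is a Python integration, the "Python Check Loader" entry is presumably the one that matters here.

```python
import json

# The initialization error exactly as reported in this issue.
raw = (
    '{"Core Check Loader":"Check nvml not found in Catalog",'
    '"JMX Check Loader":"check is not a jmx check, or unable to determine if it\'s so",'
    '"Python Check Loader":"unable to import module \'nvml\': No module named \'nvml\'"}'
)


def loader_errors(payload: str) -> dict:
    """Return a mapping of loader name -> failure reason (hypothetical helper)."""
    return json.loads(payload)


errors = loader_errors(raw)
for loader, reason in errors.items():
    print(f"{loader}: {reason}")
```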

Describe the results you expected: Metrics for the GPUs

Additional information you deem important (e.g. issue happens only occasionally): The integration has worked previously, and changing the version has helped in the past, but not recently. This also reproduces across several machines.

alopezz commented 11 months ago

Did you install the dependencies as per the instructions?

Specifically

# You may also need to install dependencies since those aren't packaged into the wheel
sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml

B3nihana commented 11 months ago

> Did you install the dependencies as per the instructions?
>
> Specifically
>
> # You may also need to install dependencies since those aren't packaged into the wheel
> sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml

I did follow the instructions, but I just checked again on the machines with this issue, and running the command you quoted results in:

Requirement already satisfied: grpcio in /opt/datadog-agent/embedded/lib/python3.9/site-packages (1.59.2)
Requirement already satisfied: pynvml in /opt/datadog-agent/embedded/lib/python3.9/site-packages (11.5.0)
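Since pip reports the dependencies as present, a possible next step is to check what the Agent's embedded interpreter can actually locate. The sketch below is an illustration, not an official diagnostic: the helper names are hypothetical, and the distribution name `datadog-nvml` and the module path `datadog_checks.nvml` are assumptions based on how the integration is installed and on the traceback paths in this thread.

```python
import importlib.util
from importlib import metadata


def module_available(name: str) -> bool:
    """True if `name` can be located on the current interpreter's path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages; a missing parent raises ImportError.
        return False


def dist_version(dist: str):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


# Module names (what gets imported) and distribution names (what pip installs).
# "datadog_checks.nvml" and "datadog-nvml" are assumptions, see the note above.
for module in ("pynvml", "grpc", "datadog_checks.nvml"):
    print(f"module {module}: {'found' if module_available(module) else 'MISSING'}")
for dist in ("pynvml", "grpcio", "datadog-nvml"):
    print(f"distribution {dist}: {dist_version(dist) or 'not installed'}")
```

To be meaningful, this would need to run under the same interpreter the Agent uses, presumably the `python3` next to the `pip3` in `/opt/datadog-agent/embedded/bin/`, and as the `dd-agent` user.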

alopezz commented 10 months ago

Can you try running the check manually with debug log output? Like this:

agent check --log-level DEBUG nvml

It should be possible to see a traceback towards the end of the output that might help us figure out what the actual error is.
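Debug output can be long, so it may help to pull out just the traceback. The following is a rough sketch with a hypothetical helper, `extract_tracebacks`. It assumes that frame lines are indented deeper than the `Traceback` marker line and that the exception line returns to the marker's indentation, as in the excerpt posted later in this thread.

```python
MARKER = "Traceback (most recent call last):"


def indent_of(line: str) -> int:
    return len(line) - len(line.lstrip())


def extract_tracebacks(text: str) -> list:
    """Collect each traceback block (marker, frames, exception line) from `text`."""
    lines = text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        if MARKER not in lines[i]:
            i += 1
            continue
        base = indent_of(lines[i])
        block = [lines[i].strip()]
        i += 1
        # Frame lines are indented deeper than the marker line.
        while i < len(lines) and lines[i].strip() and indent_of(lines[i]) > base:
            block.append(lines[i].strip())
            i += 1
        # The first line back at the marker's indentation is the exception itself.
        if i < len(lines) and lines[i].strip():
            block.append(lines[i].strip())
            i += 1
        blocks.append("\n".join(block))
    return blocks


# Shortened sample in the same layout as the check output quoted in this thread.
sample = """\
    Error: module 'pynvml' has no attribute 'nvmlDeviceGetComputeRunningProcesses_v2'
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/base.py", line 1235, in run
          self.check(instance)
      AttributeError: module 'pynvml' has no attribute 'nvmlDeviceGetComputeRunningProcesses_v2'
"""

tracebacks = extract_tracebacks(sample)
print(tracebacks[0])
```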

B3nihana commented 10 months ago

Here is the output. nvml-debug.txt

Interestingly, the error is now different (I've updated the Datadog agent to 1.7.50 and pip3 to 23.3.2, but made no other changes).

    Error: module 'pynvml' has no attribute 'nvmlDeviceGetComputeRunningProcesses_v2'
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/base.py", line 1235, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/nvml/nvml.py", line 103, in check
          self.gather(instance)
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/nvml/nvml.py", line 116, in gather
          self.gather_gpu(handle, tags)
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/nvml/nvml.py", line 175, in gather_gpu
          compute_running_processes = NvmlCheck.N.nvmlDeviceGetComputeRunningProcesses_v2(handle)
      AttributeError: module 'pynvml' has no attribute 'nvmlDeviceGetComputeRunningProcesses_v2'

alopezz commented 10 months ago

Yeah, this is a different error, and fortunately it seems to have been solved in a more recent release of the integration (1.0.9). This was originally reported here. Can you install that version and see if it works now?
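For context, an `AttributeError` like this typically appears when a library renames or drops a versioned function between releases. One general way to tolerate that is to look the function up by name with fallbacks. The sketch below illustrates the pattern only and is not necessarily how release 1.0.9 addresses it: `resolve_first` is a hypothetical helper, a stand-in object is used in place of `pynvml`, and the assumption that the unsuffixed `nvmlDeviceGetComputeRunningProcesses` is available in the installed release is not confirmed in this thread.

```python
from types import SimpleNamespace


def resolve_first(module, candidates):
    """Return the first attribute named in `candidates` that `module` defines."""
    for name in candidates:
        fn = getattr(module, name, None)
        if fn is not None:
            return fn
    raise AttributeError(f"{module!r} defines none of: {', '.join(candidates)}")


# Stand-in for a pynvml release without the _v2 name (assumption for illustration).
fake_pynvml = SimpleNamespace(
    nvmlDeviceGetComputeRunningProcesses=lambda handle: [],
)

get_procs = resolve_first(
    fake_pynvml,
    (
        "nvmlDeviceGetComputeRunningProcesses_v2",  # name the failing check calls
        "nvmlDeviceGetComputeRunningProcesses",     # assumed fallback name
    ),
)
print(get_procs(None))
```

Preferring the most specific name first keeps behaviour unchanged on releases that still provide it, while the final `AttributeError` lists every name that was tried.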

B3nihana commented 10 months ago

Ah, I was checking the Datadog NVML Integration Release Notes, which still show 1.0.8 as the latest version.

DD NVML Release notes

After updating to 1.0.9, the check status comes back as OK. No data in the dashboard yet, but I'm happy to close the issue and re-open it if it still doesn't work.

Thanks for the help!