[[instances]]
# Path to the file that contains the DCGM fields to collect
collectors = "conf/input.dcgm/default-counters.csv"
# Enable mapping of metrics to Kubernetes pods
# kubernetes = false
# Type of GPU ID used to map Kubernetes resources to pods. Possible values: "uid", "device-name"
# kubernetes-gpu-id-type = "uid"
# Use old 1.x namespace
# use-old-namespace = false
cpu-devices = "f"
# GPU devices
devices = "f"
switch-devices = "f"
# ConfigMap <NAMESPACE>:<NAME> for metric data
configmap-data = "none"
# Connect to remote hostengine at <HOST>:<PORT>
# remote-hostengine-info = "localhost:5555"
# Accept GPUs that are fake, for testing purposes only
# fake-gpus = false
# Replaces every blank space in the GPU model name with a dash, ensuring a continuous, space-free identifier.
# replace-blanks-in-model-name = false
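Since the log reports "no fields to watch" for several device types, it may help to confirm which fields the `collectors` file actually enables. The following is a minimal sketch, assuming the file follows the dcgm-exporter CSV layout (DCGM field, Prometheus metric type, help text, with `#` starting a comment line). `SAMPLE` is hypothetical illustration data, not the contents of the reporter's `default-counters.csv`, and `enabled_fields` is a helper name invented for this example.

```python
import csv
import io

# Hypothetical sample in the assumed dcgm-exporter CSV layout:
# DCGM field, Prometheus metric type, help message.
SAMPLE = """\
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).

DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
"""


def enabled_fields(csv_text):
    """Return (field, metric_type) pairs for every non-comment row."""
    fields = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and comment lines.
        if not row or not row[0].strip() or row[0].strip().startswith("#"):
            continue
        cells = [cell.strip() for cell in row]
        # A usable row needs at least a field name and a metric type.
        if len(cells) < 2:
            continue
        fields.append((cells[0], cells[1]))
    return fields


if __name__ == "__main__":
    for field, metric_type in enabled_fields(SAMPLE):
        print(field, metric_type)
```

To check the real file, read `conf/input.dcgm/default-counters.csv` and pass its text to `enabled_fields` instead of `SAMPLE`.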
Logs from categraf
2024/06/25 11:52:55 main.go:149: I! runner.binarydir: /root/categraf-v0.3.69-linux-amd64-with-cgo-plugin
2024/06/25 11:52:55 main.go:150: I! runner.hostname: iv-ytcid8dd
2024/06/25 11:52:55 main.go:151: I! runner.fd_limits: (soft=65535, hard=65535)
2024/06/25 11:52:55 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2024/06/25 11:52:55 provider_manager.go:60: I! use input provider: [local]
2024/06/25 11:52:55 prometheus_agent.go:19: I! prometheus scraping disabled!
2024/06/25 11:52:55 ibex_agent.go:19: I! ibex agent disabled!
2024/06/25 11:52:55 agent.go:38: I! agent starting
2024/06/25 11:52:55 metrics_agent.go:316: I! input: local.cpu started
2024/06/25 11:52:55 metrics_reader.go:54: D! local.cpu : before gather once
2024/06/25 11:52:55 exporter.go:151: Starting dcgm-exporter
2024/06/25 11:52:55 metrics_reader.go:60: D! local.cpu : after gather once, duration: 121.57µs
2024/06/25 11:52:55 exporter.go:155: &{CollectorsFile:conf/input.dcgm/default-counters.csv Address: CollectInterval:0 Kubernetes:false KubernetesGPUIdType: CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo: GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:0 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:0}
2024/06/25 11:52:55 exporter.go:170: DCGM successfully initialized!
2024/06/25 11:52:55 exporter.go:181: Collecting DCP Metrics
INFO[0000] Falling back to metric file 'conf/input.dcgm/default-counters.csv'
INFO[0000] Initializing system entities of type: GPU
2024/06/25 11:52:55 exporter.go:215: Not collecting NvSwitch metrics; no fields to watch for device type: 3
2024/06/25 11:52:55 exporter.go:215: Not collecting NvLink metrics; no fields to watch for device type: 6
2024/06/25 11:52:55 exporter.go:215: Not collecting CPU metrics; no fields to watch for device type: 7
2024/06/25 11:52:55 exporter.go:215: Not collecting CPU Core metrics; no fields to watch for device type: 8
2024/06/25 11:52:55 metrics_agent.go:316: I! input: local.dcgm started
2024/06/25 11:52:55 metrics_agent.go:241: E! input: local.nvidia_smi.bak not supported
2024/06/25 11:52:55 agent.go:46: I! [*agent.MetricsAgent] started
2024/06/25 11:52:55 agent.go:49: I! agent started
2024/06/25 11:52:55 metrics_reader.go:54: D! local.dcgm : before gather once
11:52:55 DCGM_FI_DEV_XID_ERRORS DCGM_FI_DRIVER_VERSION=0 Hostname=iv-ytcid8dd UUID=GPU-751ce948 agent_hostname=iv-ytcid8dd device=nvidia0 gpu=0 modelName=NVIDIA L20 0
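To inspect which metrics carry a zero value, the debug sample line above can be split into its parts. This is a rough sketch that assumes the layout seen in that one line: a time, the metric name, space-separated `key=value` labels (a label value may itself contain spaces, as in `modelName=NVIDIA L20`), and the value as the last token. `parse_sample` is a hypothetical helper, not part of categraf.

```python
def parse_sample(line):
    """Split a debug sample line into (time, metric name, labels, value)."""
    tokens = line.split()
    timestamp, name = tokens[0], tokens[1]
    value = float(tokens[-1])  # the last token is assumed to be the value

    labels = {}
    last_key = None
    for token in tokens[2:-1]:
        if "=" in token:
            # A token with "=" starts a new label.
            last_key, label_value = token.split("=", 1)
            labels[last_key] = label_value
        elif last_key is not None:
            # A token without "=" continues the previous label's value.
            labels[last_key] += " " + token
    return timestamp, name, labels, value


LINE = (
    "11:52:55 DCGM_FI_DEV_XID_ERRORS DCGM_FI_DRIVER_VERSION=0 "
    "Hostname=iv-ytcid8dd UUID=GPU-751ce948 agent_hostname=iv-ytcid8dd "
    "device=nvidia0 gpu=0 modelName=NVIDIA L20 0"
)

if __name__ == "__main__":
    timestamp, name, labels, value = parse_sample(LINE)
    print(name, labels["modelName"], value)
```

Running the same parser over every sample line in the debug output would show whether all metrics report 0 or only some of them.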
System info
categraf-v0.3.69-linux-amd64-with-cgo-plugin, Debian 5.4.250-2-velinux1u3
Docker
No response
Steps to reproduce
...
Expected behavior
Categraf should report the correct GPU metrics and their values.
Actual behavior
The GPU metric values received are 0.
Additional info
No response