flashcatcloud / categraf

one-stop telemetry collector for nightingale
https://flashcat.cloud/docs/
MIT License

categraf-v0.3.69-linux-amd64-with-cgo does not report correct GPU metric values #980

Open EnyaWong opened 1 week ago

EnyaWong commented 1 week ago

Relevant config.toml

[[instances]]
# path to the file, that contains the DCGM fields to collect
collectors = "conf/input.dcgm/default-counters.csv"

# Enable kubernetes mapping metrics to kubernetes pods
# kubernetes=false

# Choose Type of GPU ID to use to map kubernetes resources to pods. Possible values: "uid", "device-name"
# kubernetes-gpu-id-type = "uid"

# Use old 1.x namespace
# use-old-namespace = false

  cpu-devices = "f"

# gpu devices
  devices = "f"

  switch-devices = "f"

# ConfigMap <NAMESPACE>:<NAME> for metric data
  configmap-data = "none"

# Connect to remote hostengine at <HOST>:<PORT>
# remote-hostengine-info = "localhost:5555"

# Accept GPUs that are fake, for testing purposes only
# fake-gpus = false

# Replaces every blank space in the GPU model name with a dash, ensuring a continuous, space-free identifier.
# replace-blanks-in-model-name = false

Logs from categraf

2024/06/25 11:52:55 main.go:149: I! runner.binarydir: /root/categraf-v0.3.69-linux-amd64-with-cgo-plugin
2024/06/25 11:52:55 main.go:150: I! runner.hostname: iv-ytcid8dd
2024/06/25 11:52:55 main.go:151: I! runner.fd_limits: (soft=65535, hard=65535)
2024/06/25 11:52:55 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2024/06/25 11:52:55 provider_manager.go:60: I! use input provider: [local]
2024/06/25 11:52:55 prometheus_agent.go:19: I! prometheus scraping disabled!
2024/06/25 11:52:55 ibex_agent.go:19: I! ibex agent disabled!
2024/06/25 11:52:55 agent.go:38: I! agent starting
2024/06/25 11:52:55 metrics_agent.go:316: I! input: local.cpu started
2024/06/25 11:52:55 metrics_reader.go:54: D! local.cpu : before gather once
2024/06/25 11:52:55 exporter.go:151: Starting dcgm-exporter
2024/06/25 11:52:55 metrics_reader.go:60: D! local.cpu : after gather once, duration: 121.57µs
2024/06/25 11:52:55 exporter.go:155: &{CollectorsFile:conf/input.dcgm/default-counters.csv Address: CollectInterval:0 Kubernetes:false KubernetesGPUIdType: CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo: GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:0 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:0}
2024/06/25 11:52:55 exporter.go:170: DCGM successfully initialized!
2024/06/25 11:52:55 exporter.go:181: Collecting DCP Metrics
INFO[0000] Falling back to metric file 'conf/input.dcgm/default-counters.csv'
INFO[0000] Initializing system entities of type: GPU
2024/06/25 11:52:55 exporter.go:215: Not collecting NvSwitch metrics; no fields to watch for device type: 3
2024/06/25 11:52:55 exporter.go:215: Not collecting NvLink metrics; no fields to watch for device type: 6
2024/06/25 11:52:55 exporter.go:215: Not collecting CPU metrics; no fields to watch for device type: 7
2024/06/25 11:52:55 exporter.go:215: Not collecting CPU Core metrics; no fields to watch for device type: 8
2024/06/25 11:52:55 metrics_agent.go:316: I! input: local.dcgm started
2024/06/25 11:52:55 metrics_agent.go:241: E! input: local.nvidia_smi.bak not supported
2024/06/25 11:52:55 agent.go:46: I! [*agent.MetricsAgent] started
2024/06/25 11:52:55 agent.go:49: I! agent started
2024/06/25 11:52:55 metrics_reader.go:54: D! local.dcgm : before gather once
11:52:55 DCGM_FI_DEV_XID_ERRORS DCGM_FI_DRIVER_VERSION=0 Hostname=iv-ytcid8dd UUID=GPU-751ce948 agent_hostname=iv-ytcid8dd device=nvidia0 gpu=0 modelName=NVIDIA L20 0

System info

categraf-v0.3.69-linux-amd64-with-cgo-plugin, Debian 5.4.250-2-velinux1u3

Docker

No response

Steps to reproduce

  1. Download categraf-v0.3.69-linux-amd64-with-cgo-plugin
  2. ./categraf --debug
  3. ...

Expected behavior

Expected to receive the correct GPU metrics and their values.

Actual behavior

The GPU metric values received are 0.
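
To inspect which metric and labels carry the zero value, the debug line from the logs above can be split into its parts. The following is only a rough sketch: it assumes the layout `<time> <metric> <key=value ...> <value>` seen in that single sample line, and `parse_metric_line` is a hypothetical helper, not part of categraf.

```python
import re

# Sample line copied from the categraf debug output in this report.
SAMPLE = ("11:52:55 DCGM_FI_DEV_XID_ERRORS DCGM_FI_DRIVER_VERSION=0 "
          "Hostname=iv-ytcid8dd UUID=GPU-751ce948 agent_hostname=iv-ytcid8dd "
          "device=nvidia0 gpu=0 modelName=NVIDIA L20 0")

# Assumed layout: "<time> <metric> <key=value ...> <value>".
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.*)\s+(\S+)$")
# A label value runs until the next " key=" or the end of the label text,
# so values containing spaces (e.g. "NVIDIA L20") stay intact.
LABEL_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_metric_line(line):
    """Hypothetical helper: return (metric, labels, value) or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    _time, name, label_text, value = m.groups()
    labels = dict(LABEL_RE.findall(label_text))
    return name, labels, float(value)


name, labels, value = parse_metric_line(SAMPLE)
print(name, labels["modelName"], value)
```

Running the same helper over every metric line in the debug output would show whether all DCGM fields report 0 or only some of them.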

Additional info

No response

kongfei605 commented 1 week ago

Is there nothing after the last log line? I don't see an "after gather once" entry.
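
A quick way to check this on a longer log is to pair each "before gather once" line with its "after gather once" line per input. This is a minimal sketch, assuming the debug line format shown in the logs above; `unfinished_gathers` is a hypothetical helper name, not something categraf provides.

```python
import re

# Matches categraf debug lines such as
# "... metrics_reader.go:54: D! local.dcgm : before gather once"
GATHER_RE = re.compile(r"D! (\S+) : (before|after) gather once")


def unfinished_gathers(log_text):
    """Return inputs whose gather started but has no matching 'after' line."""
    pending = {}
    for line in log_text.splitlines():
        m = GATHER_RE.search(line)
        if m is None:
            continue
        name, phase = m.groups()
        if phase == "before":
            pending[name] = pending.get(name, 0) + 1
        elif pending.get(name, 0) > 0:
            pending[name] -= 1
    return sorted(n for n, c in pending.items() if c > 0)


# Gather-related lines copied from the log in this report.
LOG = """\
2024/06/25 11:52:55 metrics_reader.go:54: D! local.cpu : before gather once
2024/06/25 11:52:55 metrics_reader.go:60: D! local.cpu : after gather once, duration: 121.57µs
2024/06/25 11:52:55 metrics_reader.go:54: D! local.dcgm : before gather once
"""

print(unfinished_gathers(LOG))  # prints ['local.dcgm']
```

For the posted log, only `local.dcgm` is left without a matching "after" line, which is what the question above is pointing at.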