flashcatcloud / categraf

one-stop telemetry collector for nightingale
https://flashcat.cloud/docs/
MIT License
854 stars 259 forks

nvidia_smi plugin reports an error #1056

Open Derek-zd opened 2 months ago

Derek-zd commented 2 months ago

Relevant config.toml

# interval = 15

# exec local command
# e.g. nvidia_smi_command = "nvidia-smi"
nvidia_smi_command = "nvidia-smi"

# exec remote command
# nvidia_smi_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null SSH_USER@SSH_HOST nvidia-smi"

# Comma-separated list of the query fields.
# You can find out possible fields by running `nvidia-smi --help-query-gpus`.
# The value `AUTO` will automatically detect the fields to query.
query_field_names = "AUTO"

# query_timeout is used to set the query timeout to avoid the delay of data collection.
query_timeout = "5s"

Logs from categraf

Sep 19 16:23:52 zj-4090-59 categraf[79833]: 2024/09/19 16:23:52 metrics_agent.go:276: E! failed to init input: local.nvidia_smi error: unexpected query field: vgpu_driver_capability.heterogenous_multivGPU

System info

Ubuntu 22.04

Docker

No response

Steps to reproduce

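1. Enable the nvidia_smi plugin.

  1. Monitoring data is reported normally.

  2. A GPU fails and drops off. This usually shows up as the nvidia-smi command hanging without returning, or failing with "Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error" and a non-zero return code. Running nvidia-smi -L shows both the healthy cards and the faulty card. Screenshot: 20240919-181818

  3. categraf stops reporting GPU monitoring data, so alerting no longer works. ...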

Expected behavior

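When a GPU drops off, there should be monitoring data that reflects it, and the monitoring data for the other healthy cards should continue to be reported normally.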
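The expected behavior above can be sketched as a per-GPU health signal derived from `nvidia-smi -L`, which according to this report lists both healthy and faulty cards. This is a minimal sketch, not categraf's actual implementation: the function `parseList`, the type `gpuStatus`, the metric name `nvidia_smi_gpu_up`, and the sample UUID are hypothetical, and the error-line pattern is assumed from the message quoted in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// gpuStatus is a hypothetical per-GPU health record.
type gpuStatus struct {
	ID string // GPU index for healthy cards, PCI bus ID for faulty ones
	Up bool
}

var (
	// Healthy line, e.g. "GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-...)".
	healthyLine = regexp.MustCompile(`^GPU (\d+): `)
	// Faulty line; the format is assumed from the error quoted in this issue.
	failedLine = regexp.MustCompile(
		`Unable to determine the device handle for GPU\s*` +
			`([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])`)
)

// parseList classifies each line of `nvidia-smi -L` output as a healthy
// or a faulty GPU and ignores lines that match neither pattern.
func parseList(out string) []gpuStatus {
	var res []gpuStatus
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if m := healthyLine.FindStringSubmatch(line); m != nil {
			res = append(res, gpuStatus{ID: m[1], Up: true})
		} else if m := failedLine.FindStringSubmatch(line); m != nil {
			res = append(res, gpuStatus{ID: m[1], Up: false})
		}
	}
	return res
}

func main() {
	// Sample input with a made-up UUID; real output comes from `nvidia-smi -L`.
	sample := "GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-aaaa)\n" +
		"Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error\n"
	for _, g := range parseList(sample) {
		up := 0
		if g.Up {
			up = 1
		}
		// Hypothetical metric name, printed in a Prometheus-like text form.
		fmt.Printf("nvidia_smi_gpu_up{gpu=%q} %d\n", g.ID, up)
	}
}
```

With a signal like this, an alert could fire on a value of 0 for a specific card instead of depending on missing data.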

Actual behavior

GPU-related monitoring data is no longer reported.

Additional info

No response

kongfei605 commented 2 months ago

What about enabling nvidia_timeout?

Derek-zd commented 2 months ago

img_v3_02et_5caedd98-552d-4db7-829d-bb3c3254286h This is the error from running nvidia-smi --query-gpu this time; see the part marked in red.

Derek-zd commented 2 months ago

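What about enabling nvidia_timeout?

Do you mean query_timeout = "5s" in the config? That is already enabled. When the nvidia-smi command hangs, it stays hung no matter what the timeout is, even if you wait a year. We can set the hang case aside: the plugin simply cannot work and cannot handle it. But in the screenshot I just posted above, damn, the command the plugin calls does return, only it returns an error, and it seems that cannot be handled either. No way out.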

kongfei605 commented 2 months ago

That should not happen. After the timeout, the kill command is invoked.

Derek-zd commented 2 months ago

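That should not happen. After the timeout, the kill command is invoked.

Killing it does not help. The next query hangs again, gets killed again, and so on in a loop. Since the GPU failure, no monitoring data has been reported at all.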

kongfei605 commented 2 months ago

Then you should fix the fault. The source itself is down, and you want the collector to fix it for you?

Derek-zd commented 1 month ago

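What I mean is that monitoring cannot collect any information about the fault, so I cannot configure the corresponding alerts.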

kongfei605 commented 1 month ago

For no data, you can use a function such as absent.