nvidia_smi plugin reports an error

Derek-zd commented 2 months ago

Relevant config.toml

# interval = 15

# exec local command
# e.g. nvidia_smi_command = "nvidia-smi"
nvidia_smi_command = "nvidia-smi"

# exec remote command
# nvidia_smi_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null SSH_USER@SSH_HOST nvidia-smi"

# Comma-separated list of the query fields.
# You can find out possible fields by running `nvidia-smi --help-query-gpu`.
# The value `AUTO` will automatically detect the fields to query.
query_field_names = "AUTO"

# query_timeout sets the query timeout so that a slow query does not delay data collection.
query_timeout = "5s"

Logs from categraf

Sep 19 16:23:52 zj-4090-59 categraf[79833]: 2024/09/19 16:23:52 metrics_agent.go:276: E! failed to init input: local.nvidia_smi error: unexpected query field: vgpu_driver_capability.heterogenous_multivGPU

System info

Ubuntu 22.04

Docker

No response

Steps to reproduce

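1. Enable the nvidia_smi plugin.
2. Monitoring data is reported normally.
3. A GPU fails and drops off. Typically the nvidia-smi command either hangs and never returns, or fails with "Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error" and a non-zero return code. Running nvidia-smi -L shows both the healthy GPUs and the failed one.
4. categraf stops reporting GPU metrics, so alerting no longer works. ...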

Expected behavior

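When a GPU drops off, there should be monitoring data that reflects the failure, and metrics for the other healthy GPUs should continue to be reported normally.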
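To illustrate the expected behavior, here is a minimal Go sketch that splits `nvidia-smi -L` style output into healthy GPUs and error lines, so the failed GPU becomes visible as data instead of stopping collection. This is not categraf's implementation: `ParseList`, `ListResult`, the two metric names, and the sample output are all assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// gpuLine matches a healthy entry in `nvidia-smi -L` output, for example:
// "GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-aaaa)".
var gpuLine = regexp.MustCompile(`^GPU (\d+): (.+) \(UUID: ([^)]+)\)$`)

// ListResult is a hypothetical container, not a categraf type.
type ListResult struct {
	Healthy []string // UUIDs of GPUs that were listed normally
	Errors  []string // lines that did not parse, e.g. driver error messages
}

// ParseList classifies each non-empty output line as a healthy GPU or an error.
func ParseList(output string) ListResult {
	var res ListResult
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if m := gpuLine.FindStringSubmatch(line); m != nil {
			res.Healthy = append(res.Healthy, m[3])
		} else {
			res.Errors = append(res.Errors, line)
		}
	}
	return res
}

func main() {
	// Illustrative sample only; real output depends on the driver version.
	sample := "GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-aaaa)\n" +
		"Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error\n"
	res := ParseList(sample)
	// Hypothetical metric names, chosen for this sketch.
	fmt.Printf("nvidia_smi_gpu_healthy_count %d\n", len(res.Healthy))
	fmt.Printf("nvidia_smi_gpu_error_count %d\n", len(res.Errors))
}
```

The healthy UUIDs could then be queried one at a time with `nvidia-smi -i <uuid> --query-gpu=...`, so that one failed GPU does not block metrics for the rest. Whether that fits the plugin's design is an open question.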

Actual behavior

GPU-related metrics are no longer reported.

Additional info

No response

kongfei605 commented 2 months ago

What about enabling nvidia_timeout?

Derek-zd commented 2 months ago

[image: img_v3_02et_5caedd98-552d-4db7-829d-bb3c3254286h] This is the error from running nvidia-smi --query-gpu this time; see the part highlighted in red.

Derek-zd commented 2 months ago

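What about enabling nvidia_timeout?

Do you mean query_timeout = "5s" in the config? That is already enabled. When the nvidia-smi command hangs, it stays hung whatever the timeout value is, even if you wait a year. We can set the hang case aside: the plugin simply cannot work then, and there is nothing it can do. But in the case from the screenshot I just posted, the command the plugin calls does return, only with an error, and it seems that cannot be handled either. I see no solution.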

kongfei605 commented 2 months ago

That should not happen. After the timeout, the kill command is invoked.

Derek-zd commented 2 months ago

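That should not happen. After the timeout, the kill command is invoked.

Killing it does not help. The next query hangs again and gets killed again, over and over. Since the GPU failed, no monitoring data has been reported at all.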

kongfei605 commented 2 months ago

Then you should fix the fault. The source itself is down, and you want the collector to fix it for you?

Derek-zd commented 1 month ago

What I mean is that monitoring cannot collect any information about the fault, so I cannot configure a corresponding alert.

kongfei605 commented 1 month ago

For the no-data case, you can use a function such as absent.

flashcatcloud / categraf