flashcatcloud / categraf

one-stop telemetry collector for nightingale
https://flashcat.cloud/docs/
MIT License

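Adding a nonexistent IP to the snmp plugin (unreachable by ping, or reachable but collection fails) makes collection fail for every target in agents, so no data is reported #933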

Open robotneo opened 6 months ago

robotneo commented 6 months ago

Relevant config.toml

interval = 30

[[instances]]
agents = [
    "udp://172.16.42.1",
    "udp://172.16.42.2",
    "udp://172.16.42.3"  # this IP is unreachable by ping or does not exist
]

# metrics_name_prefix = "dell_"

interval_times = 1
labels = { region = "hangzhou", role = "idrac", brand = "dell" }

timeout = "5s"
version = 2
community = "public"
path = ["/opt/categraf/mibs/dell"]
translator = "gosmi"
agent_host_tag = "device_ip"
retries = 3
max_repetitions = 20

# iDRAC version: racShortName
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.1.2.0"
name = "dell_idrac_shortname"
is_tag = true

# iDRAC firmware version: racFirmwareVersion
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.1.5.0"
name = "dell_idrac_firmware"
is_tag = true

# URL of the iDRAC web interface
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.1.6.0"
name = "dell_idrac_url"
is_tag = true

# # Device name: iDRAC system service tag
# [[instances.field]]
# oid = ".1.3.6.1.2.1.1.5.0"
# name = "system_name"
# is_tag = true

# Service tag
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.3.2.0"
name = "dell_service_tag"
is_tag = true

# Express service code
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.3.3.0"
name = "dell_express_service_code"
is_tag = true

# Operating system name
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.3.6.0"
name = "dell_os_name"
is_tag = true

# Model name
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.3.12.0"
name = "dell_model_name"
is_tag = true

# Operating system version
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.3.14.0"
name = "dell_os_version"
is_tag = true

# Unique identifier or ID of the system
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.1.3.13.0"
name = "dell_system_id"

# Global system status of the device: globalSystemStatus
# ObjectStatusEnum (INTEGER) 
# {other(1), unknown(2), ok(3), nonCritical(4), critical(5), nonRecoverable(6) }
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.2.1.0"
name = "dell_glob_system_status"

# LCD status of the device: systemLCDStatus
# ObjectStatusEnum (INTEGER) 
# {other(1), unknown(2), ok(3), nonCritical(4), critical(5), nonRecoverable(6) }
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.2.2.0"
name = "dell_system_lcd_status"

# Global storage status of the device: globalStorageStatus
# ObjectStatusEnum (INTEGER) 
# {other(1), unknown(2), ok(3), nonCritical(4), critical(5), nonRecoverable(6) }
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.2.3.0"
name = "dell_global_storage_status"

# System power state of the device: systemPowerState
# PowerStateStatusEnum (INTEGER) 
# {other(1), unknown(2), off(3), on(4) }
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.2.4.0"
name = "dell_system_power_state"

# System power-up time of the device, in seconds: systemPowerUpTime
# Unsigned32BitRange (INTEGER) 
# (0..2147483647)
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.2.5.0"
name = "dell_system_power_uptime"

# Number of entries in the event log: numEventLogEntries
#  Unsigned32BitRange (INTEGER) 
# (0..2147483647) 
[[instances.field]]
oid = ".1.3.6.1.4.1.674.10892.5.4.300.1.0"
name = "dell_num_event_log"

Logs from categraf

instances.go:176: agent udp://172.16.42.3 ins: performing get on field dell_idrac_shortname(.1.3.6.1.4.1.674.10892.5.1.1.2.0): request timeout (after 3 retries)
2024/05/16 10:46:38 table.go:405: E! snmp walk error:request timeout (after 3 retries), oid:.1.3.6.1.4.1.674.10892.5.4.300.50.1.8 
2024/05/16 10:46:38 instances.go:182: agent udp://172.16.42.3 ins: gathering table dell error: performing bulk walk for field bios_version(.1.3.6.1.4.1.674.10892.5.4.300.50.1.8): request timeout (after 3 retries)
2024/05/16 10:46:58 table.go:405: E! snmp walk error:request timeout (after 3 retries), oid:.1.3.6.1.4.1.674.10892.5.4.300.60.1.4 
2024/05/16 10:46:58 instances.go:182: agent udp://172.16.42.3 ins: gathering table dell error: performing bulk walk for field firmware_state(.1.3.6.1.4.1.674.10892.5.4.300.60.1.4): request timeout (after 3 retries)

System info

Ubuntu 22.04.3

Docker

The result is the same with both the binary and Docker.

Steps to reproduce

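Adding a nonexistent IP to the snmp collection plugin causes the whole collection to fail

Expected behavior

Adding a nonexistent IP to the snmp collection plugin causes the whole collection to fail

Actual behavior

Adding a nonexistent IP to the snmp collection plugin causes the whole collection to fail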

Additional info

No response

robotneo commented 6 months ago

Version: v0.3.63-cd87ed55ee9611a1208d801ae12f2d2fa481fb12

kongfei605 commented 6 months ago

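No, it won't. Every target has its own timeout and retry count; once that time is up, the healthy targets still report their metrics.

Try lowering max_repetitions to 10.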

robotneo commented 6 months ago

Sorry, I was too hasty. There was no data for a while, but everything looks normal now. I'll keep watching for a bit.

robotneo commented 6 months ago

After watching for a while, the sample data points do get interrupted. Here is a screenshot.

image
robotneo commented 6 months ago

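It works for a while, then the data breaks off. Is the polling within the collection interval causing the gaps? It also makes the healthy targets drop out for a while.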

image
UlricQin commented 6 months ago

VM (VictoriaMetrics) may have gap-filling logic. To see the real data you need to query a range vector and use the Table view, something like this:

image

That shows clearly at which timestamps the data was actually reported.

robotneo commented 6 months ago

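For example, the normal collected value of this sample is 3. When polling reaches 172.16.42.3 and nothing can be collected, does that cause gaps in the graphs of all the other healthy IPs as well?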

robotneo commented 6 months ago

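Metrics from targets that can be collected normally also get gaps in the VM graph? All because one bad apple spoils the whole barrel.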

robotneo commented 6 months ago
image

Let me pin down whether this is a categraf snmp problem or a VM problem. I'll test with telegraf now to see whether it reproduces.

robotneo commented 6 months ago

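1. With the categraf SNMP plugin writing to Prometheus, and every target in agents actually existing, reachable, and collectable, Graph queries show no gaps. Test duration: 15 to 20 minutes; narrowing the time range still shows no gaps.
2. After adding a target that is unreachable or cannot be collected, and writing to Prometheus, the gaps appear just as they do with VictoriaMetrics. The gaps and the log errors are shown below: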

image image image
kongfei605 commented 6 months ago

That is expected: everything in the cycle that hits the timeout will show a gap.

robotneo commented 6 months ago

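timeout = "5s" retries = 3

So that is 15 seconds? If one target cannot be collected, do all the targets wait together until it finishes? That does not seem reasonable to me. Instead of waiting for every sample to come back before exposing the data, could it be handled per instances, so that the healthy targets do not get gaps as well? That would be more reasonable.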
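To put rough numbers on the question above, here is a minimal Go sketch of the worst-case wait. It assumes each request makes one initial attempt plus `retries` further attempts, each waiting the full `timeout` with no backoff; that retry semantic is an assumption, not something confirmed in this thread. Under it, `timeout = "5s"` and `retries = 3` give 20 seconds per failing request rather than 15, which is at least consistent with the two table errors in the logs above being 20 seconds apart (10:46:38 and 10:46:58). The helper names `worstCaseWait` and `blockedFor` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseWait returns how long a single request can block before it gives
// up. ASSUMPTION: one initial attempt plus `retries` further attempts, each
// waiting the full timeout, with no backoff.
func worstCaseWait(timeout time.Duration, retries int) time.Duration {
	return timeout * time.Duration(retries+1)
}

// blockedFor returns how long an unreachable agent holds up a round when
// `requests` requests fail one after another (a hypothetical count).
func blockedFor(timeout time.Duration, retries, requests int) time.Duration {
	return worstCaseWait(timeout, retries) * time.Duration(requests)
}

func main() {
	timeout, retries := 5*time.Second, 3 // values from the config above
	interval := 30 * time.Second         // interval = 30

	fmt.Println("per failing request:", worstCaseWait(timeout, retries))

	// Example: the top-level fields plus two tables all time out in sequence.
	total := blockedFor(timeout, retries, 3)
	fmt.Println("three failing requests:", total, "longer than interval:", total > interval)
}
```

If the assumption holds, a single dead agent with a few tables can block for longer than the 30 second interval, which would explain gaps that span a whole cycle.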

robotneo commented 6 months ago

func (ins *Instance) Gather(slist *types.SampleList) {
    for i, agent := range ins.Agents {
        var wg sync.WaitGroup
        wg.Add(1)
        go func(i int, agent string) {
            defer wg.Done()
            // First is the top-level fields. We treat the fields as table prefixes with an empty index.
            t := Table{
                Name:   ins.Name,
                Fields: ins.Fields,

                DebugMode: ins.DebugMod,
            }
            for idx, f := range t.Fields {
                t.Fields[idx].Oid = strings.TrimSpace(f.Oid)
            }
            topTags := map[string]string{}
            for k, v := range ins.GetLabels() {
                topTags[k] = v
            }
            extraTags := map[string]string{}
            if m, ok := ins.Mappings[agent]; ok {
                extraTags = m
            }
            if !ins.DisableUp {
                ins.up(slist, i)
            }

            gs, err := ins.getConnection(i)
            if err != nil {
                log.Printf("agent %s ins: %s", agent, err)
                return
            }
            if err := ins.gatherTable(slist, gs, t, topTags, extraTags, false); err != nil {
                log.Printf("agent %s ins: %s", agent, err)
            }

            // Now is the real tables.
            for _, t := range ins.Tables {
                if err := ins.gatherTable(slist, gs, t, topTags, extraTags, true); err != nil {
                    log.Printf("agent %s ins: gathering table %s error: %s", agent, t.Name, err)
                }
            }
        }(i, agent)
        wg.Wait() // Wait for a single agent to finish, then store its result immediately. Change it like this?
    }
}

This should be the relevant section: it waits for all agents to finish collecting, and only then outputs the sample values.

kongfei605 commented 6 months ago

What would you like to change it to?

robotneo commented 6 months ago

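The idea is to stop one agent from affecting the data flush of all agents: collect and flush as you go, so that if one agent is down (network unreachable) the others are still flushed. That way the charts would not show gaps with no data just because every agent is waiting while one agent retries.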
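The "collect and flush as you go" idea can be sketched in Go roughly as follows. This is a standalone illustration, not categraf code: `result`, `collectFn`, and `gatherStreaming` are hypothetical names, and the collector is a stand-in that sleeps instead of speaking SNMP. Each agent runs in its own goroutine and its result is handed over as soon as it is ready, so the consumer can flush healthy agents without waiting for the slow one.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// result is a hypothetical per-agent outcome; it is not a categraf type.
type result struct {
	Agent   string
	Samples []string
	Err     error
}

// collectFn stands in for the real SNMP gathering of a single agent.
type collectFn func(agent string) ([]string, error)

// gatherStreaming starts one goroutine per agent and sends each result on the
// returned channel as soon as that agent finishes. The channel is buffered so
// no collector blocks on send, and it is closed once every agent is done.
func gatherStreaming(agents []string, collect collectFn) <-chan result {
	out := make(chan result, len(agents))
	var wg sync.WaitGroup
	for _, agent := range agents {
		wg.Add(1)
		go func(agent string) {
			defer wg.Done()
			samples, err := collect(agent)
			out <- result{Agent: agent, Samples: samples, Err: err}
		}(agent)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	agents := []string{"udp://172.16.42.1", "udp://172.16.42.2", "udp://172.16.42.3"}

	// Fake collector: the third agent only fails after a delay that stands in
	// for timeout plus retries.
	collect := func(agent string) ([]string, error) {
		if agent == "udp://172.16.42.3" {
			time.Sleep(200 * time.Millisecond)
			return nil, fmt.Errorf("request timeout")
		}
		return []string{"dell_glob_system_status=3"}, nil
	}

	for r := range gatherStreaming(agents, collect) {
		if r.Err != nil {
			fmt.Printf("agent %s: %v\n", r.Agent, r.Err)
			continue
		}
		// A real implementation would push the samples to the writer here.
		fmt.Printf("flush %s: %v\n", r.Agent, r.Samples)
	}
}
```

Note that this only changes when results are handed over. It does not bound how long the goroutine for a dead agent lives, so the overall cycle still needs a deadline of its own.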

kongfei605 commented 6 months ago

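Within one instances block the collection interval is the same for every target, so the timeout logic should be the same too. There is no need at all to configure such a large timeout and retry count.

If the targets are not controlled together, dead devices pile up probe goroutines, and the larger the timeout and retry settings, the more they pile up (the next round of probing starts while the previous round is still timing out and retrying).

The best way to handle this case is to add a target IP marker plus a side-channel probe. Compared with collecting non accessible oid, it is a lower priority.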
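One way to read "target IP marker plus side-channel probe" is sketched below in Go. It is a guess at the shape of such a design, not the maintainers' plan: `targetState`, `probeAll`, and `reachable` are hypothetical names, and the probe is a stand-in function. The probe runs on its own schedule and records which targets answered; the collection cycle then only visits targets currently marked reachable, so a dead device costs time in the probe path rather than in the cycle.

```go
package main

import (
	"fmt"
	"sync"
)

// targetState is a hypothetical marker store, not a categraf type. It records
// whether each target answered the most recent side-channel probe.
type targetState struct {
	mu sync.RWMutex
	up map[string]bool
}

func newTargetState() *targetState {
	return &targetState{up: make(map[string]bool)}
}

// mark records the outcome of one probe.
func (s *targetState) mark(target string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.up[target] = ok
}

// probeAll is the side-channel path. It is meant to run on its own schedule,
// outside the collection cycle.
func (s *targetState) probeAll(targets []string, probe func(string) bool) {
	for _, t := range targets {
		s.mark(t, probe(t))
	}
}

// reachable returns the targets the collection cycle should visit. Targets
// that have never been probed are included so they get tried at least once.
func (s *targetState) reachable(targets []string) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var out []string
	for _, t := range targets {
		if ok, seen := s.up[t]; !seen || ok {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	agents := []string{"udp://172.16.42.1", "udp://172.16.42.2", "udp://172.16.42.3"}
	state := newTargetState()

	// Stand-in probe: a real one might be a single cheap request with a short timeout.
	probe := func(target string) bool { return target != "udp://172.16.42.3" }

	state.probeAll(agents, probe)
	fmt.Println("collect from:", state.reachable(agents))
}
```

Because the probe keeps running, a target that comes back is marked reachable again on a later probe pass and rejoins the collection cycle.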

robotneo commented 6 months ago

Thanks for clearing that up. Is there a plan to do this?

kongfei605 commented 6 months ago

Yes, it is a long-term optimization item.