didi / KnowStreaming

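A one-stop cloud-native real-time streaming data platform that builds enterprise-grade Kafka services through a zero-intrusion, plugin-based approach, greatly lowering the barrier to operating, storing, and managing real-time streaming data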
https://knowstreaming.com
GNU Affero General Public License v3.0
6.9k stars 1.28k forks

Alert rules have no effect when LogiKM is integrated with Nightingale (夜莺, n9e) monitoring #354

Closed PengShuaixin closed 2 years ago

PengShuaixin commented 3 years ago

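Problem description

After an alert rule has been created successfully in KM, the Nightingale monitoring system does not raise an alert even when the rule's trigger condition is met.

Preliminary root-cause analysis

Inspecting the request made in the method com.xiaojukeji.kafka.manager.monitor.component.n9e.N9eService#sinkMetrics, the monitoring metrics sent to Nightingale look like this: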

[{"metric":"online-kafka-topic-msgIn","nid":"1","step":60,"tags":"cluster=TAGTIC_KAFKA,topic=events","timestamp":1626337144,"value":0.0},{"metric":"online-kafka-topic-bytesIn","nid":"1","step":60,"tags":"cluster=TAGTIC_KAFKA,topic=events","timestamp":1626337144,"value":0.0},{"metric":"online-kafka-topic-bytesRejected","nid":"1","step":60,"tags":"cluster=TAGTIC_KAFKA,topic=events","timestamp":1626337144,"value":0.0},{"metric":"online-kafka-topic-msgIn","nid":"1","step":60,"tags":"cluster=TAGTIC_KAFKA,topic=__consumer_offsets","timestamp":1626337144,"value":2.964393875E-314},{"metric":"online-kafka-topic-bytesIn","nid":"1","step":60,"tags":"cluster=TAGTIC_KAFKA,topic=__consumer_offsets","timestamp":1626337144,"value":2.964393875E-314},{"metric":"online-kafka-topic-bytesRejected","nid":"1","step":60,"tags":"cluster=TAGTIC_KAFKA,topic=__consumer_offsets","timestamp":1626337144,"value":0.0}]

In the data above, the tags field carries cluster=TAGTIC_KAFKA.
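To make the observation above easy to check, here is a minimal sketch that parses an abbreviated copy of the metric payload and collects the distinct cluster tag values. The helper name `parse_tags` is hypothetical and not part of KM or Nightingale; the sample keeps only two of the six data points quoted above.

```python
import json

# Abbreviated sample of the metric payload quoted above (two of the six points).
payload = """[
  {"metric": "online-kafka-topic-msgIn", "nid": "1", "step": 60,
   "tags": "cluster=TAGTIC_KAFKA,topic=events",
   "timestamp": 1626337144, "value": 0.0},
  {"metric": "online-kafka-topic-bytesIn", "nid": "1", "step": 60,
   "tags": "cluster=TAGTIC_KAFKA,topic=__consumer_offsets",
   "timestamp": 1626337144, "value": 2.964393875E-314}
]"""


def parse_tags(tag_string):
    """Split a 'k1=v1,k2=v2' tag string into a dict (hypothetical helper)."""
    pairs = (item.split("=", 1) for item in tag_string.split(",") if item)
    return {key: value for key, value in pairs}


points = json.loads(payload)

# Collect every distinct value of the cluster tag across the data points.
cluster_values = {parse_tags(point["tags"])["cluster"] for point in points}
print(cluster_values)
```

Every data point in the sample carries the cluster identifier, not the cluster's display name.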

Next, looking at the method com.xiaojukeji.kafka.manager.monitor.component.n9e.N9eService#createStrategy, the request sent to Nightingale to create the alert rule carries the following data:

{"alert_dur":60,"alert_upgrade":{"duration":60,"groups":[],"level":1,"users":[]},"callback":"","category":2,"converge":[300,1],"enable_days_of_week":[0,1,2,3,4,5,6,7],"enable_etime":"23:59","enable_stime":"00:00","excl_nid":[],"exprs":[{"eopt":"<","func":"all","metric":"online-kafka-topic-msgIn","params":[1],"threshold":0}],"name":"测试-A","need_upgrade":0,"nid":1,"notify_group":[1],"notify_user":[],"priority":3,"recovery_dur":0,"recovery_notify":0,"tags":[{"tkey":"topic","topt":"=","tval":["events"]},{"tkey":"cluster","topt":"=","tval":["内网集群"]}]}

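In the data above, the filter {"tkey":"cluster","topt":"=","tval":["内网集群"]} sets the cluster tag value to 内网集群 ("intranet cluster").

The relationship between TAGTIC_KAFKA and 内网集群 in KM is shown here:

[screenshot]

The Nightingale UI shows the following:

[screenshot]

[screenshot]

From this analysis, the alert cannot fire because KM uses the cluster name as the cluster tag when it creates the alert rule, but uses the cluster identifier as the cluster tag when it uploads metrics to Nightingale. The alerting system therefore filters on a cluster tag value that no metric carries, so the rule never triggers even when its condition is met.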
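The mismatch can be reproduced with a small sketch. It assumes that a tag filter with `topt` of `=` matches when the metric's tag value is one of the values in `tval`; this is an assumption about Nightingale's filtering semantics, not its actual implementation, and `rule_matches` is a hypothetical helper.

```python
def rule_matches(rule_tags, metric_tags):
    """Return True if every '=' filter in rule_tags accepts metric_tags.

    Assumed semantics (not taken from Nightingale's source): a filter
    matches when the metric's value for tkey is listed in tval.
    """
    for tag_filter in rule_tags:
        if tag_filter["topt"] != "=":
            raise ValueError("only '=' filters are handled in this sketch")
        if metric_tags.get(tag_filter["tkey"]) not in tag_filter["tval"]:
            return False
    return True


# Tags of one uploaded metric point, as seen in the sinkMetrics payload.
metric_tags = {"cluster": "TAGTIC_KAFKA", "topic": "events"}

# Tag filters as sent by createStrategy: the cluster filter holds the name.
rule_tags_as_sent = [
    {"tkey": "topic", "topt": "=", "tval": ["events"]},
    {"tkey": "cluster", "topt": "=", "tval": ["内网集群"]},
]

# The same filters with the cluster identifier instead of the name.
rule_tags_with_identifier = [
    {"tkey": "topic", "topt": "=", "tval": ["events"]},
    {"tkey": "cluster", "topt": "=", "tval": ["TAGTIC_KAFKA"]},
]

matches_as_sent = rule_matches(rule_tags_as_sent, metric_tags)
matches_with_identifier = rule_matches(rule_tags_with_identifier, metric_tags)
print(matches_as_sent, matches_with_identifier)
```

Under that assumption, the rule as sent selects no data points, while the same rule keyed on the identifier selects the uploaded metric.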

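Temporary workaround

On the Nightingale monitoring page, change the cluster filter tag from the value shown in the screenshots above to the following:

[screenshot]

[screenshot]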
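The workaround above is applied by hand in the Nightingale UI. Purely as an illustration of the same change on the strategy payload, the sketch below replaces a cluster name in the `cluster` tag filter with its identifier. The mapping `NAME_TO_IDENTIFIER` and the function `rewrite_cluster_tag` are hypothetical, and the strategy dict is an abbreviated copy of the createStrategy request quoted earlier.

```python
import copy

# Hypothetical name-to-identifier mapping, using the values from this issue.
NAME_TO_IDENTIFIER = {"内网集群": "TAGTIC_KAFKA"}


def rewrite_cluster_tag(strategy, name_to_identifier):
    """Return a copy of strategy whose cluster filter uses identifiers.

    Values without an entry in the mapping are left unchanged.
    """
    fixed = copy.deepcopy(strategy)
    for tag_filter in fixed.get("tags", []):
        if tag_filter["tkey"] == "cluster":
            tag_filter["tval"] = [
                name_to_identifier.get(value, value)
                for value in tag_filter["tval"]
            ]
    return fixed


# Abbreviated copy of the strategy payload (only the fields used here).
strategy = {
    "name": "测试-A",
    "nid": 1,
    "tags": [
        {"tkey": "topic", "topt": "=", "tval": ["events"]},
        {"tkey": "cluster", "topt": "=", "tval": ["内网集群"]},
    ],
}

fixed_strategy = rewrite_cluster_tag(strategy, NAME_TO_IDENTIFIER)
print(fixed_strategy["tags"])
```

The function works on a deep copy, so the original payload stays available for comparison.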

ZQKC commented 3 years ago

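Which version are you using? Older versions created machine-related strategies when creating a monitoring strategy, whereas the strategy should be machine-independent. That is worth checking.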

PengShuaixin commented 3 years ago

KM is version 2.4.2, built and packaged from the latest code on the master branch. Nightingale is version 4.0.1.

PengShuaixin commented 3 years ago

After creating the rule on the KM page, the alert rule is displayed as follows:

[screenshot]

[screenshot]

ranqiqiang commented 3 years ago

Does this tag-based monitoring require integrating with Nightingale?

ZQKC commented 2 years ago

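You could check the metrics on the Nightingale platform again to see whether the trigger condition is actually met. If it is met but the alert still does not fire, you will probably need to ask the Nightingale maintainers. For details, feel free to join the group (listed in the README) to discuss.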

ZQKC commented 2 years ago

No further feedback, so closing this issue. You are also welcome to join the group to discuss.