danielqsj / kafka_exporter

Kafka exporter for Prometheus
Apache License 2.0
2.09k stars 602 forks

Share my prometheus alert rules (based on Kafka exporter metrics) #413

Closed shengbinxu closed 7 months ago

shengbinxu commented 7 months ago

1. Kafka instance down

- alert: aliyun_kafka_down
  expr: kafka_brokers < 3
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Kafka is down"
    description: "Kafka currently has {{$value}} brokers, fewer than 3; please investigate."

3. A topic suddenly stops receiving data

- alert: kafka_produce_exception
  expr: sum(rate(kafka_topic_partition_current_offset[10m] offset 1d)) by (job, topic) > 1 and sum(rate(kafka_topic_partition_current_offset[10m])) by (job, topic) == 0
  for: 60s
  labels:
    severity: critical
  annotations:
    summary: "Kafka produce anomaly"
    description: "Cluster {{$labels.job}}, topic {{$labels.topic}}: the write rate over the last 10 minutes is 0, while the write rate at the same time yesterday was {{$value}}; please investigate."

4. Instance write exception

- alert: kafka_no_new_message
  expr: sum(rate(kafka_topic_partition_current_offset[1m])) by (instance) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Kafka instance is receiving no new data"
    description: "No new messages on {{$labels.instance}}; please investigate."

5. Kafka consumer suddenly slows down

- alert: kafka_consumer_slow
  expr: ((sum by (job, consumergroup, topic, partition) (delta(kafka_consumergroup_current_offset[5m]) / 5)) / (sum by (job, consumergroup, topic, partition) (delta(kafka_consumergroup_current_offset[5m] offset 1d) / 5) > 0) < 0.3) and (sum by (job, consumergroup, topic, partition) (delta(kafka_consumergroup_current_offset[5m]) / 5) > 0) and (sum by (job, consumergroup, topic, partition) (kafka_consumergroup_lag) > 10) and (sum by (job, consumergroup, topic, partition) (kafka_consumergroup_lag) / sum by (job, consumergroup, topic, partition) (kafka_consumergroup_lag offset 1d) > 2)
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Kafka consumption has slowed down"
    description: "Cluster {{$labels.job}}, consumer group {{$labels.consumergroup}}, topic {{$labels.topic}}, partition {{$labels.partition}}: the consumption rate is {{$value}} times yesterday's, and lag has more than doubled compared with yesterday."
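
The rules above are bare list items, so they need to sit under a group's `rules:` key before Prometheus will load them. A minimal sketch of the wrapper, assuming a hypothetical file name `kafka_alerts.yml` and a hypothetical group name `kafka_exporter_alerts` (neither comes from the original post), with only the first rule shown:

```yaml
# kafka_alerts.yml (hypothetical file name)
groups:
  - name: kafka_exporter_alerts   # hypothetical group name
    rules:
      # Rule 1 from this post; the other rules follow as further list items
      # at the same indentation.
      - alert: aliyun_kafka_down
        expr: kafka_brokers < 3
        for: 10s
        labels:
          severity: critical
```

Running `promtool check rules kafka_alerts.yml` should catch indentation and PromQL syntax mistakes before the file is deployed.
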
shengbinxu commented 7 months ago

Additionally, scrape_timeout must be set high enough (the Prometheus default is 10s); see the linked discussion for the specific reasons.

- job_name: 'kafka'
  scrape_interval: 2m
  scrape_timeout: 2m
  static_configs:
    - targets: ['10.0.0.161:9308']
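
For context, a sketch of where this job sits in `prometheus.yml`, assuming the alert rules are stored in a hypothetical file called `kafka_alerts.yml` (the file name is an assumption, not something from this thread):

```yaml
# prometheus.yml (sketch)
rule_files:
  - kafka_alerts.yml              # hypothetical rule file holding the alerts

scrape_configs:
  - job_name: 'kafka'
    scrape_interval: 2m
    scrape_timeout: 2m            # must not exceed scrape_interval
    static_configs:
      - targets: ['10.0.0.161:9308']
```

One thing to watch with a 2m scrape interval: `rate()` and `delta()` need at least two samples inside the range window, so a window such as `[1m]` would likely return no result. Ranges that span at least two scrapes (for example `[5m]` or `[10m]`) are the safer choice here.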