Kafka Low Level Consumer (LLC) Offset Lag metrics

apache / pinot

Apache Pinot - A realtime distributed OLAP datastore

https://pinot.apache.org/

Apache License 2.0

5.39k stars 1.26k forks source link

Kafka Low Level Consumer (LLC) Offset Lag metrics #4172

Open chenboat opened 5 years ago

chenboat commented 5 years ago

What is the best metrics to monitoring LLC consumer offset lag for Kafka streams? I checked that there are two fields in [ServerGauge] about KAFKA_PARTITION_OFFSET_LAG and STREAM_PARTITION_OFFSET_LAG but they are not populated.

https://github.com/apache/incubator-pinot/blob/71a63c8fcf927fcb335f3ebc1988434eda0b8a38/pinot-common/src/main/java/org/apache/pinot/common/metrics/ServerGauge.java#L41

@npawar @mcvsubbu @Jackie-Jiang

haibow commented 5 years ago

Bumping this. In HLC we've been using kafka.consumer.ConsumerFetcherManager.MaxLag which seems absent for LLC.

snleee commented 5 years ago

We have added a freshness metric recently. I think that it's emitted per query request and indicates the min(freshness timestamp among segments processed)

I think that it's slightly different from what you have measured so far with HLC though (lag directly from kafka).

You can refer to the following issue: https://github.com/apache/incubator-pinot/issues/4007

@sunithabeeram Can you chime in?

snleee commented 5 years ago

Design doc: https://cwiki.apache.org/confluence/display/PINOT/Pinot+Freshness+Metric

haibow commented 5 years ago

Thanks @snleee for the info! Looks like this freshness metric would help. We'll try to pull in that change and test around.

sunithabeeram commented 5 years ago

@haibow the above design will work as long as kafka broker can provide "timestamp" per message. Specifically, at LinkedIn we rely on an API similar to this: https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/ConsumerRecord.html#timestamp(); this is not available in 0.9 clients, but provided by the internal version of kafka.

haibow commented 5 years ago

@sunithabeeram thanks for the clarification. Will look into this.

w.r.t. freshness, how would you handle topics with low traffic rate (e.g. 1 message per 10s minutes, or just no data for a while)? Now we have alerts set up on ingestion lags; if we were to alert on this 'freshness' metric, sounds like false alarms are possible if freshness thresholds are not carefully tuned.

sunithabeeram commented 5 years ago

@haibow - would like to hear how the lag based alert is tuned on your end. Do you alert on non-zero value?

As noted in the design doc, I believe the kafka offset based lag has the issue with tuning as well; unless we always get an offset lag of 0, any other number needs to be interpreted based on the events per second expected on the topic.

For alerting based on the freshness metric, the SLA should be based on the upstream characteristics and we expect this to be order of minutes (under a minute would make it quite noisy); For ex, in some cases, we have an upstream samza job that runs hourly, so freshness can be quantified only up to that delay.

haibow commented 5 years ago

@sunithabeeram you are right - lag has the issues described in the design. On double check, we are just 'monitoring' but not 'alerting' on lag.

Originally I was wondering how to distinguish expected data staleness (e.g. stopped/windowed ingestion), but using dependent alerts on upstream traffic should help, i.e. if data is stale for X minutes, but input traffic was also flat for last X minutes, then the alert should be suppressed.