MillisBehindLatest metric across _all_ shards

usrenmae commented 6 years ago

Currently several metrics, including MillisBehindLatest, are reported to CloudWatch with the shard ID as one of the dimensions. We find it very convenient to set CloudWatch alarms on this metric so that we can react if any shard starts to lag behind. However, it is not possible to set up an alarm without specifying the exact name of the shard. This is a limiting factor: when shards are added and removed constantly, the shard names are very dynamic, and each time they change the alarms have to be changed accordingly, which is frustrating. Since one generally wants to react to any shard lagging behind, it would be very nice to have a global MillisBehindLatest that has no shard in its dimensions. This could be the maximum across all shards, e.g. MaxMillisBehindLatest.

sahilpalvia commented 6 years ago

Kinesis does emit a stream-level metric for iterator age, called GetRecords.IteratorAgeMilliseconds. You should be able to set up an alarm on that metric, which can be found under the Kinesis namespace in CloudWatch. If you set the statistic for that metric to Maximum, it will reflect the maximum millisBehindLatest across all the shards for the given period. Please feel free to reopen the issue if you still have questions.
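As a rough illustration of the suggestion above, such an alarm could be defined with boto3 along the following lines. This is only a sketch: the stream name, alarm name, threshold, and evaluation settings are placeholder assumptions, not values taken from this thread.

```python
def stream_iterator_age_alarm(stream_name, threshold_ms, period_s=60):
    """Build put_metric_alarm arguments for the stream-level iterator age."""
    return {
        # Hypothetical naming scheme, chosen for this example only.
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        # Maximum over the period, as recommended in the comment above.
        "Statistic": "Maximum",
        "Period": period_s,
        "EvaluationPeriods": 5,  # assumed value, tune as needed
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(params):
    """Create the alarm in CloudWatch (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Calling `create_alarm(stream_iterator_age_alarm("my-stream", 60000))` would then alarm when the oldest record read from the stream is more than one minute old. Note that this covers the whole stream, not an individual consumer.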

usrenmae commented 6 years ago

Thanks for pointing out the GetRecords.IteratorAgeMilliseconds metric; I wasn't aware of it. After a closer look, I found that it is a global per-stream metric of the Kinesis service. What I'm interested in is a per-consumer metric. We have multiple consumers running on the same stream; some of them may keep up with the event feed perfectly, while others may lag. My idea was to have a metric that tells you which particular consumer is lagging behind. It is not possible to get this information from the GetRecords.IteratorAgeMilliseconds metric of the Kinesis stream itself, but KCL could provide such a metric in the same way it provides MillisBehindLatest, just without the shardId dimension. It is actually not convenient at all to build automation around shard-specific metrics: shards are very dynamic and may change over time, and it is not possible to have an alarm on a metric with dimensions without specifying the dimension value. Monitoring built on a per-consumer basis is much more useful: one can set up permanent alarms on it, and only in case of an incident trace the problem back to a particular shard using the existing shard-specific metrics. Please reopen the issue as suggested above.
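Until KCL offers something like this, an application could in principle publish the aggregate itself. The following is a minimal sketch of that idea, not something KCL does today: it assumes the application already tracks the latest MillisBehindLatest value per shard (for example from its GetRecords responses), and the `Consumer` dimension name, the namespace, and both function names are hypothetical.

```python
def max_millis_behind_datum(millis_behind_by_shard, consumer_name):
    """Collapse per-shard lag into one datum without a shard dimension.

    millis_behind_by_shard maps shard ID -> latest MillisBehindLatest value.
    Returns None when no shard has reported yet.
    """
    if not millis_behind_by_shard:
        return None
    return {
        "MetricName": "MaxMillisBehindLatest",
        # One dimension per consumer application; "Consumer" is a made-up name.
        "Dimensions": [{"Name": "Consumer", "Value": consumer_name}],
        "Value": float(max(millis_behind_by_shard.values())),
        "Unit": "Milliseconds",
    }


def publish(datum, namespace):
    """Send the datum to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=[datum]
    )
```

Because the datum carries only the consumer name, a single permanent alarm per consumer would keep working while shards are split and merged; the shard-specific metrics remain available for tracing an incident back to a shard.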

sahilpalvia commented 6 years ago

Thank you for the feedback. We agree with the change you have suggested, and will prioritize it accordingly against the other customer requests we receive.

StevenYCChou commented 6 years ago

@sahilpalvia I have the same use case: we want to scale up/down based on how fast the KCL application consumes. This metric would be helpful.

ghost commented 6 years ago

We have a similar use case and would like this metric as well. We have two KCL consumers on the same Kinesis stream. One has a low latency threshold, while the other has a much higher one.

We've set the alarm at the lower threshold on the stream, but it fires once or twice a day because of the higher-latency KCL consumer. We have to treat it as an alarm situation each time, which obviously wastes a lot of time.

We've considered using the shard-level metric; however, since we are at the limit of maximum allowed alarms and have a 60-shard stream, that is currently not possible.

akumariiit commented 5 years ago

@sahilpalvia we also have the exact same use case. Can you provide any update on this?

pfifer commented 5 years ago

We don't have an update at this time. This is a feature we are interested in adding, and we will prioritize it along with all customer requests.

If you are interested, please post a reaction on the parent post. This will assist us in prioritizing customer requests.

waffleshop commented 5 years ago

+1

vinujan59 commented 5 years ago

+1

vik7 commented 5 years ago

+1

akumariiit commented 5 years ago

+1

rkass commented 5 years ago

+1

winty56 commented 5 years ago

+1 We have more than 500 shards in Kinesis and more than 4 KCL applications using the same stream. In the AWS CloudWatch console we cannot search all shards, because the console search result limit is 500, so we do not use the KCL metrics. In addition, the number of metrics we can graph at one time in the console is limited to 100. This feature is essential for me to check the lag of each KCL application.

kaisermario commented 3 years ago

+1

kaisermario commented 3 years ago

@pfifer Any update?

MeisterMasi commented 3 years ago

+1

CCBow-501 commented 3 years ago

+1

yasemin-amzn commented 3 years ago

Hello,

There are service-side metrics emitted for monitoring how far behind a stream's consumers are. For consumers using GetRecords, the "GetRecords.IteratorAgeMilliseconds" metric is emitted, and all consumer applications contribute to it. Consumer applications using enhanced fan-out emit the "SubscribeToShardEvent.MillisBehindLatest" metric along with the consumer name, so the status of each consumer can be monitored individually.

Consider using these metrics as an alternative to client-side metrics for monitoring application health.

For more details please refer to: https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html
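For enhanced fan-out consumers, separate alarms with separate thresholds could be sketched roughly as follows. This assumes the metric is published in the AWS/Kinesis namespace with StreamName and ConsumerName dimensions, which should be checked against the documentation linked above; the consumer names, thresholds, alarm naming, and evaluation settings are made up for the example.

```python
def per_consumer_alarms(stream_name, thresholds_ms, period_s=60):
    """Build one put_metric_alarm argument set per enhanced fan-out consumer.

    thresholds_ms maps consumer name -> allowed lag in milliseconds.
    """
    alarms = []
    for consumer_name, threshold_ms in sorted(thresholds_ms.items()):
        alarms.append({
            # Hypothetical naming scheme, chosen for this example only.
            "AlarmName": f"{stream_name}-{consumer_name}-lag-high",
            "Namespace": "AWS/Kinesis",
            "MetricName": "SubscribeToShardEvent.MillisBehindLatest",
            # Assumed dimension names; verify against the CloudWatch docs.
            "Dimensions": [
                {"Name": "StreamName", "Value": stream_name},
                {"Name": "ConsumerName", "Value": consumer_name},
            ],
            "Statistic": "Maximum",
            "Period": period_s,
            "EvaluationPeriods": 5,  # assumed value, tune as needed
            "Threshold": float(threshold_ms),
            "ComparisonOperator": "GreaterThanThreshold",
        })
    return alarms
```

Each returned dictionary can be passed to boto3's `put_metric_alarm(**params)` on a CloudWatch client. This only helps consumers registered for enhanced fan-out; consumers polling with GetRecords still share the single stream-level metric.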

kaisermario commented 3 years ago

Hello @yasemin-amzn, "SubscribeToShardEvent.MillisBehindLatest" is a basic (stream-level) metric according to: https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html

Stream-level data is sent automatically every minute at no charge.

Unfortunately we can't see this metric in our account.

leifbladt commented 3 years ago

+1

QwertV2 commented 2 years ago

+1

awslabs / amazon-kinesis-client

MillisBehindLatest metric across _all_ shards #249