aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0
6.62k stars 922 forks

Metric karpenter_interruption_received_messages(message_type="SpotInterruptionKind") is not accurate #6531

Open hitsub2 opened 2 months ago

hitsub2 commented 2 months ago

Description

Observed Behavior: By default, Karpenter receives every Spot interruption in the account and filters them in the controller logic. We have instead set up a Lambda function that filters the interruption messages and sends each one to the related SQS queue, so Karpenter only receives the Spot interruptions that belong to it.
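For readers unfamiliar with this setup, the Lambda filter described above could look roughly like the following. This is a minimal sketch, not the reporter's actual function: the names `should_forward` and `filter_events` and the `cluster_instance_ids` set are hypothetical, and how the set of instance IDs is obtained and how the message is sent to SQS are left out. Only the event layout (`source`, `detail-type`, `detail.instance-id`) follows the EventBridge Spot interruption warning format.

```python
from typing import Any, Dict, Iterable, List, Set

# Detail type used by EventBridge for Spot interruption warnings.
SPOT_INTERRUPTION_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def should_forward(event: Dict[str, Any], cluster_instance_ids: Set[str]) -> bool:
    """Return True if the event is a Spot interruption for one of our instances.

    `cluster_instance_ids` is a hypothetical input: the set of instance IDs
    that belong to the cluster whose SQS queue would receive the event.
    """
    if event.get("source") != "aws.ec2":
        return False
    if event.get("detail-type") != SPOT_INTERRUPTION_DETAIL_TYPE:
        return False
    instance_id = event.get("detail", {}).get("instance-id")
    return instance_id in cluster_instance_ids


def filter_events(
    events: Iterable[Dict[str, Any]], cluster_instance_ids: Set[str]
) -> List[Dict[str, Any]]:
    """Keep only the events that should be forwarded to the cluster's queue."""
    return [e for e in events if should_forward(e, cluster_instance_ids)]


if __name__ == "__main__":
    # Illustrative events; the instance IDs are made up.
    ours = {
        "source": "aws.ec2",
        "detail-type": SPOT_INTERRUPTION_DETAIL_TYPE,
        "detail": {"instance-id": "i-0aaa", "instance-action": "terminate"},
    }
    theirs = {
        "source": "aws.ec2",
        "detail-type": SPOT_INTERRUPTION_DETAIL_TYPE,
        "detail": {"instance-id": "i-0bbb", "instance-action": "terminate"},
    }
    forwarded = filter_events([ours, theirs], {"i-0aaa"})
    print(len(forwarded))
```

With a filter of this kind, each interruption that reaches Karpenter's queue should correspond to exactly one instance managed by that cluster, which is why a per-interruption increment of 2 stands out.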

When handling a Spot interruption, Karpenter increments the metric karpenter_interruption_received_messages(message_type="SpotInterruptionKind") by 2 for every single Spot interruption. For example, after one Spot interruption, the metric value is 2.

Expected Behavior:

karpenter_interruption_received_messages(message_type="SpotInterruptionKind") = the number of Spot instances that were interrupted
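One way to check the expected behavior against the observed one is to read the counter before and after a known number of interruptions and compute the increment per interruption. The sketch below assumes the metric appears in Prometheus text exposition format under the name used in this issue; the helper names `read_counter` and `increment_per_interruption` are hypothetical, the sample values are made up for illustration, and the label matching is deliberately simplistic (it does not handle `}` inside label values).

```python
import re
from typing import Optional

# Metric name as written in this issue; the exported name may differ.
METRIC = "karpenter_interruption_received_messages"

# Matches: name{labels} value [optional timestamp]
_LINE = re.compile(r"^" + re.escape(METRIC) + r"\{([^}]*)\}\s+(\S+)(?:\s+\d+)?\s*$")


def read_counter(exposition_text: str, message_type: str) -> Optional[float]:
    """Return the counter value for the given message_type, or None if absent."""
    wanted = f'message_type="{message_type}"'
    for line in exposition_text.splitlines():
        match = _LINE.match(line.strip())
        if match and wanted in match.group(1):
            return float(match.group(2))
    return None


def increment_per_interruption(before: float, after: float, interruptions: int) -> float:
    """Average counter increase per interruption over the observed window."""
    if interruptions <= 0:
        raise ValueError("interruptions must be a positive integer")
    return (after - before) / interruptions


if __name__ == "__main__":
    # Made-up scrapes taken before and after a single Spot interruption.
    before_text = 'karpenter_interruption_received_messages{message_type="SpotInterruptionKind"} 4'
    after_text = 'karpenter_interruption_received_messages{message_type="SpotInterruptionKind"} 6'
    before = read_counter(before_text, "SpotInterruptionKind")
    after = read_counter(after_text, "SpotInterruptionKind")
    # Expected behavior would give 1.0; the behavior reported here gives 2.0.
    print(increment_per_interruption(before, after, interruptions=1))
```

A result of 1.0 matches the expected behavior, while 2.0 matches the double counting reported here.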

Reproduction Steps (Please include YAML):

Versions:

jigisha620 commented 1 month ago

I have been trying to reproduce the issue, but I haven't seen this behavior occur even once in my testing. Do you have Karpenter controller logs from when this happened? Specifically, I'm looking for the log line with the message "initiating delete from interruption message".

hitsub2 commented 1 month ago

karpenter (4).log

It is very easy to reproduce this issue. I just tested it with FIS, and the metric always increases by 2 for every single Spot interruption.

jigisha620 commented 1 month ago

I spent more time trying to reproduce this issue using FIS, just like you mentioned. Every time, I got only one event. These are the metrics from Prometheus:

Screenshot 2024-08-12 at 11 49 22 AM

And here's the Grafana dashboard:

Screenshot 2024-08-12 at 11 49 46 AM