aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

ThrottlingException: discovering amis from ssm, getting ssm parameter #5907

Open den-is opened 7 months ago

den-is commented 7 months ago

Description

What problem are you trying to solve? One of my clusters has a few hundred nodes and several thousand pods. It autoscales to a few thousand nodes and many more pods.

I use amiFamily:AL2 instead of specific amiSelectorTerms: []

With just ~400 nodes, I started to see hundreds of messages like this in the Karpenter logs:

discovering amis from ssm, getting ssm parameter "/aws/service/eks/optimized-ami/1.23/amazon-linux-2-gpu/recommended/image_id", ThrottlingException: Rate exceeded
    status code: 400, request id: xxxxx

Also:

Reconciler error
getting drift, calculating ami drift, no amis exist given constraints

How important is this feature to you? I will try to fix the issue by requesting a higher rate limit. But for which SSM service, and what is the name of this limit? At the time of writing, I was not able to identify which SSM service limit corresponds to the above issue.

Feature request: Is it possible to add simple counter metrics for how many requests Karpenter makes to SSM Parameter Store?
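A minimal sketch of the kind of counter this request describes, assuming a hypothetical `fetch` callable that stands in for the SSM get-parameter call. `CountingFetcher`, `ThrottledError`, and `make_fake_fetch` are illustrative names, not part of Karpenter or any AWS SDK:

```python
from collections import Counter
from typing import Callable


class ThrottledError(Exception):
    """Hypothetical stand-in for an SSM ThrottlingException."""


class CountingFetcher:
    """Wraps a parameter-fetching callable and counts calls by outcome."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self.counts: Counter = Counter()

    def get(self, name: str) -> str:
        self.counts["requests"] += 1
        try:
            value = self._fetch(name)
        except ThrottledError:
            # Count the throttled call, then let the caller handle it.
            self.counts["throttled"] += 1
            raise
        self.counts["succeeded"] += 1
        return value


def make_fake_fetch(limit: int) -> Callable[[str], str]:
    """Illustrative fetcher that starts throttling after `limit` calls."""
    calls = 0

    def fetch(name: str) -> str:
        nonlocal calls
        calls += 1
        if calls > limit:
            raise ThrottledError("Rate exceeded")
        return f"value-for-{name}"

    return fetch


fetcher = CountingFetcher(make_fake_fetch(limit=3))
for _ in range(5):
    try:
        fetcher.get("example-parameter")
    except ThrottledError:
        pass

print(dict(fetcher.counts))
```

In a real controller the counts would presumably be exported as metrics rather than printed; the point here is only that counting at the call site separates total requests from throttled ones.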


njtran commented 7 months ago

How many EC2NodeClasses do you have? The API is aws ssm get-parameter.

den-is commented 7 months ago

> How many EC2NodeClasses do you have? The API is aws ssm get-parameter.

I understand that it is aws ssm get-parameter - I meant that I was looking for the name of this resource limit in the resource limits dashboard, to understand what to increase.

Just 2 simple EC2NodeClasses - for AL2 and AL2023

njtran commented 7 months ago

how often do the logs fire? can you post them here?

den-is commented 7 months ago

> how often do the logs fire? can you post them here?

@njtran I can't post them here. I have DEBUG logs enabled, but there is nothing in the logs except the text I posted above (I'm excluding the usual info messages). The error-level Reconciler error happens hundreds of times per hour. discovering amis from ssm happens much less often, but still a dozen times per hour.

That was happening intensively for more than 12 hours, during peak hours, on a non-prod server (but in ACC, which may have had some other apps/tests running).

Including a screenshot with funny numbers:

[image]

njtran commented 7 months ago

IIUC, reconciler error could come from any of the controllers.

If you don't want to post the logs here, but you're willing to open an AWS Support ticket with the info or message me on the Kubernetes Slack, I'm happy to look at the logs there.

I'm really more curious about how often the discovering amis from ssm error occurs, so I can gauge whether we're doing more SSM lookups than expected.

den-is commented 7 months ago

[image]

njtran commented 7 months ago

So it looks like 4 logs every 10 minutes?
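
As a quick sanity check on that estimate, the arithmetic below converts 4 log lines per 10 minutes into an hourly rate. The helper name `per_hour` is illustrative only:

```python
def per_hour(events: int, window_minutes: float) -> float:
    """Scale an event count observed over a window to events per hour."""
    return events * 60 / window_minutes


rate = per_hour(4, 10)
print(rate)  # 24.0
```

That works out to roughly double the "dozen times per hour" estimated earlier in the thread, assuming the 10-minute window is representative.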