aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Message "slow event handlers blocking the queue" #3773

Open runningman84 opened 1 year ago

runningman84 commented 1 year ago

Version

Karpenter Version: v0.27.2

Kubernetes Version: v1.24.10

Expected Behavior

na

Actual Behavior

We see a lot of messages like this: "DeltaFIFO Pop Process" ID:header-service/header-service-7c6995b78b-5zmnp,Depth:14,Reason:slow event handlers blocking the queue (18-Apr-2023 10:15:25.363). What should we do to fix this problem? Is this a bug?

Steps to Reproduce the Problem

Install karpenter in a cluster

Resource Specs and Logs

2023-04-18T10:19:52.170Z    INFO    controller.deprovisioning    deprovisioning via consolidation delete, terminating 1 machines ip-100-65-111-115.eu-central-1.compute.internal/c5d.xlarge/spot    {"commit": "d01ea11-dirty"}
2023-04-18T10:19:52.365Z    INFO    controller.termination    cordoned node    {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
2023-04-18T10:19:55.415Z    INFO    controller.termination    deleted node    {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
I0418 10:20:12.465949       1 trace.go:219] Trace[231923973]: "DeltaFIFO Pop Process" ID:infra-ingress-auth-service/ingress-auth-service-59d4d96dbb-lsxxc,Depth:16,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.364) (total time: 101ms):
Trace[231923973]: [101.12991ms] [101.12991ms] END
I0418 10:20:12.764013       1 trace.go:219] Trace[922268623]: "DeltaFIFO Pop Process" ID:xxx-cms-dev/discover-frontend-public-7886477494-n6sxs,Depth:18,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.578) (total time: 185ms):
Trace[922268623]: [185.782794ms] [185.782794ms] END
I0418 10:20:13.165725       1 trace.go:219] Trace[474204851]: "DeltaFIFO Pop Process" ID:gitlab-ci-template-fix-tags/gitlab-ci-template-5655f4df7-2b5dm,Depth:17,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.764) (total time: 401ms):
Trace[474204851]: [401.631907ms] [401.631907ms] END
2023-04-18T10:20:21.669Z    INFO    controller.deprovisioning    deprovisioning via consolidation delete, terminating 1 machines ip-100-65-214-196.eu-central-1.compute.internal/c5d.xlarge/spot    {"commit": "d01ea11-dirty"}
2023-04-18T10:20:21.773Z    INFO    controller.termination    cordoned node    {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}
2023-04-18T10:20:24.064Z    INFO    controller.termination    deleted node    {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}


jonathan-innis commented 1 year ago

What scale are you running the cluster at? Also, what are the resources for the karpenter deployment?

runningman84 commented 1 year ago

This is a medium size cluster:

$ kubectl get nodes | wc -l
39
$ kubectl get pods --all-namespaces | wc -l
731

Karpenter is running on Fargate with these settings:

Limits:
  cpu:     200m
  memory:  768Mi
Requests:
  cpu:     200m
  memory:  768Mi

It normally needs about 100m cpu and about 400MB memory.

runningman84 commented 1 year ago

We just increased the cpu requests to 500m and removed the cpu limits. The message still appears from time to time...

engedaam commented 1 year ago

Karpenter is recommended to run with 1 CPU and 1Gi of memory for both the requests and limits.

FernandoMiguel commented 1 year ago

Karpenter is recommended to run with 1 CPU and 1Gi of memory for both the requests and limits.

Not sure where you got that information; any suggested value will always depend on the workload.

Kubernetes best practices also recommend against CPU limits.

engedaam commented 1 year ago

@FernandoMiguel You make a good point. I meant that, by default, Karpenter sets the CPU request to 1, as seen in the helm command in the getting started guide.

FernandoMiguel commented 1 year ago

That's outdated. If you look into the values.yaml those are now commented out.

jonathan-innis commented 1 year ago

If you look into the values.yaml those are now commented out.

You're right, @FernandoMiguel. We've yet to post general guidance on pod count/node count and how that relates to CPU/memory usage. This is on our bucket list of things to do, but up to this point the CPU/memory requests and limits that users set have been very user-specific and anecdotal.

FernandoMiguel commented 1 year ago

Yep. Everyone needs to monitor their workloads and adjust resources based on what they are consuming. And don't set CPU limits!

engedaam commented 1 year ago

@runningman84 are you still experiencing this issue?

runningman84 commented 1 year ago

Yes

jonathan-innis commented 1 year ago

Can you check your rate(container_cpu_usage_seconds_total[5m]) metrics during the times when you see this message, as well as rate(container_cpu_cfs_throttled_seconds_total[5m]), to see whether you are getting throttled and whether that is causing the "slow event handlers blocking the queue" messages? Generally, looking at other issues, this seems to be a resource-request problem: the process needs more CPU than it has, so it gets throttled, which makes pulling from the client-go FIFO queue slower than expected.
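
If it helps, here is a small sketch of pulling those two queries with the Prometheus Go client; the Prometheus address and the pod label selector are assumptions you would swap for your own setup.

package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Assumed Prometheus address; point this at your own instance.
    client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    promAPI := promv1.NewAPI(client)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // The pod=~ selector is an assumption; adjust it to match your Karpenter pods.
    queries := []string{
        // CPU the Karpenter pods are actually consuming.
        `rate(container_cpu_usage_seconds_total{pod=~"karpenter-.*"}[5m])`,
        // Time spent CFS-throttled; anything persistently above zero means the
        // process is being starved of the CPU it is asking for.
        `rate(container_cpu_cfs_throttled_seconds_total{pod=~"karpenter-.*"}[5m])`,
    }
    for _, q := range queries {
        result, warnings, err := promAPI.Query(ctx, q, time.Now())
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            continue
        }
        if len(warnings) > 0 {
            fmt.Fprintln(os.Stderr, "warnings:", warnings)
        }
        fmt.Printf("%s\n%v\n\n", q, result)
    }
}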

jonathan-innis commented 1 year ago

A quick overview of what I believe is happening here, and then maybe we can theorize about the best ways to fix this problem:

  1. The client-go informer mechanism uses a FIFO queue (the DeltaFIFO) to pull updates from the watch, update the client-go cache, and call out to any handlers that are registered for changes on that resource.
  2. If calling out to the handlers takes too long, the informer logs that it took too long to run the handlers as it's popping items off the FIFO queue.
  3. The handlers registered by controller-runtime are the ones declared through the Builder construct. These registrations include the For() registration (for controllers that watch and react to the resource they reconcile) and handler.EnqueueRequestsFromMapFunc, which registers a handler on an informer and maps each event to a specific request on the controller-runtime workqueue (see the sketch after this list).
  4. These handlers are attached to the SharedIndexInformer, which means that for any given Kubernetes resource (e.g. Pod, Deployment) there is one informer, one cache, and one set of handlers registered against it.
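
To make (3) concrete, here is a minimal controller-runtime sketch, not Karpenter's actual code, of how both kinds of registration end up as handlers on the shared informers. The signatures assume a recent controller-runtime release (older versions take a source.Kind and a map function without a context argument), and the NodeReconciler is a hypothetical example.

package controllers

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// NodeReconciler is a hypothetical reconciler used only to illustrate the registrations.
type NodeReconciler struct {
    client.Client
}

func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Reconcile the Node named in req (omitted).
    return ctrl.Result{}, nil
}

func (r *NodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        // For(): the controller watches and reacts to the resource it reconciles.
        // This registers add/update/delete handlers on the shared Node informer.
        For(&corev1.Node{}).
        // Watches() + EnqueueRequestsFromMapFunc: registers an additional handler on
        // the shared Pod informer; every Pod event is mapped to a Node request and
        // enqueued on this controller's workqueue. The map function runs as part of
        // the informer's event handling, so slow mapping (or CPU starvation) shows
        // up as slow event handling.
        Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
            func(ctx context.Context, o client.Object) []reconcile.Request {
                pod := o.(*corev1.Pod)
                if pod.Spec.NodeName == "" {
                    return nil
                }
                return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: pod.Spec.NodeName}}}
            })).
        Complete(r)
}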

What we believe is happening here is a combination of a large number of handlers registered against a given resource and handlers that call out to the client-go cache taking too long to respond. We need to dig deeper into how the number of handlers registered on the informer (and what the enqueue mechanism is doing) affects informer performance.
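
For reference, here is a stand-alone client-go sketch, again not Karpenter's code, of point 4 above: all handlers for a resource type share one informer, one cache, and one delivery path, and the trace in the logs above is emitted while an item is popped off that informer's DeltaFIFO.

package main

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load a kubeconfig from the default location; purely illustrative.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // One shared informer (and one cache) per resource type: every component that
    // watches Pods through this factory registers its handlers on the same informer.
    factory := informers.NewSharedInformerFactory(clientset, 0)
    podInformer := factory.Core().V1().Pods().Informer()

    // The "DeltaFIFO Pop Process" trace times the work done while one item is popped
    // off the informer's internal queue (updating the shared cache and fanning the
    // event out to registered handlers). client-go only prints the "slow event
    // handlers blocking the queue" line when the queue depth is above 10 and that
    // work takes longer than 100ms, which is what a CPU-starved controller produces.
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            // A Pod was added; a real handler would react to it here.
            _ = obj.(*corev1.Pod)
        },
        UpdateFunc: func(oldObj, newObj interface{}) { _ = newObj.(*corev1.Pod) },
        DeleteFunc: func(obj interface{}) {},
    })

    stop := make(chan struct{})
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    select {} // keep watching; shutdown handling omitted for brevity
}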

For now, we consider these log lines relatively benign; they should go away as you increase the number of CPU cycles allocated to the process.

nalshamaajc commented 1 year ago

I am seeing similar error logs:

We were recently load testing Karpenter, and one of the scenarios was scaling a deployment from 10 to 200 pods. After a while we started seeing the errors below in the Karpenter deployment logs. The deployment was stuck at around 150 ready pods for a long while and kept fluctuating until it finally reached the desired count after some compromises; meanwhile, nodes were being deprovisioned/provisioned very frequently.

The cluster we are testing on has 3 Karpenter-managed nodes on average, plus a managed node group of 4. We have around 185 pods; this is a development cluster, so that number should rise as we go. On other testing clusters we have around 576+.

We are using

I0502 17:56:45.489580       1 trace.go:219] Trace[1197403500]: "DeltaFIFO Pop Process" ID:cert-manager/cert-manager-csi-driver-nbbvx,Depth:14,Reason:slow event handlers blocking the queue (02-May-2023 17:56:45.341) (total time: 148ms):
Trace[1197403500]: [148.358607ms] [148.358607ms] END
ellistarn commented 1 year ago

@nalshamaajc, check out @jonathan-innis's comments above on CPU requests for Karpenter. Are you monitoring its CPU utilization?

nalshamaajc commented 1 year ago

@ellistarn Not yet. I was considering adding CPU after checking CPU utilization, if that is being collected by our monitoring platform. I understand you want to see whether CPU is reaching its threshold, to confirm the assumption above that CPU is the bottleneck, am I right?

jonathan-innis commented 1 year ago

Yep, that sounds correct. It would be good to see metrics on CPU throttling to see if that's what we are hitting here.

nalshamaajc commented 1 year ago

One thing I still wonder about: if we don't set resource requests and limits, the pods created by the deployment have no resource constraints. This means the pods have unrestricted access to CPU and memory on the nodes where they are scheduled.

In other words, why will CPU be throttled in my case if no limits are set in the first place?

nalshamaajc commented 1 year ago

FYI @ellistarn @jonathan-innis I reproduced the errors before setting any limits or requests on CPU and didn't notice any CPU throttling happening. There was definitely an increase in CPU utilization, but nothing crazy at the container level, as you can see below. These are two screenshots: one shows the metrics for both host and container, and the second shows the results for the container alone. The point where I'm showing the percentage in the graph should be the exact point at which the log was reported.

It is worth mentioning, though, that the host had high CPU utilization, reaching over 93%, around 10 minutes before the log was reported (last screenshot).

[screenshots: host + container CPU utilization, container-only CPU utilization, and host CPU utilization around 10 minutes before the log]

[EDIT]

PS:

During the test, it was discovered that the topologySpreadConstraints with hostname can create issues during scale-down, as the constraint compels Karpenter to provision more nodes to ensure that each pod is scheduled on its own node. The hostname constraint scheduled many nodes to satisfy its requirements. However, when the hostname constraint was removed, scaling in/out happened faster and without errors, mostly using the same node size as previous deployments with only five nodes (four of type c6a.4xlarge and one of type c6a.8xlarge), as opposed to around 64 nodes when the condition was set. Moreover, scaling down was significantly faster, and no error logs were generated. It appears that the constraint was exerting additional pressure on Karpenter to fulfill its requirements.

The log below was generated excessively when the constraint below was set:

{"level":"DEBUG","time":"2023-05-09T19:43:09.750Z","logger":"controller.deprovisioning","message":"relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {\"maxSkew\":1,\"topologyKey\":\"kubernetes.io/hostname\",\"whenUnsatisfiable\":\"ScheduleAnyway\",\"labelSelector\":{\"matchLabels\":{\"app\":\"pir-test\"}}}","commit":"7131be2-dirty","pod":"default/pir-test-5c4748cc6d-9zq8b"}
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname # makes the node a domain of itself
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: pir-test
ellistarn commented 1 year ago

Interesting data. I wouldn't be surprised if this was a noisy neighbor issue given the node was under contention. It's hard to know more without reproducing it myself. Will leave this open as some future performance work on our side.

nalshamaajc commented 1 year ago

Hello, I ran some tests with the conditions below.

I had the below outcomes:

In the second test I removed the hostname constraint and kept the AZ constraint; the rest of the conditions remained the same. I had the below outcomes:

The image below also shows the metric readings at the time the "slow event handlers blocking the queue" errors were being generated. You can find the deployment manifest I used in both tests below (the commented-out part was uncommented in the first test).

The last image shows the metrics at the time the error log was generated.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: pir-test
  name: pir-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: pir-test
  template:
    metadata:
      labels:
        app: pir-test
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule # doesn't match our current setup #ScheduleAnyway
        labelSelector:
          matchLabels:
            app: pir-test
#      - maxSkew: 1
#        topologyKey: kubernetes.io/hostname # makes the node a domain of itself
#        whenUnsatisfiable: ScheduleAnyway
#        labelSelector:
#          matchLabels:
#            app: pir-test
      containers:
      - image: public.ecr.aws/nginx/nginx
        name: nginx
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
      nodeSelector:
        karpenter.sh/provisioner-name: default

Below are the requirements in my only provisioner:

  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-west-2a
    - us-west-2b
    - us-west-2c
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - c
    - m
    - r
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "5"
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values:
    - "8"
    - "16"
    - "24"
    - "32"
    - "36"
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values:
    - 4xlarge
    - 6xlarge
    - 8xlarge

[screenshots: test outcomes and metric readings at the time the errors were generated]

nalshamaajc commented 1 year ago

Did one more round of testing while setting CPU to 3 vCPUs with guaranteed QoS. I was still getting the vCPU limit error logs below, but I didn't notice any "slow event handlers blocking the queue" messages.

VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 1663 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
dsouzajaison commented 1 year ago

I still have this error: I0704 12:36:23.401644       1 trace.go:219] Trace[292543710]: "DeltaFIFO Pop Process",Depth:37,Reason:slow event handlers blocking the queue (04-Jul-2023 12:36:23.297) (total time: 104ms):

nalshamaajc commented 1 year ago

@dsouzajaison this could be a resource issue on the Karpenter deployment (mainly CPU); try adding more CPU to your deployment's resource requests.

benjimin commented 2 months ago

FWIW, I've received the same message (trace.go ... "DeltaFIFO Pop Process" ... Reason:slow event handlers blocking the queue) from cluster-autoscaler on EKS, and also seen it reported from completely different applications on other cloud providers. Maybe it's an upstream issue with golang k8s SDKs?