Open runningman84 opened 1 year ago
What scale are you running the cluster at? Also, what are the resources for the karpenter deployment?
This is a medium size cluster:
$ kubectl get nodes | wc -l
39
$ kubectl get pods --all-namespaces | wc -l
731
Karpenter is running in fargate with these settings:
Limits:
  cpu: 200m
  memory: 768Mi
Requests:
  cpu: 200m
  memory: 768Mi
It normally needs about 100m cpu and about 400MB memory.
We just increased the cpu requests to 500m and removed the cpu limits. The message still appears from time to time...
Karpenter is recommended to run at 1 cpu and 1Gi of memory for both the limit and requests.
> Karpenter is recommended to run at 1 cpu and 1Gi of memory for both the limit and requests.
Not sure where you got that information; any suggested value will always depend on the workload.
Kubernetes best practices recommend against CPU limits too.
@FernandoMiguel You make a good point. I meant that, by default, Karpenter sets the CPU request to 1, as seen in the helm command on the getting started guide.
That's outdated. If you look into the values.yaml those are now commented out.
> If you look into the values.yaml those are now commented out.
You're right @FernandoMiguel. We've yet to post general guidance related to pod count/node count and how that relates to cpu/memory usage. This is on our bucket list of things to do, but up to this point the cpu/memory requests/limits that users use is very user-specific and anecdotal.
Yep. Everyone needs to monitor their workloads and adjust resources based on what they are consuming. And don't set CPU limits!
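As a sketch of that advice, a Helm values override for the Karpenter chart might set requests sized from observed usage and set only a memory limit, leaving the CPU limit unset. The key path (controller.resources) is assumed from the chart's values.yaml, and the numbers are placeholders to tune against your own metrics:

```yaml
# Hypothetical values.yaml override for the Karpenter Helm chart.
# controller.resources key path is assumed; numbers are placeholders.
controller:
  resources:
    requests:
      cpu: "1"       # size from observed usage plus headroom
      memory: 1Gi
    limits:
      memory: 1Gi    # memory limit only; no CPU limit, so no CFS throttling
```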
@runningman84 are you still experiencing this issue?
Yes
Can you check your rate(container_cpu_usage_seconds_total[5m]) metric during the times when you see this message, as well as rate(container_cpu_cfs_throttled_seconds_total[5m]), to see if you are getting throttled and whether that is causing the "slow event handlers blocking the queue" traces. Looking at other issues, this generally seems to be a resource-requests problem: the process needs more CPU than it has been allocated, so it gets throttled, which makes pulling from the client-go FIFO queue slower than expected.
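If those Prometheus metrics aren't handy, the same signal can be read from the container's cgroup CPU counters. A minimal sketch, assuming cgroup v2 and parsing a hard-coded sample snapshot of cpu.stat rather than a live cluster (on a real pod you would read /sys/fs/cgroup/cpu.stat inside the container):

```shell
# Sample snapshot of /sys/fs/cgroup/cpu.stat (cgroup v2); on a live pod:
#   kubectl exec -n karpenter deploy/karpenter -- cat /sys/fs/cgroup/cpu.stat
cpu_stat='usage_usec 1200000
nr_periods 1000
nr_throttled 250
throttled_usec 90000'

nr_periods=$(echo "$cpu_stat" | awk '/^nr_periods/ {print $2}')
nr_throttled=$(echo "$cpu_stat" | awk '/^nr_throttled/ {print $2}')
pct=$((100 * nr_throttled / nr_periods))
# A throttled-period percentage that rises while the "slow event handlers"
# traces appear points at CPU starvation.
echo "${pct}% of CFS periods throttled"
```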
A quick overview of what I believe is happening here and then maybe we can theorize around the best ways to fix this problem:
Handlers are registered on informers both via For() registration (for controllers that watch and react to the same resource) and via handler.EnqueueRequestFromMapFunc, which registers a handler on an informer and then maps it to a specific request on the controller-runtime workqueue. What we believe is happening here is that there are both a large number of handlers registered against a given resource and handlers that call out to the client-go cache that take too long to respond. We need to dig deeper into how the number of handlers registered for the informer (and what that enqueue mechanism is doing) affects informer performance.
For now, we consider these log lines relatively benign; they should go away as you increase the number of CPU cycles allocated to the process.
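One way to allocate more cycles without reintroducing a CPU limit is to bump only the request on the controller deployment. A sketch, with the namespace and deployment name assumed; adjust them to your install:

```shell
# Raise only the requests; leaving limits untouched avoids CFS throttling.
kubectl -n karpenter set resources deployment/karpenter \
  --requests=cpu=1,memory=1Gi
```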
I am facing similar error logs:
Details: We were recently load testing Karpenter, and one of the scenarios was scaling a deployment from 10 to 200 pods. After a while we started seeing the errors below in the Karpenter deployment logs. The deployment was stuck at around 150 ready pods for a long while, and the count kept fluctuating until it finally reached the needed number after some compromises; meanwhile, nodes were being deprovisioned/provisioned very frequently.
The cluster we are testing on has 3 Karpenter-managed nodes on average, plus a managed node group of 4. We have around 185 pods; this is a development cluster, so that number should rise as we go. On other testing clusters we have around 576+.
We are using
I0502 17:56:45.489580 1 trace.go:219] Trace[1197403500]: "DeltaFIFO Pop Process" ID:cert-manager/cert-manager-csi-driver-nbbvx,Depth:14,Reason:slow event handlers blocking the queue (02-May-2023 17:56:45.341) (total time: 148ms):
Trace[1197403500]: [148.358607ms] [148.358607ms] END
@nalshamaajc, check out @jonathan-innis's comments above on CPU requests for Karpenter. Are you monitoring its CPU utilization?
@ellistarn No, not yet. I was considering adding CPU after checking CPU utilization, if that is being collected by our monitoring platform. I understand that you want to see whether CPU is reaching its threshold, to confirm the assumption above that CPU is the bottleneck, am I right?
Yep, that sounds correct. It would be good to see metrics on CPU throttling to see if that's what we are hitting here.
One thing that I still wonder about: if we don't set resource requests and limits in K8s, the pods created by the deployment will have no resource constraints, meaning they have unrestricted access to CPU and memory on the nodes where they are scheduled. In other words, why would CPU be throttled in my case if no limits are set in the first place?
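That intuition can be checked directly, because a CFS quota only exists when a CPU limit is set. A small sketch using a hard-coded sample of the cgroup v2 cpu.max file (on a live pod you would read /sys/fs/cgroup/cpu.max inside the container): with no limit the first field is the literal string "max", so quota-based throttling cannot occur and any slowness must come from node-level contention instead.

```shell
# Sample /sys/fs/cgroup/cpu.max for a container with no CPU limit set.
# With a 500m limit it would instead read e.g. "50000 100000".
cpu_max='max 100000'

quota=$(echo "$cpu_max" | awk '{print $1}')
if [ "$quota" = "max" ]; then
  echo "no CFS quota: throttling cannot come from a limit"
else
  echo "quota set: ${quota}us per period"
fi
```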
FYI @ellistarn @jonathan-innis I reproduced the errors before setting any limits or requests on CPU and didn't notice any CPU throttling happening. There was definitely an increase in CPU utilization, but nothing crazy at the container level, as you can see below. These are two screenshots: one shows the metrics on both host and container, and the second shows the container alone. The point where I'm showing the percentage in the graph should be the exact point the log was reported at.
It is worth mentioning though that the host had high CPU utilization reaching +93% around 10 minutes before the log was reported (last screenshot)
[EDIT]
PS:
During the test, it was discovered that the topologySpreadConstraints with hostname can create issues during scale-down, as the constraint compels Karpenter to provision more nodes to ensure that each pod is scheduled on its own node. The hostname constraint scheduled many nodes to satisfy its requirements. However, when the hostname constraint was removed, scaling in/out happened faster and without errors, mostly using the same node size as previous deployments with only five nodes (four of type c6a.4xlarge and one of type c6a.8xlarge), as opposed to around 64 nodes when the condition was set. Moreover, scaling down was significantly faster, and no error logs were generated. It appears that the constraint was exerting additional pressure on Karpenter to fulfill its requirements.
The log below was generated excessively when the condition below was set:
{"level":"DEBUG","time":"2023-05-09T19:43:09.750Z","logger":"controller.deprovisioning","message":"relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {\"maxSkew\":1,\"topologyKey\":\"kubernetes.io/hostname\",\"whenUnsatisfiable\":\"ScheduleAnyway\",\"labelSelector\":{\"matchLabels\":{\"app\":\"pir-test\"}}}","commit":"7131be2-dirty","pod":"default/pir-test-5c4748cc6d-9zq8b"}
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname # makes the node a domain of itself
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: pir-test
Interesting data. I wouldn't be surprised if this was a noisy neighbor issue given the node was under contention. It's hard to know more without reproducing myself. Will leave this open as some future performance work from our side.
Hello, I ran some tests with the below conditions: a topologySpreadConstraints with the hostname key and a ScheduleAnyway condition. I had the below outcomes:
- "relaxing soft constraints" logs for the hostname topologySpreadConstraints (Image A and B)
- "slow event handlers blocking the queue" logs
- c6a.4xlarge instances
In the second test I removed the hostname condition and kept the AZ condition; the rest of the conditions remained the same. I had the below outcomes:
- c6a.4xlarge instances
The below image also shows the metric readings at the time the "slow event handlers blocking the queue" errors were being generated.
You can find below the deployment manifest I used in both of my tests (the commented part was uncommented in the first test).
The last image shows the metrics at the time the error log was generated.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: pir-test
  name: pir-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: pir-test
  template:
    metadata:
      labels:
        app: pir-test
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule # doesn't match our current setup #ScheduleAnyway
          labelSelector:
            matchLabels:
              app: pir-test
        # - maxSkew: 1
        #   topologyKey: kubernetes.io/hostname # makes the node a domain of itself
        #   whenUnsatisfiable: ScheduleAnyway
        #   labelSelector:
        #     matchLabels:
        #       app: pir-test
      containers:
        - image: public.ecr.aws/nginx/nginx
          name: nginx
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
      nodeSelector:
        karpenter.sh/provisioner-name: default
Below are the requirements in my only provisioner
- key: topology.kubernetes.io/zone
  operator: In
  values:
    - us-west-2a
    - us-west-2b
    - us-west-2c
- key: kubernetes.io/arch
  operator: In
  values:
    - amd64
- key: karpenter.sh/capacity-type
  operator: In
  values:
    - on-demand
- key: kubernetes.io/os
  operator: In
  values:
    - linux
- key: karpenter.k8s.aws/instance-category
  operator: In
  values:
    - c
    - m
    - r
- key: karpenter.k8s.aws/instance-generation
  operator: Gt
  values:
    - "5"
- key: karpenter.k8s.aws/instance-cpu
  operator: In
  values:
    - "8"
    - "16"
    - "24"
    - "32"
    - "36"
- key: karpenter.k8s.aws/instance-size
  operator: In
  values:
    - 4xlarge
    - 6xlarge
    - 8xlarge
Did one more round of testing while setting CPU to 3 vCPUs with guaranteed QoS. I was still getting vCPU limit error logs, but I didn't notice any "slow event handlers blocking the queue" logs.
VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 1663 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
I still have this error
I0704 12:36:23.401644 1 trace.go:219] Trace[292543710]: "DeltaFIFO Pop Process",Depth:37,Reason:slow event handlers blocking the queue (04-Jul-2023 12:36:23.297) (total time: 104ms):
@dsouzajaison this could be a resources issue on the Karpenter deployment (mainly CPU); try adding more CPU to your deployment's resource requests.
FWIW, I've received the same message (trace.go ... "DeltaFIFO Pop Process" ... Reason:slow event handlers blocking the queue) from cluster-autoscaler on EKS, and also seen it reported from completely different applications on other cloud providers. Maybe it's an upstream issue with golang k8s SDKs?
Version
Karpenter Version: v0.27.2
Kubernetes Version: v1.24.10
Expected Behavior
na
Actual Behavior
We see a lot of messages like this: DeltaFIFO Pop Process" ID:header-service/header-service-7c6995b78b-5zmnp,Depth:14,Reason:slow event handlers blocking the queue (18-Apr-2023 10:15:25.363) What should we do to fix this problem? Is this a bug?
Steps to Reproduce the Problem
Install karpenter in a cluster
Resource Specs and Logs