Closed. hybby closed this issue 2 weeks ago.
Could you share your full Karpenter logs? There should be a message starting with "disrupting via" that precedes any voluntary disruption. There should also be a different message before any spot interruption event, though Karpenter does try to preemptively remove the instance, so you could see Karpenter performing the deletion in that scenario.
Hi @jmdeal - thanks for looking at this.
The strange thing about this particular event is that there was no `disrupting via` message, even though we do see those messages for earlier terminations. I'm interested to dig into scenarios where Karpenter would not log a reason when it deletes a node and terminates an instance.
Please find the full Karpenter log for the period matching `2024-05-16T02:.*`. I've had to redact certain information, but in all other regards the log is complete as reported by Karpenter (it's effectively a `grep '2024-05-16T02:' karpenter.log`). I hope that's okay. I've left partial node identifiers intact; the node in question that didn't log a termination reason was `ip-<redacted_subnet>-18-33`. This is instance `i-0b94ea54e8444e6e6`.
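Since the log below was produced with a plain grep on the timestamp, a slightly wider filter can isolate just the lifecycle messages relevant to this question. A minimal sketch (the message strings are taken from the log lines below; the function name is mine):

```shell
# filter_disruption: keep only the Karpenter log lines that describe a
# disruption decision or a node/nodeclaim being removed.
filter_disruption() {
  grep -E 'disrupting via|tainted node|deleted node|deleted nodeclaim'
}

# usage: filter_disruption < karpenter.log
```

For the node in question this should show a `disrupting via` decision followed by taint/delete lines; here it only shows the latter.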
{"level":"INFO","time":"2024-05-16T02:10:36.753Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-18-33.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:10:37.073Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-18-33.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:10:37.549Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-txnxx","node":"ip-<redacted_subnet>-18-33.<redacted_region>.compute.internal","provider-id":"aws:///<redacted_region>b/i-0b94ea54e8444e6e6"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-84785587cc-d5xcw\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c76c7b59d-h95j7\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-77d97cb769-xjvc9\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-7dc96d4dcf-qd8mb\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-6b545fc7b-cwtt7\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-85f7ff8787-5xrwf\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-7f5cd875f7-9gfhn\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-78dcbd5fd7-bt5zt\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-89d7d77d8-ml2qh\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c6678c9dd-nq8tb\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-56bcbb5ff5-dx9k7\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-78d8db59d7-zpsx5\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.585Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"81.176893ms"}
{"level":"INFO","time":"2024-05-16T02:10:40.585Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":12}
{"level":"INFO","time":"2024-05-16T02:10:40.596Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"spot-b","nodeclaim":"spot-b-r4srl","requests":{"cpu":"6680m","memory":"11294Mi","pods":"20"},"instance-types":"c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c6a.2xlarge and 5 other(s)"}
{"level":"INFO","time":"2024-05-16T02:10:42.578Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-r4srl","provider-id":"aws:///<redacted_region>b/i-03f59bdfa13f2d20e","instance-type":"c5.2xlarge","zone":"<redacted_region>b","capacity-type":"spot","allocatable":{"cpu":"7910m","ephemeral-storage":"35Gi","memory":"14162Mi","pods":"58","vpc.amazonaws.com/pod-eni":"38"}}
{"level":"INFO","time":"2024-05-16T02:10:50.575Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"71.798101ms"}
{"level":"INFO","time":"2024-05-16T02:11:00.582Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"77.31605ms"}
{"level":"INFO","time":"2024-05-16T02:11:03.806Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-r4srl","provider-id":"aws:///<redacted_region>b/i-03f59bdfa13f2d20e","node":"ip-<redacted_subnet>-18-174.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:11:10.576Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"71.272592ms"}
{"level":"INFO","time":"2024-05-16T02:11:19.725Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-r4srl","provider-id":"aws:///<redacted_region>b/i-03f59bdfa13f2d20e","node":"ip-<redacted_subnet>-18-174.<redacted_region>.compute.internal","allocatable":{"cpu":"7910m","ephemeral-storage":"37569620724","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"14879196Ki","pods":"58"}}
{"level":"INFO","time":"2024-05-16T02:12:18.536Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:12:19.281Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:12:19.606Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-c2whp","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal","provider-id":"aws:///<redacted_region>a/i-0f938d0d982eddb74"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-5cd5f4d9cd-2tdk2\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-84959f49bb-c6jpp\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-5598ddfb74-gjg4f\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-64876d869-fspnt\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-578648d474-6w72p\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-cbdf4f4f9-66q9g\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-844995f477-4mrmd\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-67ff785bd4-tpjp8\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-64c878f96b-zvp4k\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-67b9f669f8-r4rsp\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c98cbc6cf-452wr\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.226Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"131.457729ms"}
{"level":"INFO","time":"2024-05-16T02:12:24.226Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":11}
{"level":"INFO","time":"2024-05-16T02:12:24.253Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"spot-b","nodeclaim":"spot-b-t82jt","requests":{"cpu":"6180m","memory":"11386Mi","pods":"19"},"instance-types":"c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c6a.2xlarge and 5 other(s)"}
{"level":"INFO","time":"2024-05-16T02:12:26.284Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-t82jt","provider-id":"aws:///<redacted_region>a/i-0c946df4c47d0b561","instance-type":"c5.2xlarge","zone":"<redacted_region>a","capacity-type":"spot","allocatable":{"cpu":"7910m","ephemeral-storage":"35Gi","memory":"14162Mi","pods":"58","vpc.amazonaws.com/pod-eni":"38"}}
{"level":"INFO","time":"2024-05-16T02:12:34.167Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"71.301313ms"}
{"level":"INFO","time":"2024-05-16T02:12:44.168Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"71.14042ms"}
{"level":"INFO","time":"2024-05-16T02:12:54.169Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"72.239342ms"}
{"level":"INFO","time":"2024-05-16T02:12:54.692Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-t82jt","provider-id":"aws:///<redacted_region>a/i-0c946df4c47d0b561","node":"ip-<redacted_subnet>-12-14.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:13:04.168Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"70.90903ms"}
{"level":"INFO","time":"2024-05-16T02:13:10.726Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-t82jt","provider-id":"aws:///<redacted_region>a/i-0c946df4c47d0b561","node":"ip-<redacted_subnet>-12-14.<redacted_region>.compute.internal","allocatable":{"cpu":"7910m","ephemeral-storage":"37569620724","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"14764516Ki","pods":"58"}}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-84785587cc-tfpwn\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-7f5cd875f7-rz8ds\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-6b545fc7b-5fv7w\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c76c7b59d-jwntb\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.690Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"160.230382ms"}
{"level":"INFO","time":"2024-05-16T02:13:16.690Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":4}
{"level":"INFO","time":"2024-05-16T02:13:16.700Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"spot-b","nodeclaim":"spot-b-stspw","requests":{"cpu":"2680m","memory":"4918Mi","pods":"12"},"instance-types":"c5.2xlarge, c5.4xlarge, c5.xlarge, c5a.2xlarge, c5a.4xlarge and 15 other(s)"}
{"level":"INFO","time":"2024-05-16T02:13:18.510Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-stspw","provider-id":"aws:///<redacted_region>a/i-0cf064b31c2625a0f","instance-type":"c5.xlarge","zone":"<redacted_region>a","capacity-type":"spot","allocatable":{"cpu":"3920m","ephemeral-storage":"35Gi","memory":"6584Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2024-05-16T02:13:26.594Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"62.883376ms"}
{"level":"INFO","time":"2024-05-16T02:13:36.596Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"65.077591ms"}
{"level":"INFO","time":"2024-05-16T02:13:46.481Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-stspw","provider-id":"aws:///<redacted_region>a/i-0cf064b31c2625a0f","node":"ip-<redacted_subnet>-15-39.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:13:46.599Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"67.23661ms"}
{"level":"INFO","time":"2024-05-16T02:13:56.605Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"71.701185ms"}
{"level":"INFO","time":"2024-05-16T02:14:02.522Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-stspw","provider-id":"aws:///<redacted_region>a/i-0cf064b31c2625a0f","node":"ip-<redacted_subnet>-15-39.<redacted_region>.compute.internal","allocatable":{"cpu":"3920m","ephemeral-storage":"37569620724","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"6749636Ki","pods":"58"}}
Since the logs start with the node you called out, `ip-<redacted_subnet>-18-33`, I can't tell if there was any previous indication. I did notice a similar event:
{"level":"INFO","time":"2024-05-16T02:12:18.536Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:12:19.281Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
If this was caused by voluntary disruption (e.g. consolidation, expiration, or drift), there would have been a log line from the disruption controller indicating the disruption decision, and an additional log line from when the `karpenter.sh/disruption` taint was applied by the disruption controller (here we only see it applied by the termination controller). Spot interruption would also have added a log line as well as published an event to the node. As far as I'm aware, there isn't any form of Karpenter-driven disruption that should have this shape. Is it possible that some other process is deleting the nodes, triggering Karpenter's termination controller?
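Given that expected shape, one way to hunt for other occurrences is to flag every node that has a `deleted node` entry but no disruption decision anywhere in the log. A rough sketch, assuming the `disrupting via` decision line mentions the node name (the function name and the field extraction are mine, not part of Karpenter):

```shell
# flag_unexplained_deletions: print node names that have a "deleted node"
# log line but no "disrupting via" decision mentioning them.
flag_unexplained_deletions() {
  log="$1"
  grep '"message":"deleted node"' "$log" \
    | sed 's/.*"node":"\([^"]*\)".*/\1/' \
    | sort -u \
    | while read -r node; do
        # no decision line mentions this node -> its deletion is unexplained
        grep 'disrupting via' "$log" | grep -q "$node" || echo "$node"
      done
}

# usage: flag_unexplained_deletions karpenter.log
```

On the log excerpt above, every deleted node would be flagged, which is exactly the symptom being reported.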
Description
Observed Behavior: Raised at the request of AWS support as part of an ongoing root cause analysis (see case ID: 171587130200061).
Our nodePools have both expiry (24h) and consolidation (WhenEmpty, 15m). They are configured to use spot instances.
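For reference, those disruption settings map onto the NodePool's `disruption` block. The real manifests are attached below with redactions; this is only a hedged sketch of the relevant fields (field names per the v1beta1 NodePool API, values per the description above):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-b
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # only consolidate empty nodes
    consolidateAfter: 15m            # after 15 minutes of being empty
    expireAfter: 24h                 # expire nodes after 24 hours
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
```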
I have observed some instance terminations performed by the Karpenter controller that do not have a reason logged. This omission makes it difficult, if not impossible, to troubleshoot instance terminations caused by Karpenter.
We also recently upgraded from v0.32.9 to v0.36.1, performing the necessary API object migration as part of this; the issue only started occurring after the upgrade. At the time of the event noted below, we did not have an interruptionQueue configured due to a misconfiguration (a separate issue which has since been resolved).
It is not possible for me to determine why Karpenter terminated these nodes (it does not seem to obey either the expiry or consolidation rules that were set up). I believe a log message should have been emitted prior to the ec2:TerminateInstances call specifying why, and it was not in this case.
As a result of the short notice of this termination, Karpenter did not wait to allow graceful eviction of pods. Pods terminated abruptly, causing some requests routed to them to fail.
I do not think the termination was caused by Spot Instance reclamation, because it was Karpenter that terminated the instances, not AWS; this is provable through CloudTrail.
An example, showing nodeclaim registration, a gap of less than a day (proving it was not terminated due to Expiry), then taint, node deletion and nodeclaim deletion:
I see this same node terminated by the Karpenter controller in CloudTrail, at exactly the same time:
Expected Behavior: I would expect a log line prior to the taint and deletion telling me the reason for the deletion. For example, for node expiry:
Reproduction Steps (Please include YAML): Noted as part of operations; I have not observed or been able to reproduce this outside of our running system.
See NodePool and NodeClass below:
NodePool:
EC2NodeClass:
Versions:
Chart Version: karpenter-0.36.1
Kubernetes Version (`kubectl version`): Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.12-eks-adc7111", GitCommit:"fbc0805ed3106b2a4e5bfcdea17957260b7027bc", GitTreeState:"clean", BuildDate:"2024-04-01T19:25:14Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}