jan-ludvik opened this issue 1 month ago (status: Open)
Can you show the logs of when Karpenter detected and consumed the event from the interruption queue? It seems odd to me that the instance/node would be deleted 11 days prior to its actual stoppage. Do you also have the stoppage event?
I only have info-level logs, but this is the log message. I removed the tags from it.
{
"id": "AgAAAZHsebavxXYuzAAAAAAAAAAYAAAAAEFaSHNlY0UwQUFETGw1ek56T3E3TVFBQQAAACQAAAAAMDE5MWVjOGEtYjhmZC00YjlhLWE4ZjQtYmU2MTk2OWMxM2Ex",
"content": {
"timestamp": "2024-09-13T17:41:25.295Z",
"tags": [
<redacted>
],
"host": "i-0c9fc3bbd44946e55",
"service": "karpenter",
"message": "initiating delete from interruption message",
"attributes": {
"dd": {
"service": "karpenter"
},
"controller": "interruption",
"k8s_namespace": "kube-system",
"level": "INFO",
"logger": "controller",
"Node": {
"name": "ip-10-67-83-223.us-west-2.compute.internal"
},
"commit": "b897114",
"messageKind": "scheduled_change",
"reconcileID": "60ac6ee0-d5e9-4187-afa0-d27dad135411",
"namespace": "",
"name": "",
"action": "CordonAndDrain",
"time": "2024-09-13T17:41:24.298Z",
"NodeClaim": {
"name": "rabbitmq-b7sdk"
},
"queue": "app1d-karpenter"
}
}
}
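(For context, the "queue" field above is the SQS interruption queue Karpenter polls. Assuming the upstream karpenter Helm chart, interruption handling is wired up with values roughly like the sketch below; the queue name is taken from the log above.)

```yaml
# Sketch of the Helm values that enable Karpenter's interruption handling,
# assuming the upstream karpenter chart; the queue name comes from the log above.
settings:
  interruptionQueue: app1d-karpenter
```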
I think I might have found the event in Datadog. We have it with source amazon_ec2, so it may be the same event, or Datadog may have picked it up from somewhere else. Datadog shows it a minute later than when Karpenter got the message, so I'm not sure where it came from.
{
"id": "AgAAAZHsetaYX9mjzAAAAAAAAAAYAAAAAEFaSHNldUVMQUFBcXR5c29lYjJPU2hBQQAAACQAAAAAMDE5MWVjN2MtNWI4NC00YTQxLTliMGQtNTJiY2Y4OGQxMzVm",
"content": {
"timestamp": "2024-09-13T17:42:39Z",
"tags": [
<redacted>
],
"host": "ip-10-67-83-223.us-west-2.compute.internal-app1d",
"service": "undefined",
"message": "%%%\nThe instance is running on degraded hardware\n\ninstance-stop will automatically happen after 2024-09-27 18:00:00 if not manually run before. [aws guide](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#schedevents_actions)\n%%%",
"attributes": {
"evt": {
"id": "7750847193401411082",
"source_id": 88,
"type": "status"
},
"hostname": "i-07c6f88c9383a3936",
"service": "undefined",
"priority": "normal",
"title": "Upcoming AWS maintenance event instance-stop on instance i-07c6f88c9383a3936",
"timestamp": 1726249359000,
"status": "info"
}
}
}
And this is what we have in Datadog from amazon_health:
{
"id": "AgAAAZHseanQ1LzFsQAAAAAAAAAYAAAAAEFaSHNnSjhRQUFELWZudnVwSEFXcGNvdAAAACQAAAAAMDE5MWVjN2MtZmQ1Zi00MDVjLTg5NDQtMjAxZGUyN2FlMDQy",
"content": {
"timestamp": "2024-09-13T17:41:22Z",
"tags": [
<redacted>
],
"host": "ip-10-67-83-223.us-west-2.compute.internal-app1d",
"service": "ec2",
"message": "%%%\nDescription: \n```\n\"EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance associated with this event in the us-west-2 region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2024-09-27 18:00:00 UTC. Please take appropriate action before this time. You can find more information about retirement events scheduled for your EC2 instances in the AWS Management Console https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Events * What will happen to my instance? Your instance will be stopped after the specified retirement date. You can start it again at any time after it s stopped. Any data on local instance-store volumes will be lost when the instance is stopped or terminated. * What do I need to do? We recommend that you stop and start the instance which will migrate the instance to a new host. Please note that any data on your local instance-store volumes will not be preserved when you stop and start your instance. For more information about stopping and starting your instance, and what to expect when your instance is stopped, such as the effect on public, private and Elastic IP addresses associated with your instance, see Stop and Start Your Instance in the EC2 User Guide (https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/Stop_Start.html). However, if you do not need this instance, you can stop it at any time yourself or wait for EC2 to stop it after the retirement date. * Why is EC2 retiring my instance? EC2 may schedule instances for retirement in cases where there is an unrecoverable issue with the underlying hardware. For more information about scheduled retirement events please see the EC2 user guide (https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/instance-retirement.html). If you have questions or issues, contact AWS Support at: https://aws.amazon.com/support .\"\n```\n\n\nAffected Entities: \n```\n[\n {\n \"entityArn\": \"arn:aws:health:us-west-2:182192988802:entity/g1pf6QbeUGUTonkitSbPFG_dx5RL4w0SWzP-b_WKq6djY=1g\",\n \"eventArn\": \"arn:aws:health:us-west-2::event/EC2/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED_4f62b4a3-18ef-4c1c-b87d-b2303fec0174\",\n \"entityValue\": \"i-07c6f88c9383a3936\",\n \"awsAccountId\": \"182192988802\",\n \"lastUpdatedTime\": \"2024-09-13T17:41:22.601000+00:00\",\n \"statusCode\": \"IMPAIRED\"\n }\n]\n```\n\n\nEvent start time : 2024-09-27 18:00:00 UTC\nEvent end time : 2024-09-27 18:00:00 UTC\n\n%%%",
"attributes": {
"aggregation_key": "arn:aws:health:us-west-2::event/EC2/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED/AWS_EC2_PERS...",
"evt": {
"id": "7750853545852686014",
"source_id": 201,
"type": "api"
},
"hostname": "i-07c6f88c9383a3936",
"service": "ec2",
"priority": "normal",
"title": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED --- upcoming",
"event_object": "arn:aws:health:us-west-2::event/EC2/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED/AWS_EC2_PERS...",
"timestamp": 1726249282000,
"status": "info"
}
}
}
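For context on how these AWS Health events reach Karpenter: scheduled-change events are typically forwarded to the interruption SQS queue by an EventBridge rule. A minimal sketch, assuming a setup similar to Karpenter's getting-started CloudFormation (the resource names here are illustrative, not from our account):

```yaml
# Sketch: EventBridge rule forwarding AWS Health events (like the scheduled
# instance retirement above) to the Karpenter interruption queue. Assumes
# resources similar to the getting-started CloudFormation template.
ScheduledChangeRule:
  Type: AWS::Events::Rule
  Properties:
    EventPattern:
      source:
        - aws.health
      detail-type:
        - AWS Health Event
    Targets:
      - Id: KarpenterInterruptionQueueTarget
        Arn: !GetAtt KarpenterInterruptionQueue.Arn
```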
The node was tainted
{
"id": "AgAAAZHsebavxXYuzQAAAAAAAAAYAAAAAEFaSHNlY0UwQUFETGw1ek56T3E3TVFBQgAAACQAAAAAMDE5MWVjOGEtYjhmZC00YjlhLWE4ZjQtYmU2MTk2OWMxM2Ex",
"content": {
"timestamp": "2024-09-13T17:41:25.295Z",
"tags": [
<redacted>
],
"host": "i-0c9fc3bbd44946e55",
"service": "karpenter",
"message": "tainted node",
"attributes": {
"dd": {
"service": "karpenter"
},
"controller": "node.termination",
"k8s_namespace": "kube-system",
"level": "INFO",
"logger": "controller",
"Node": {
"name": "ip-10-67-83-223.us-west-2.compute.internal"
},
"commit": "b897114",
"reconcileID": "cc4d8735-b018-40f3-8a48-c44e21a01897",
"taint": {
"Value": "",
"Effect": "NoSchedule",
"Key": "karpenter.sh/disrupted"
},
"namespace": "",
"name": "ip-10-67-83-223.us-west-2.compute.internal",
"controllerGroup": "",
"time": "2024-09-13T17:41:24.331Z",
"controllerKind": "Node"
}
}
}
And karpenter found the provisionable pod from this node
{
"id": "AgAAAZHsepGuvO9T_gAAAAAAAAAYAAAAAEFaSHNlcGd5QUFCenZvNUVNNmMyM3dBQQAAACQAAAAAMDE5MWVjOGEtYjhmZC00YjlhLWE4ZjQtYmU2MTk2OWMxM2Ex",
"content": {
"timestamp": "2024-09-13T17:42:21.358Z",
"tags": [
<redacted>
],
"host": "i-0c9fc3bbd44946e55",
"service": "karpenter",
"message": "found provisionable pod(s)",
"attributes": {
"duration": "193.032533ms",
"dd": {
"service": "karpenter"
},
"controller": "provisioner",
"k8s_namespace": "kube-system",
"level": "INFO",
"logger": "controller",
"commit": "b897114",
"namespace": "",
"name": "",
"Pods": "rabbitmq--app1d/rabbitmq-0",
"time": "2024-09-13T17:42:21.052Z",
"reconcileID": "839a738f-4d2c-4180-9bac-47cba50e3b6a"
}
}
}
Then it created a new NodeClaim, but I think the biggest problem is that the pod was not actually rescheduled to the new node (although it was deleted); it ended up on this exact node again. Is it possible the taint was not there yet?
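One thing that might be worth ruling out as an alternative to "the taint was not there yet": a pod that tolerates the taint will schedule onto the node even with karpenter.sh/disrupted:NoSchedule already in place. A hypothetical pod-template fragment showing the kinds of tolerations that would allow that:

```yaml
# Hypothetical tolerations on the StatefulSet's pod template; either one would
# let a pod land on a node already tainted karpenter.sh/disrupted:NoSchedule.
tolerations:
  - operator: Exists                # blanket toleration: tolerates every taint
  # - key: karpenter.sh/disrupted   # or a targeted toleration for this taint only
  #   operator: Exists
  #   effect: NoSchedule
```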
Description
Observed Behavior: We have a StatefulSet running on a c5d instance. It has the karpenter.sh/do-not-disrupt: "true" annotation. The instance it runs on is supposed to be stopped by AWS 11 days from now (an instance-stop event). It seems that when Karpenter saw this, it began deleting the instance with the CordonAndDrain action. That seems to have killed the pod (a new one was created at "2024-09-13T17:20:16Z"). That itself is strange, because the CordonAndDrain event happened at 17:41:25 (why 21 minutes later?).
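For reference, the annotation sits on the StatefulSet's pod template metadata; a minimal sketch of that part of the manifest (the workload name and image are illustrative, not the real manifest):

```yaml
# Minimal sketch showing where karpenter.sh/do-not-disrupt is set; the name,
# labels and image are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: rabbitmq
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
      annotations:
        karpenter.sh/do-not-disrupt: "true"  # asks Karpenter not to voluntarily disrupt this pod
    spec:
      containers:
        - name: rabbitmq
          image: rabbitmq:3
```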
Timeline (UTC):
I understand this event cannot be avoided, so honoring the do-not-disrupt annotation might not be needed. However, what is strange is the current state, which is definitely not expected. What seems to have happened is that the new pod scheduled onto the old node. This probably blocked the deletion, and meanwhile the new node says it is nominated for the pending pod, is not getting deleted, and hangs around empty.
Old node taints - the Karpenter taint is there, so I don't know why the new pod still scheduled.
Expected Behavior:
Reproduction Steps (Please include YAML):
Versions:
Chart Version: 1.0.2
Kubernetes Version (kubectl version): 1.29
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment