aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

AWS scheduled stop with do-not-disrupt pod #7017

Open jan-ludvik opened 1 month ago

jan-ludvik commented 1 month ago

Description

Observed Behavior: We have a StatefulSet running on a c5d instance. It has the karpenter.sh/do-not-disrupt: "true" annotation. The instance it runs on is scheduled to be turned off by AWS 11 days from now (an instance-stop event). It seems that when Karpenter saw this it began deleting the instance with a CordonAndDrain action, which appears to have killed the pod (a new one was created at "2024-09-13T17:20:16Z"). That itself is strange, because the CordonAndDrain event happened at 17:41:25 (why 21 minutes later?).
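
For reference, the annotation is set on the StatefulSet pod template. A minimal sketch with placeholder names and image (not our actual manifest):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq                      # placeholder name for illustration
spec:
  serviceName: rabbitmq
  replicas: 1
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
      annotations:
        # tells Karpenter not to voluntarily disrupt pods created from this template
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: rabbitmq
          image: rabbitmq:3.13        # placeholder image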

Timeline (UTC):

I understand this event cannot be avoided, so honoring do-not-disrupt may not be possible here. What is strange, though, is the current state, which is definitely not expected:

4 nodes (22471m/31640m) 71.0% cpu ████████████████████████████░░░░░░░░░░░░ $1.536/hour | $1,121.280/month 
1,203 pods (37 pending 1,166 running 1,203 bound)

ip-10-67-41-241.us-west-2.compute.internal cpu ████████████████████████████████░░░  91% (10 pods) c5d.2xlarge/$0.3840 On-Demand -        Ready rabbitmq 74d   
ip-10-67-18-152.us-west-2.compute.internal cpu ████████████████████████████████░░░  91% (10 pods) c5d.2xlarge/$0.3840 On-Demand -        Ready rabbitmq 74d   
ip-10-67-83-223.us-west-2.compute.internal cpu ████████████████████████████████░░░  91% (10 pods) c5d.2xlarge/$0.3840 On-Demand Deleting Ready rabbitmq 74d   
ip-10-67-92-31.us-west-2.compute.internal  cpu ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  10% (9 pods)  c5d.2xlarge/$0.3840 On-Demand -        Ready rabbitmq 2d19h

What seems to have happened is that the new pod scheduled onto the old node, which probably blocked its deletion. Meanwhile the new node reports that it is nominated for a pending pod, so it is not getting deleted either and just hangs around empty.

Old node taints: the Karpenter taint is there, so I don't know why the new pod still scheduled (a toleration sketch follows the events below).

Taints:             dedicated=rabbitmq:NoSchedule
                    karpenter.sh/disrupted:NoSchedule
Events:
  Type    Reason             Age                 From       Message
  ----    ------             ----                ----       -------
  Normal  DisruptionBlocked  95s (x30 over 62m)  karpenter  Cannot disrupt Node: state node is nominated for a pending pod
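
For context, a NoSchedule taint should keep new pods off the node unless they carry a matching toleration. A minimal sketch of the tolerations involved (assumed names, not our actual pod spec):

tolerations:
  - key: dedicated                    # matches the dedicated=rabbitmq:NoSchedule taint
    operator: Equal
    value: rabbitmq
    effect: NoSchedule
  # Only a toleration like the one below (or a blanket operator: Exists toleration)
  # should let a pod schedule past the karpenter.sh/disrupted taint:
  # - key: karpenter.sh/disrupted
  #   operator: Exists
  #   effect: NoSchedule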

Expected Behavior:

Reproduction Steps (Please include YAML):

Versions:

njtran commented 1 month ago

Can you show the logs of when Karpenter detected and consumed the event from the interruption queue? It seems odd to me that the instance/node would be deleted 11 days prior to its actual stoppage. Do you also have the stoppage event?

jan-ludvik commented 1 month ago

I only have info-level logs, but this is the log message. I removed the tags from the message.

{
    "id": "AgAAAZHsebavxXYuzAAAAAAAAAAYAAAAAEFaSHNlY0UwQUFETGw1ek56T3E3TVFBQQAAACQAAAAAMDE5MWVjOGEtYjhmZC00YjlhLWE4ZjQtYmU2MTk2OWMxM2Ex",
    "content": {
        "timestamp": "2024-09-13T17:41:25.295Z",
        "tags": [
<redacted>
        ],
        "host": "i-0c9fc3bbd44946e55",
        "service": "karpenter",
        "message": "initiating delete from interruption message",
        "attributes": {
            "dd": {
                "service": "karpenter"
            },
            "controller": "interruption",
            "k8s_namespace": "kube-system",
            "level": "INFO",
            "logger": "controller",
            "Node": {
                "name": "ip-10-67-83-223.us-west-2.compute.internal"
            },
            "commit": "b897114",
            "messageKind": "scheduled_change",
            "reconcileID": "60ac6ee0-d5e9-4187-afa0-d27dad135411",
            "namespace": "",
            "name": "",
            "action": "CordonAndDrain",
            "time": "2024-09-13T17:41:24.298Z",
            "NodeClaim": {
                "name": "rabbitmq-b7sdk"
            },
            "queue": "app1d-karpenter"
        }
    }
}
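
For reference, app1d-karpenter in the queue field above is the SQS interruption queue Karpenter watches. Assuming the standard Helm chart layout, it is configured roughly like this (a sketch, not our exact values):

settings:
  # SQS queue receiving EC2 scheduled-change and spot interruption events
  interruptionQueue: app1d-karpenter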
jan-ludvik commented 1 month ago

I think I might have found the event in Datadog. We have it with source amazon_ec2, so it might be the one, or Datadog got it from some other place. Datadog shows it a minute later than Karpenter got the message; I'm not sure where that came from.

{
    "id": "AgAAAZHsetaYX9mjzAAAAAAAAAAYAAAAAEFaSHNldUVMQUFBcXR5c29lYjJPU2hBQQAAACQAAAAAMDE5MWVjN2MtNWI4NC00YTQxLTliMGQtNTJiY2Y4OGQxMzVm",
    "content": {
        "timestamp": "2024-09-13T17:42:39Z",
        "tags": [
<redacted>
        ],
        "host": "ip-10-67-83-223.us-west-2.compute.internal-app1d",
        "service": "undefined",
        "message": "%%%\nThe instance is running on degraded hardware\n\ninstance-stop will automatically happen after 2024-09-27 18:00:00 if not manually run before. [aws guide](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#schedevents_actions)\n%%%",
        "attributes": {
            "evt": {
                "id": "7750847193401411082",
                "source_id": 88,
                "type": "status"
            },
            "hostname": "i-07c6f88c9383a3936",
            "service": "undefined",
            "priority": "normal",
            "title": "Upcoming AWS maintenance event instance-stop on instance i-07c6f88c9383a3936",
            "timestamp": 1726249359000,
            "status": "info"
        }
    }
}
jan-ludvik commented 1 month ago

And this is what we have in Datadog from amazon_health:

{
    "id": "AgAAAZHseanQ1LzFsQAAAAAAAAAYAAAAAEFaSHNnSjhRQUFELWZudnVwSEFXcGNvdAAAACQAAAAAMDE5MWVjN2MtZmQ1Zi00MDVjLTg5NDQtMjAxZGUyN2FlMDQy",
    "content": {
        "timestamp": "2024-09-13T17:41:22Z",
        "tags": [
    <redacted>
        ],
        "host": "ip-10-67-83-223.us-west-2.compute.internal-app1d",
        "service": "ec2",
        "message": "%%%\nDescription: \n```\n\"EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance associated with this event in the us-west-2 region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2024-09-27 18:00:00 UTC. Please take appropriate action before this time. You can find more information about retirement events scheduled for your EC2 instances in the AWS Management Console https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Events * What will happen to my instance? Your instance will be stopped after the specified retirement date. You can start it again at any time after it s stopped. Any data on local instance-store volumes will be lost when the instance is stopped or terminated. * What do I need to do? We recommend that you stop and start the instance which will migrate the instance to a new host. Please note that any data on your local instance-store volumes will not be preserved when you stop and start your instance. For more information about stopping and starting your instance, and what to expect when your instance is stopped, such as the effect on public, private and Elastic IP addresses associated with your instance, see Stop and Start Your Instance in the EC2 User Guide (https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/Stop_Start.html). However, if you do not need this instance, you can stop it at any time yourself or wait for EC2 to stop it after the retirement date. * Why is EC2 retiring my instance? EC2 may schedule instances for retirement in cases where there is an unrecoverable issue with the underlying hardware. For more information about scheduled retirement events please see the EC2 user guide (https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/instance-retirement.html). If you have questions or issues, contact AWS Support at: https://aws.amazon.com/support .\"\n```\n\n\nAffected Entities: \n```\n[\n    {\n        \"entityArn\": \"arn:aws:health:us-west-2:182192988802:entity/g1pf6QbeUGUTonkitSbPFG_dx5RL4w0SWzP-b_WKq6djY=1g\",\n        \"eventArn\": \"arn:aws:health:us-west-2::event/EC2/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED_4f62b4a3-18ef-4c1c-b87d-b2303fec0174\",\n        \"entityValue\": \"i-07c6f88c9383a3936\",\n        \"awsAccountId\": \"182192988802\",\n        \"lastUpdatedTime\": \"2024-09-13T17:41:22.601000+00:00\",\n        \"statusCode\": \"IMPAIRED\"\n    }\n]\n```\n\n\nEvent start time : 2024-09-27 18:00:00 UTC\nEvent end time : 2024-09-27 18:00:00 UTC\n\n%%%",
        "attributes": {
            "aggregation_key": "arn:aws:health:us-west-2::event/EC2/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED/AWS_EC2_PERS...",
            "evt": {
                "id": "7750853545852686014",
                "source_id": 201,
                "type": "api"
            },
            "hostname": "i-07c6f88c9383a3936",
            "service": "ec2",
            "priority": "normal",
            "title": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED --- upcoming",
            "event_object": "arn:aws:health:us-west-2::event/EC2/AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED/AWS_EC2_PERS...",
            "timestamp": 1726249282000,
            "status": "info"
        }
    }
}
jan-ludvik commented 1 month ago

The node was tainted:

{
    "id": "AgAAAZHsebavxXYuzQAAAAAAAAAYAAAAAEFaSHNlY0UwQUFETGw1ek56T3E3TVFBQgAAACQAAAAAMDE5MWVjOGEtYjhmZC00YjlhLWE4ZjQtYmU2MTk2OWMxM2Ex",
    "content": {
        "timestamp": "2024-09-13T17:41:25.295Z",
        "tags": [
<redacted>
        ],
        "host": "i-0c9fc3bbd44946e55",
        "service": "karpenter",
        "message": "tainted node",
        "attributes": {
            "dd": {
                "service": "karpenter"
            },
            "controller": "node.termination",
            "k8s_namespace": "kube-system",
            "level": "INFO",
            "logger": "controller",
            "Node": {
                "name": "ip-10-67-83-223.us-west-2.compute.internal"
            },
            "commit": "b897114",
            "reconcileID": "cc4d8735-b018-40f3-8a48-c44e21a01897",
            "taint": {
                "Value": "",
                "Effect": "NoSchedule",
                "Key": "karpenter.sh/disrupted"
            },
            "namespace": "",
            "name": "ip-10-67-83-223.us-west-2.compute.internal",
            "controllerGroup": "",
            "time": "2024-09-13T17:41:24.331Z",
            "controllerKind": "Node"
        }
    }
}
jan-ludvik commented 1 month ago

And Karpenter found the provisionable pod from this node:

{
    "id": "AgAAAZHsepGuvO9T_gAAAAAAAAAYAAAAAEFaSHNlcGd5QUFCenZvNUVNNmMyM3dBQQAAACQAAAAAMDE5MWVjOGEtYjhmZC00YjlhLWE4ZjQtYmU2MTk2OWMxM2Ex",
    "content": {
        "timestamp": "2024-09-13T17:42:21.358Z",
        "tags": [
<redacted>
        ],
        "host": "i-0c9fc3bbd44946e55",
        "service": "karpenter",
        "message": "found provisionable pod(s)",
        "attributes": {
            "duration": "193.032533ms",
            "dd": {
                "service": "karpenter"
            },
            "controller": "provisioner",
            "k8s_namespace": "kube-system",
            "level": "INFO",
            "logger": "controller",
            "commit": "b897114",
            "namespace": "",
            "name": "",
            "Pods": "rabbitmq--app1d/rabbitmq-0",
            "time": "2024-09-13T17:42:21.052Z",
            "reconcileID": "839a738f-4d2c-4180-9bac-47cba50e3b6a"
        }
    }
}
jan-ludvik commented 1 month ago

Then it created a new NodeClaim, but I think the biggest problem is that the pod was not actually rescheduled onto the new node (although it was deleted); it ended up on this exact node again. Is it possible the taint was not there yet?