aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Some Karpenter Node Deletions Have No Logged Reason #6254

Closed · hybby closed this 2 weeks ago

hybby commented 3 months ago

Description

Observed Behavior: Raised at the request of AWS support as part of ongoing root cause analysis (see case ID: 171587130200061).

Our nodePools have both expiry (24h) and consolidation (WhenEmpty, 15m). They are configured to use spot instances.

I have observed some instance terminations performed by the Karpenter controller for which no reason was logged. This omission makes it difficult, if not impossible, to troubleshoot certain instance terminations caused by Karpenter.

We have also recently upgraded from v0.32.9 to v0.36.1, performing the necessary API object migration as part of this; the issue only started occurring after this upgrade. At the time of the event noted below, we did not have an interruptionQueue configured due to a misconfiguration (a separate issue which has since been resolved).
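
For context, interruption handling is enabled by pointing Karpenter at the SQS queue through the chart settings. A minimal sketch of how that setting is normally wired up, assuming the upstream Helm chart, a karpenter namespace, and placeholder cluster/queue names (our actual values differ):

# Sketch: set the interruption queue on an existing install (placeholder names)
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --version 0.36.1 \
  --reuse-values \
  --set settings.clusterName=<cluster-name> \
  --set settings.interruptionQueue=<interruption-queue-name>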

It is not possible for me to determine why Karpenter terminated these nodes (it does not appear to obey either the expiry or the consolidation rules that were set up). I believe a log message specifying the reason should have been emitted prior to the ec2:TerminateInstances call, but none was emitted in this case.

As a result of the short notice of this termination, Karpenter did not allow graceful eviction of pods. Pods terminated abruptly, and some requests that were routed to those pods failed.

I do not think the termination was caused by Spot Instance reclamation, because it was Karpenter that terminated the instances, not AWS, as can be proven through CloudTrail.
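
(For reference, a CloudTrail lookup along these lines will show the TerminateInstances call and the calling identity; instance ID and time window are placeholders, and the flags assume a recent AWS CLI:)

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-<redacted> \
  --start-time 2024-05-16T02:00:00Z \
  --end-time 2024-05-16T03:00:00Z \
  --query 'Events[?EventName==`TerminateInstances`].[EventTime,Username]' \
  --output table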

An example, showing nodeclaim registration, a gap of less than a day (proving it was not terminated due to Expiry), then taint, node deletion and nodeclaim deletion:

{"level":"INFO","time":"2024-05-15T06:59:06.531Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-txnxx","provider-id":"aws:///<redacted>/<redacted>","node":"<redacted>.compute.internal"}
...
{"level":"INFO","time":"2024-05-16T02:10:36.753Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted>.compute.internal"}                                                                                                                                      {"level":"INFO","time":"2024-05-16T02:10:37.073Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted>..compute.internal"}
{"level":"INFO","time":"2024-05-16T02:10:37.549Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-txnxx","node":"<redacted>.compute.internal","provider-id":"aws:///<redacted>/i-<redacted>"}

I see this same node terminated by the Karpenter controller in CloudTrail, at exactly the same time:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
...redacted...
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAUWJHVGBT6QBIAKIAH",
                "arn": "arn:aws:iam::<redacted>:role/prd-KarpenterController",
...
                "userName": "prd-KarpenterController"
            },
...
    "eventTime": "2024-05-16T02:10:37Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
...
    "sourceIPAddress": "18.202.70.33",
    "userAgent": "aws-sdk-go/1.51.16 (go1.22.2; linux; amd64) karpenter.sh-0.36.1",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-<redacted>"
                }
            ]
        }
    },
...
            "items": [
                {
                    "instanceId": "i-<redacted>",
                    "currentState": {
                        "code": 48,
                        "name": "terminated"
                    },
                    "previousState": {
                        "code": 48,
                        "name": "terminated"
                    }
                }

Expected Behavior: I would expect a line prior to the taint and deletion telling me the reason for deletion. For example, for node expiry:

{"level":"INFO","time":"2024-05-16T13:30:38.116Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"fb4d75f","ttl":"24h0m0s"}                                                                                                                                          {"level":"INFO","time":"2024-05-16T13:30:38.116Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (30 pods) <redacted>.compute.internal/c6a.4xlarge/spot and replacing with spot node from types c6a.4xlarge, c5.4xlarge, c5a.4xlarge","commit":"fb4d75f","command-id":"43cd0dcd-a7d1-4f43-afd6-09d81267170b"}
...
{"level":"INFO","time":"2024-05-16T13:31:24.552Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"<redacted>.eu-west-1.compute.internal"} 
... etc ...

Reproduction Steps (Please include YAML): Noted as part of operations; I have not observed or been able to reproduce this outside of our running system.

See NodePool and NodeClass below:

NodePool:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "12556785943003641189"
    karpenter.sh/nodepool-hash-version: v2
  creationTimestamp: "2024-05-09T14:55:58Z"
  generation: 1
  name: spot-b
  resourceVersion: "582580156"
  uid: e293ef5f-9c47-42ea-a65b-173f4a32a395
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 15m
    consolidationPolicy: WhenEmpty
    expireAfter: 24h
  limits:
    cpu: 1k
  template:
    metadata:
      labels:
        colour: b
        spot: "true"
    spec:
      nodeClassRef:
        name: spot-b
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - c5.xlarge
        - c5.2xlarge
        - c5.4xlarge
        - c5a.xlarge
        - c5a.2xlarge
        - c5a.4xlarge
        - c6a.xlarge
        - c6a.2xlarge
        - c6a.4xlarge
        - c6i.xlarge
        - c6i.2xlarge
        - m5.large
        - m5.xlarge
        - m5.2xlarge
        - m6a.large
        - m6a.xlarge
        - m6a.2xlarge
        - m6i.large
        - m6i.xlarge
        - m6i.2xlarge
        - r5.large
        - r5.xlarge
        - r6a.large
        - r6a.xlarge
        - r6i.large
        - r6i.xlarge
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      taints:
      - effect: NoSchedule
        key: spot
        value: "true"
status:
  resources:
    cpu: "88"
    ephemeral-storage: 335445856Ki
    memory: 183611308Ki
    pods: "1168"

EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "3052933320968790403"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
  creationTimestamp: "2024-05-09T14:56:02Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 1
  name: spot-b
  resourceVersion: "582649354"
  uid: 99748932-f789-4fd6-b05d-e72966891cc7
spec:
  amiFamily: AL2
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      volumeSize: 40Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: prd-KarpenterNodeRole
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: <redacted>
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: <redacted>
  tags:
    Name: prd-b-karp-spot
status:
  amis:
  - id: ami-<redacted>
    name: <redacted>
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  - id: ami-<redacted>
    name: <redacted>
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-<redacted>
    name: <redacted>
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-<redacted>
    name: <redacted>
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  instanceProfile: prd-<redacted>
  securityGroups:
  - id: sg-<redacted>
    name: <redacted>
  - id: sg-<redacted>
    name: prd-<redacted>
  subnets:
  - id: subnet-<redacted>
    zone: <redacted>
  - id: subnet-<redacted>
    zone: <redacted>
  - id: subnet-<redacted>
    zone: <redacted>

Versions:

jmdeal commented 3 months ago

Could you share your full Karpenter logs? There should be a message starting with "disrupting via" that precedes any voluntary disruption. There should also be a different message before any spot interruption event, though Karpenter does try to preemptively remove the instance, so you could see Karpenter performing the deletion in that scenario.
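
(If it helps, something along these lines should pull the relevant messages, assuming Karpenter runs as a deployment named karpenter in the karpenter namespace; adjust names to your install:)

kubectl logs -n karpenter deployment/karpenter --since=48h \
  | grep -E 'disrupting via|tainted node|deleted node'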

hybby commented 2 months ago

Hi @jmdeal - thanks for looking at this.

The strange thing about this particular event that I'm highlighting was that there was no "disrupting via" message, even though we do see those messages for earlier terminations. I'm interested to dig into scenarios where Karpenter would not log when it deletes a node and terminates an instance.

Please find the full Karpenter log for the period 2024-05-16T02:.* - I've had to redact certain information, but in all other respects the log is complete as reported by Karpenter (it's effectively a grep '2024-05-16T02:' karpenter.log). I hope that's okay. I've left partial node identifiers intact - the node in question, for which no termination reason was logged, was ip-<redacted_subnet>-18-33. This is instance i-0b94ea54e8444e6e6.

{"level":"INFO","time":"2024-05-16T02:10:36.753Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-18-33.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:10:37.073Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-18-33.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:10:37.549Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-txnxx","node":"ip-<redacted_subnet>-18-33.<redacted_region>.compute.internal","provider-id":"aws:///<redacted_region>b/i-0b94ea54e8444e6e6"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-84785587cc-d5xcw\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c76c7b59d-h95j7\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-77d97cb769-xjvc9\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-7dc96d4dcf-qd8mb\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-6b545fc7b-cwtt7\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-85f7ff8787-5xrwf\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-7f5cd875f7-9gfhn\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-78dcbd5fd7-bt5zt\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-89d7d77d8-ml2qh\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c6678c9dd-nq8tb\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-56bcbb5ff5-dx9k7\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.506Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-78d8db59d7-zpsx5\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:10:40.585Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"81.176893ms"}
{"level":"INFO","time":"2024-05-16T02:10:40.585Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":12}
{"level":"INFO","time":"2024-05-16T02:10:40.596Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"spot-b","nodeclaim":"spot-b-r4srl","requests":{"cpu":"6680m","memory":"11294Mi","pods":"20"},"instance-types":"c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c6a.2xlarge and 5 other(s)"}
{"level":"INFO","time":"2024-05-16T02:10:42.578Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-r4srl","provider-id":"aws:///<redacted_region>b/i-03f59bdfa13f2d20e","instance-type":"c5.2xlarge","zone":"<redacted_region>b","capacity-type":"spot","allocatable":{"cpu":"7910m","ephemeral-storage":"35Gi","memory":"14162Mi","pods":"58","vpc.amazonaws.com/pod-eni":"38"}}
{"level":"INFO","time":"2024-05-16T02:10:50.575Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"71.798101ms"}
{"level":"INFO","time":"2024-05-16T02:11:00.582Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"77.31605ms"}
{"level":"INFO","time":"2024-05-16T02:11:03.806Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-r4srl","provider-id":"aws:///<redacted_region>b/i-03f59bdfa13f2d20e","node":"ip-<redacted_subnet>-18-174.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:11:10.576Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-85f7ff8787-5xrwf and 7 other(s)","duration":"71.272592ms"}
{"level":"INFO","time":"2024-05-16T02:11:19.725Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-r4srl","provider-id":"aws:///<redacted_region>b/i-03f59bdfa13f2d20e","node":"ip-<redacted_subnet>-18-174.<redacted_region>.compute.internal","allocatable":{"cpu":"7910m","ephemeral-storage":"37569620724","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"14879196Ki","pods":"58"}}
{"level":"INFO","time":"2024-05-16T02:12:18.536Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:12:19.281Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:12:19.606Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-c2whp","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal","provider-id":"aws:///<redacted_region>a/i-0f938d0d982eddb74"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-5cd5f4d9cd-2tdk2\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-84959f49bb-c6jpp\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-5598ddfb74-gjg4f\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-64876d869-fspnt\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-578648d474-6w72p\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-cbdf4f4f9-66q9g\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-844995f477-4mrmd\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-67ff785bd4-tpjp8\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-64c878f96b-zvp4k\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-67b9f669f8-r4rsp\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.097Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c98cbc6cf-452wr\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:12:24.226Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"131.457729ms"}
{"level":"INFO","time":"2024-05-16T02:12:24.226Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":11}
{"level":"INFO","time":"2024-05-16T02:12:24.253Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"spot-b","nodeclaim":"spot-b-t82jt","requests":{"cpu":"6180m","memory":"11386Mi","pods":"19"},"instance-types":"c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c6a.2xlarge and 5 other(s)"}
{"level":"INFO","time":"2024-05-16T02:12:26.284Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-t82jt","provider-id":"aws:///<redacted_region>a/i-0c946df4c47d0b561","instance-type":"c5.2xlarge","zone":"<redacted_region>a","capacity-type":"spot","allocatable":{"cpu":"7910m","ephemeral-storage":"35Gi","memory":"14162Mi","pods":"58","vpc.amazonaws.com/pod-eni":"38"}}
{"level":"INFO","time":"2024-05-16T02:12:34.167Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"71.301313ms"}
{"level":"INFO","time":"2024-05-16T02:12:44.168Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"71.14042ms"}
{"level":"INFO","time":"2024-05-16T02:12:54.169Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"72.239342ms"}
{"level":"INFO","time":"2024-05-16T02:12:54.692Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-t82jt","provider-id":"aws:///<redacted_region>a/i-0c946df4c47d0b561","node":"ip-<redacted_subnet>-12-14.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:13:04.168Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-844995f477-4mrmd and 6 other(s)","duration":"70.90903ms"}
{"level":"INFO","time":"2024-05-16T02:13:10.726Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-t82jt","provider-id":"aws:///<redacted_region>a/i-0c946df4c47d0b561","node":"ip-<redacted_subnet>-12-14.<redacted_region>.compute.internal","allocatable":{"cpu":"7910m","ephemeral-storage":"37569620724","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"14764516Ki","pods":"58"}}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-84785587cc-tfpwn\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-7f5cd875f7-rz8ds\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-6b545fc7b-5fv7w\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.532Z","logger":"controller.provisioner","message":"pod \"<redacted_service_name>-c76c7b59d-jwntb\" has a preferred TopologySpreadConstraint which can prevent consolidation","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-16T02:13:16.690Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"160.230382ms"}
{"level":"INFO","time":"2024-05-16T02:13:16.690Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":4}
{"level":"INFO","time":"2024-05-16T02:13:16.700Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"spot-b","nodeclaim":"spot-b-stspw","requests":{"cpu":"2680m","memory":"4918Mi","pods":"12"},"instance-types":"c5.2xlarge, c5.4xlarge, c5.xlarge, c5a.2xlarge, c5a.4xlarge and 15 other(s)"}
{"level":"INFO","time":"2024-05-16T02:13:18.510Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-stspw","provider-id":"aws:///<redacted_region>a/i-0cf064b31c2625a0f","instance-type":"c5.xlarge","zone":"<redacted_region>a","capacity-type":"spot","allocatable":{"cpu":"3920m","ephemeral-storage":"35Gi","memory":"6584Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2024-05-16T02:13:26.594Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"62.883376ms"}
{"level":"INFO","time":"2024-05-16T02:13:36.596Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"65.077591ms"}
{"level":"INFO","time":"2024-05-16T02:13:46.481Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-stspw","provider-id":"aws:///<redacted_region>a/i-0cf064b31c2625a0f","node":"ip-<redacted_subnet>-15-39.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:13:46.599Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"67.23661ms"}
{"level":"INFO","time":"2024-05-16T02:13:56.605Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"<redacted_service_name>-c76c7b59d-jwntb","duration":"71.701185ms"}
{"level":"INFO","time":"2024-05-16T02:14:02.522Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"spot-b-stspw","provider-id":"aws:///<redacted_region>a/i-0cf064b31c2625a0f","node":"ip-<redacted_subnet>-15-39.<redacted_region>.compute.internal","allocatable":{"cpu":"3920m","ephemeral-storage":"37569620724","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"6749636Ki","pods":"58"}}

jmdeal commented 2 months ago

Since the logs start with the node you called out, ip-<redacted_subnet>-18-33, I can't tell if there was any previous indication. I did notice a similar event:

{"level":"INFO","time":"2024-05-16T02:12:18.536Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}
{"level":"INFO","time":"2024-05-16T02:12:19.281Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-<redacted_subnet>-12-76.<redacted_region>.compute.internal"}

If this was caused by voluntary disruption (e.g. consolidation, expiration, or drift) there would have been a log line from the disruption controller indicating the disruption decision and an additional log line from when the karpenter.sh/disruption taint was applied by the disruption controller (we only see it applied by the termination controller here). Spot interruption would have also added a log line as well as published an event to the node. As far as I'm aware there isn't any form of disruption driven by Karpenter that should have this shape. Is it possible that some other process is deleting the nodes, triggering Karpenter's termination controller?
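
(One way to check, assuming an EKS cluster with control-plane audit logging shipped to CloudWatch and placeholder names throughout, is to query the API server audit log for the Node delete and look at the requesting identity; a sketch:)

# Sketch: find who issued the Node delete in the EKS audit log (placeholder cluster/node names)
# Epoch times correspond to 2024-05-16T02:00:00Z - 2024-05-16T03:00:00Z
aws logs start-query \
  --log-group-name /aws/eks/<cluster-name>/cluster \
  --start-time 1715824800 --end-time 1715828400 \
  --query-string 'fields @timestamp, @message
    | filter @logStream like /audit/
    | filter @message like /"resource":"nodes"/ and @message like /"verb":"delete"/
    | filter @message like /ip-<redacted_subnet>-18-33/
    | sort @timestamp asc'
# Retrieve the results with: aws logs get-query-results --query-id <returned-query-id>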

github-actions[bot] commented 1 month ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.