aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Spot instances have unacceptably short lifetimes #6425

Open Nuru opened 5 days ago

Nuru commented 5 days ago

Description

The algorithm for selecting Spot Instances is not reliably allocating instances that can run for 30 minutes without interruption. This seems to be new behavior.

Over 5% of Nodes launched were interrupted within 3 minutes of launch. That is barely enough time for a Node to become ready under EKS.

Observed Behavior:

Analyzing logs over about 16 hours, with 240 NodeClaims created, 29 of the Nodes launched (12%) were interrupted. Defining the "time to interruption" as the duration between the "launched nodeclaim" log entry and the "initiating delete from interruption message" log entry, we compute:

Click to Reveal: Time to Interruption by Instance Type

| Instance Type | Time to Interruption |
|--------------------|-------------------------------------------|
| **m6i.4xlarge** | `2m8s`, `2m8s`, `2m19s`, `6m10s`, `9m45s`, `11m16s`, `11m16s` |
| **m6a.4xlarge** | `2m10s`, `16m29s`, `16m29s`, `16m30s`, `16m31s`, `16m32s`, `16m33s` |
| **m7i.4xlarge** | `2m2s`, `2m4s`, `2m4s`, `2m32s`, `4m4s`, `4m4s`, `6m54s` |
| **m7i-flex.4xlarge** | `2m20s`, `2m48s`, `2m50s`, `3m18s`, `3m51s` |
| **c6i.4xlarge** | `1m1s`, `2m32s` |
| **m7a.4xlarge** | `12m25s` |
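The durations in the table use Go-style strings (e.g. `2m8s`). As a minimal sketch (function and variable names are my own, not from Karpenter), they can be parsed and summarized like so:

```python
import re
from statistics import median

def parse_duration(s):
    """Parse a Go-style duration like '2m8s' into total seconds."""
    m = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", s)
    minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return minutes * 60 + seconds

# m6i.4xlarge row from the table above
interruptions = ["2m8s", "2m8s", "2m19s", "6m10s", "9m45s", "11m16s", "11m16s"]
secs = sorted(parse_duration(d) for d in interruptions)
print(median(secs))  # median time to interruption, in seconds
```

For the m6i.4xlarge row this gives a median of 370 seconds (6m10s), well under the 30-minute expectation discussed below in the issue.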

Here is an extreme example. This NodeClaim was interrupted 63 seconds after it was created.

```
{"level":"INFO","time":"2024-06-27T18:02:00.727Z","logger":"controller","message":"created nodeclaim","commit":"490ef94","controller":"provisioner","NodePool":{"name":"transient"},"NodeClaim":{"name":"transient-l7wn6"},"requests":{"cpu":"9180m","ephemeral-storage":"4Gi","memory":"24696Mi","pods":"10"},"instance-types":"c6a.4xlarge, c6i.4xlarge, c6id.4xlarge, c6in.4xlarge, c7a.4xlarge and 10 other(s)"}
{"level":"INFO","time":"2024-06-27T18:02:02.644Z","logger":"controller","message":"launched nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"transient-l7wn6"},"namespace":"","name":"transient-l7wn6","reconcileID":"7780f097-7225-44e8-8f7b-afafe57c9f1c","provider-id":"aws:///us-west-2c/i-057f7e187e91ff637","instance-type":"c6i.4xlarge","zone":"us-west-2c","capacity-type":"spot","allocatable":{"cpu":"15890m","ephemeral-storage":"179Gi","memory":"27381Mi","pods":"234","vpc.amazonaws.com/pod-eni":"54"}}
{"level":"INFO","time":"2024-06-27T18:02:20.906Z","logger":"controller","message":"registered nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"transient-l7wn6"},"namespace":"","name":"transient-l7wn6","reconcileID":"f945cce9-6018-4ff5-b829-180cd10c4c78","provider-id":"aws:///us-west-2c/i-057f7e187e91ff637","Node":{"name":"ip-10-88-139-160.us-west-2.compute.internal"}}
{"level":"INFO","time":"2024-06-27T18:02:33.933Z","logger":"controller","message":"initialized nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"transient-l7wn6"},"namespace":"","name":"transient-l7wn6","reconcileID":"2711f1f7-691b-4125-87e1-34539e2669ca","provider-id":"aws:///us-west-2c/i-057f7e187e91ff637","Node":{"name":"ip-10-88-139-160.us-west-2.compute.internal"},"allocatable":{"cpu":"15890m","ephemeral-storage":"192103530586","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"29286936Ki","pods":"234"}}
{"level":"INFO","time":"2024-06-27T18:03:03.124Z","logger":"controller","message":"initiating delete from interruption message","commit":"490ef94","controller":"interruption","queue":"smrx-core-usw2-auto-karpenter","messageKind":"SpotInterruptionKind","NodeClaim":{"name":"transient-l7wn6"},"action":"CordonAndDrain","Node":{"name":"ip-10-88-139-160.us-west-2.compute.internal"}}
{"level":"INFO","time":"2024-06-27T18:03:31.753Z","logger":"controller","message":"deleted nodeclaim","commit":"490ef94","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"transient-l7wn6"},"namespace":"","name":"transient-l7wn6","reconcileID":"e0ebc180-c263-4e55-a6ad-63eb1d1d4761","Node":{"name":"ip-10-88-139-160.us-west-2.compute.internal"},"provider-id":"aws:///us-west-2c/i-057f7e187e91ff637"}
```
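As a sanity check, the gap can be computed directly from the `time` fields of the "created nodeclaim" and "initiating delete from interruption message" entries above (a minimal sketch; all other log fields are omitted for brevity):

```python
import json
from datetime import datetime

# Only the "time" fields, copied from the two log entries above
created = json.loads('{"time": "2024-06-27T18:02:00.727Z"}')["time"]
interrupted = json.loads('{"time": "2024-06-27T18:03:03.124Z"}')["time"]

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"  # %z accepts a literal "Z" on Python 3.7+
delta = datetime.strptime(interrupted, fmt) - datetime.strptime(created, fmt)
print(delta.total_seconds())  # roughly a minute; 63s counting whole clock seconds
```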

Expected Behavior:

I have a NodePool of Spot Instances handling transient jobs, meaning the pool scales up and down a lot (especially since WhenUnderutilized does not have a waiting period). The median lifespan of a Node in this NodePool is a little over 7 minutes.

I would expect to be able to use Spot Instances comfortably for this NodePool and reliably get Instances that run 30 minutes before being interrupted, especially since I currently have no choice of allocation strategy.

Reproduction Steps (Please include YAML):

Create a spot pool and scale it up and down a lot?

Click to Reveal: Requirements

```yaml
requirements:
  - key: "karpenter.sh/capacity-type"
    operator: "In"
    values:
      - "on-demand"
      - "spot"
  - key: "karpenter.k8s.aws/instance-encryption-in-transit-supported"
    operator: "In"
    values: ["true"]
  - key: "karpenter.k8s.aws/instance-hypervisor"
    operator: In
    values: ["nitro"]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: Gt # Exclude instance types with 1 or 2 vCPUs
    values: ["3"]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: Lt
    values: ["32"]
  - key: "karpenter.k8s.aws/instance-memory"
    operator: Lt
    values: ["86016"]
  - key: "kubernetes.io/arch"
    operator: In
    values:
      - "amd64"
  - key: "karpenter.k8s.aws/instance-generation"
    operator: Gt
    values: ["5"]
  - key: "kubernetes.io/os"
    operator: "In"
    values:
      - "linux"
```
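Not from the issue, just an illustration of the same requirements schema: if particular instance types keep being reclaimed quickly, they could in principle be excluded with a `NotIn` requirement against the well-known `node.kubernetes.io/instance-type` label (the types listed here are only hypothetical examples, not a recommendation):

```yaml
requirements:
  - key: "node.kubernetes.io/instance-type"
    operator: NotIn
    values:
      - "m6i.4xlarge"       # example: short observed spot lifetimes
      - "m7i-flex.4xlarge"  # example: short observed spot lifetimes
```

This narrows flexibility, which is the opposite of what CreateFleet's spot allocation wants, so it is at best a stopgap.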

Versions:

jonathan-innis commented 1 day ago

Can you share the NodePool that you are using here? I'm curious about the flexibility of the request that is being sent to CreateFleet.

Nuru commented 1 day ago

@jonathan-innis What information do you want that is not included in the issue description, particularly under "Reproduction Steps" -> "Click to Reveal: Requirements"?

jonathan-innis commented 1 day ago

Oh, yep. I missed the drop-down section since it was hiding the requirements block.