Nuru opened this issue 5 days ago
**jonathan-innis:** Can you share the NodePool that you are using here? I'm curious about the flexibility of the request that is being sent to CreateFleet.

**Nuru:** @jonathan-innis What information do you want that is not included in the issue description, particularly under "Reproduction Steps" -> "Click to Reveal: Requirements"?

**jonathan-innis:** Oh, yep. I missed the drop-down section that was hiding the requirements block.
Description
The algorithm selecting Spot Instances is not reliably allocating instances that can run for 30 minutes without interruption. This seems to be new behavior to me.

Over 5% of the Nodes launched (13 of the 240, per the table below) were interrupted within 3 minutes of launch. That is barely enough time for a Node to become ready under EKS.
Observed Behavior:
Analyzing logs over about 16 hours, with 240 NodeClaims created, 29 (12%) of the Nodes launched were interrupted. Defining the "time to interruption" as the duration between the `launched nodeclaim` log entry and the `initiating delete from interruption message` log entry, we compute:

Click to Reveal: Time to Interruption by Instance Type
| Instance Type | Time to Interruption |
|--------------------|-------------------------------------------|
| **m6i.4xlarge** | `2m8s`, `2m8s`, `2m19s`, `6m10s`, `9m45s`, `11m16s`, `11m16s` |
| **m6a.4xlarge** | `2m10s`, `16m29s`, `16m29s`, `16m30s`, `16m31s`, `16m32s`, `16m33s` |
| **m7i.4xlarge** | `2m2s`, `2m4s`, `2m4s`, `2m32s`, `4m4s`, `4m4s`, `6m54s` |
| **m7i-flex.4xlarge** | `2m20s`, `2m48s`, `2m50s`, `3m18s`, `3m51s` |
| **c6i.4xlarge** | `1m1s`, `2m32s` |
| **m7a.4xlarge** | `12m25s` |

Here is an extreme example. This NodeClaim was interrupted 63 seconds after it was created.
Expected Behavior:
I have a NodePool of Spot Instances handling transient jobs, meaning the pool scales up and down a lot (especially since `WhenUnderutilized` does not have a waiting period). The median lifespan of a Node in this NodePool is a little over 7 minutes.

I would expect I could comfortably use Spot Instances for this NodePool and reliably get Instances that run 30 minutes before being interrupted, especially since I currently do not have a choice of allocation strategy.
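For reference, a minimal sketch of the disruption settings in play (illustrative, not my exact manifest; field names per the v1beta1 NodePool API):

```yaml
# Sketch of the relevant NodePool disruption block (illustrative only).
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    # consolidateAfter cannot be combined with WhenUnderutilized in 0.37,
    # so underutilized Nodes are consolidated without any waiting period.
```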
Reproduction Steps (Please include YAML):
Create a spot pool and scale it up and down a lot?
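Concretely, something like this hypothetical churn workload would do it (every name and number below is an illustrative assumption, not taken from my cluster): a Job sized so each pod claims its own 4xlarge Node for a few minutes, after which consolidation deletes the Node.

```yaml
# Hypothetical reproduction workload -- all names and sizes are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-churn
spec:
  completions: 20   # 20 short-lived pods in total
  parallelism: 4    # 4 at a time, so Nodes come and go continuously
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: work
          image: public.ecr.aws/docker/library/busybox:stable
          command: ["sh", "-c", "sleep 300"]  # ~5 minutes of "work"
          resources:
            requests:
              cpu: "14"  # ~14 of a 4xlarge's 16 vCPUs, forcing one pod per Node
```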
Click to Reveal: Requirements
```yaml
requirements:
  - key: "karpenter.sh/capacity-type"
    operator: "In"
    values:
      - "on-demand"
      - "spot"
  - key: "karpenter.k8s.aws/instance-encryption-in-transit-supported"
    operator: "In"
    values: ["true"]
  - key: "karpenter.k8s.aws/instance-hypervisor"
    operator: In
    values: ["nitro"]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: Gt # Exclude instance types with 1 or 2 vCPUs
    values: ["3"]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: Lt
    values: ["32"]
  - key: "karpenter.k8s.aws/instance-memory"
    operator: Lt
    values: ["86016"]
  - key: "kubernetes.io/arch"
    operator: In
    values:
      - "amd64"
  - key: "karpenter.k8s.aws/instance-generation"
    operator: Gt
    values: ["5"]
  - key: "kubernetes.io/os"
    operator: "In"
    values:
      - "linux"
```
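For completeness, a minimal NodePool sketch showing where that requirements block sits (the metadata name and the EC2NodeClass reference are assumptions for illustration, not my actual manifest):

```yaml
# Minimal v1beta1 NodePool wrapper (name and nodeClassRef are hypothetical;
# the full requirements list is the block shown above).
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: transient-jobs  # hypothetical name
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default  # hypothetical EC2NodeClass
      requirements:
        # ...the requirements shown above...
        - key: "karpenter.sh/capacity-type"
          operator: "In"
          values: ["on-demand", "spot"]
```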
Versions:

- Chart Version: 0.37.0
- Kubernetes Version (`kubectl version`): 1.29.3

- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment