aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter launches on-demand instances instead of spot #5743

Closed ilkinmammadzada closed 3 months ago

ilkinmammadzada commented 4 months ago

Description

Observed Behavior:

Karpenter launches on-demand instances even though the spot market has more than enough EC2 capacity to satisfy our requirements.

message: {"level":"INFO","time":"2024-02-27T17:43:27.399Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17d6c05","nodepool":"default-uswest2bprod","nodeclaim":"default-uswest2bprod-xbxdw","requests":{"cpu":"120","ephemeral-storage":"45Gi","memory":"194860Mi","pods":"22"},"instance-types":"c6i.32xlarge, c6i.metal, c6id.32xlarge, c6id.metal, c6in.32xlarge and 23 other(s)"}

message: {"level":"WARN","time":"2024-02-27T17:43:30.235Z","logger":"controller.nodeclaim.lifecycle","message":"at least 5 instance types are recommended when flexible to spot but requesting on-demand, the current provisioning request only has 1 instance type options","commit":"17d6c05","nodeclaim":"default-uswest2bprod-xbxdw"}

message: {"level":"INFO","time":"2024-02-27T17:43:31.886Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17d6c05","nodeclaim":"default-uswest2bprod-xbxdw","provider-id":"aws:///us-west-2b/i-xxxx","instance-type":"c6i.32xlarge","zone":"us-west-2b","capacity-type":"on-demand","allocatable":{"cpu":"125610m","ephemeral-storage":"206336Mi","memory":"235731254Ki","pods":"200","vpc.amazonaws.com/pod-eni":"107"}}

message: {"level":"INFO","time":"2024-02-27T17:47:05.891Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17d6c05","nodeclaim":"default-uswest2bprod-xbxdw","provider-id":"aws:///us-west-2b/i-xxxx","node":"yyyy"}

Expected Behavior:

I would expect Karpenter to launch on-demand instances only if no spot instances satisfy the given requirements. (Karpenter prioritizes Spot offerings if the NodePool allows both Spot and on-demand instances.)

Reproduction Steps (Please include YAML):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-uswest2bprod
spec:
  template:
    metadata:
      labels:
        x.com/pool: default
    spec:
      nodeClassRef:
        name: default-uswest2bprod
      requirements:
      - key: x.com/pool
        operator: In
        values:
        - default
      - key: karpenter.k8s.aws/instance-cpu
        operator: Gt
        values:
        - "31"
      - key: karpenter.k8s.aws/instance-cpu
        operator: Lt
        values:
        - "192"
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - m5
        - m5d
        - m5dn
        - m5n
        - m5zn
        - m6i
        - m6id
        - m6idn
        - m6in
        - m7i
        - r5
        - r5b
        - r5dn
        - r5n
        - r6i
        - r6id
        - r6idn
        - r6in
        - r7i
        - i4i
        - c5
        - c5d
        - c5n
        - c6i
        - c6id
        - c6in
        - c7i
        - x2idn
        - x2iedn
        - x2iezn
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - us-west-2b
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      startupTaints:
      - effect: NoSchedule
        key: key
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default-uswest2bprod
spec:
  amiFamily: Custom
  instanceStorePolicy: RAID0
  amiSelectorTerms:
  - name: xxxx
  blockDeviceMappings:
  - deviceName: /dev/sda1
    ebs:
      deleteOnTermination: true
      volumeSize: 225Gi
      volumeType: gp2
  instanceProfile: xxxx
  securityGroupSelectorTerms:
  - tags:
      Name: xxxx
  subnetSelectorTerms:
  - tags:
      Name: xxxx

Versions:

jigisha620 commented 4 months ago

@ilkinmammadzada Would you mind providing your nodeclass configuration as well as your deployment configuration?

ilkinmammadzada commented 4 months ago

@ilkinmammadzada Would you mind providing your nodeclass configuration as well as your deployment configuration?

I added EC2NodeClass details as well.

jonathan-innis commented 4 months ago

@ilkinmammadzada Can you share the NodeClaim associated with one of these requests? That should have all of the instance types that we are trying to launch with, which should help track down whether the spot capacity that we are trying to launch is simply all more expensive than our cheapest on-demand capacity.

We have this filtering function which automatically gets rid of any spot capacity that is more expensive than the cheapest on-demand instance type. The rationale being: why would you go get a spot instance type if an on-demand instance type is cheaper and more available for the pod capacity that you need right now?

ilkinmammadzada commented 4 months ago

@jonathan-innis yeah, sure. They were expired and deleted. I will share once new on-demand EC2s appear.

ilkinmammadzada commented 4 months ago

I'm also curious about the WARN log line (at least 5 instance types are recommended when flexible to spot but requesting on-demand, the current provisioning request only has 1 instance type options). I can see that Karpenter had 28 different instance types in the previous log line. Probably Karpenter filtered them out, as @jonathan-innis mentioned.

ilkinmammadzada commented 4 months ago

It happened again; you can check the example NodeClaim. Probably unrelated to this issue, but you can see that the ephemeral-storage information is also wrong (ephemeral-storage: 89Gi); I would expect 225Gi (volumeSize: 225Gi).

You may notice a region difference (us-west-2 vs. us-east-1); their configurations are identical.

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  annotations:
    karpenter.k8s.aws/tagged: "true"
  creationTimestamp: "2024-02-29T17:03:48Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-useast1bprod-
  generation: 1
  labels:
    karpenter.k8s.aws/instance-category: m
    karpenter.k8s.aws/instance-cpu: "128"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
    karpenter.k8s.aws/instance-family: m6i
    karpenter.k8s.aws/instance-generation: "6"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "524288"
    karpenter.k8s.aws/instance-network-bandwidth: "50000"
    karpenter.k8s.aws/instance-size: 32xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: default-useast1bprod
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m6i.32xlarge
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1b
    xxx.com/pool: default
  name: default-useast1bprod-xxxx
  ownerReferences:
  - apiVersion: karpenter.sh/v1beta1
    blockOwnerDeletion: true
    kind: NodePool
    name: default-useast1bprod
    uid: 70af2e0a-50cb-4b1e-9828-5cdace4a3735
  resourceVersion: "14879954814"
  uid: 0551e82c-d740-4d32-95a9-1e98e646426e
spec:
  kubelet:
    maxPods: 200
    systemReserved:
      cpu: "2"
      memory: 4G
  nodeClassRef:
    name: default-useast1bprod
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - i4i.32xlarge
    - i4i.metal
    - m6i.32xlarge
    - m6i.metal
    - m6id.32xlarge
    - m6id.metal
    - m6idn.32xlarge
    - m6idn.metal
    - m6in.32xlarge
    - m6in.metal
    - r6i.32xlarge
    - r6i.metal
    - r6id.32xlarge
    - r6id.metal
    - r6idn.32xlarge
    - r6idn.metal
    - r6in.32xlarge
    - r6in.metal
    - x2idn.32xlarge
    - x2idn.metal
    - x2iedn.32xlarge
    - x2iedn.metal
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1b
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default-useast1bprod
  - key: xxx.com/pool
    operator: In
    values:
    - default
  - key: karpenter.k8s.aws/instance-cpu
    operator: Gt
    values:
    - "31"
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - c5
    - c5d
    - c5n
    - c6i
    - c6id
    - c6in
    - c7i
    - i4i
    - m5
    - m5d
    - m5dn
    - m5n
    - m5zn
    - m6i
    - m6id
    - m6idn
    - m6in
    - m7i
    - r5
    - r5b
    - r5dn
    - r5n
    - r6i
    - r6id
    - r6idn
    - r6in
    - r7i
    - x2idn
    - x2iedn
    - x2iezn
  resources:
    requests:
      cpu: 125200m
      ephemeral-storage: 37Gi
      memory: 265898Mi
      pods: "20"
  startupTaints:
  - effect: NoSchedule
    key: key
status:
  allocatable:
    cpu: 125610m
    ephemeral-storage: 89Gi
    memory: 484033846Ki
    pods: "200"
    vpc.amazonaws.com/pod-eni: "107"
  capacity:
    cpu: "128"
    ephemeral-storage: 100Gi
    memory: 484966Mi
    pods: "200"
    vpc.amazonaws.com/pod-eni: "107"
  conditions:
  - lastTransitionTime: "2024-03-01T09:03:48Z"
    severity: Warning
    status: "True"
    type: Expired
  - lastTransitionTime: "2024-02-29T17:07:57Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-02-29T17:03:52Z"
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-02-29T17:07:57Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-02-29T17:07:37Z"
    status: "True"
    type: Registered
  imageID: ami-xxxx
  nodeName: xxxxx
  providerID: aws:///us-east-1b/i-xxxxx

ilkinmammadzada commented 4 months ago

I just checked the CreateFleet logs to confirm which EC2 instance types were filtered. Apparently Karpenter added only one instance type. It is also really weird that I can't find a "spot" attempt before the "on-demand" one (checking logs between the nodeClaim creation time and the nodeClaim launch time). I'm seeing an error right before the on-demand launch: eventName: DescribeLaunchTemplates -> errorCode: Client.InvalidLaunchTemplateName.NotFoundException, errorMessage: At least one of the launch templates specified in the request does not exist. IIRC I saw the same error in the previous case as well. Could this be the clue to a potential bug?
I will try to check when it happens again.

{
  "CreateFleetRequest": {
    "TargetCapacitySpecification": {
      "DefaultTargetCapacityType": "on-demand",
      "TotalTargetCapacity": 4
    },
    "Type": "instant",
    "OnDemandOptions": {
      "AllocationStrategy": "lowest-price"
    },
    "LaunchTemplateConfigs": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "karpenter.k8s.aws/xxx",
        "Version": "$Latest"
      },
      "Overrides": {
        "ImageId": "ami-xxx",
        "AvailabilityZone": "us-east-1b",
        "tag": 1,
        "SubnetId": "subnet-xxx",
        "InstanceType": "m6i.32xlarge"
      },
      "tag": 1
    },
    "TagSpecification": []
  }
}

jigisha620 commented 4 months ago

Unrelated to this issue, but you can see that the ephemeral-storage information is also wrong (ephemeral-storage: 89Gi); I would expect 225Gi (volumeSize: 225Gi)

There is an eventual consistency issue with how we compute ephemeral storage. We are tracking it in a Github Issue - https://github.com/aws/karpenter-provider-aws/issues/5756

engedaam commented 4 months ago

@ilkinmammadzada Are there multiple NodePools and NodeClasses in this cluster?

ilkinmammadzada commented 4 months ago

@ilkinmammadzada Are there multiple NodePools and NodeClasses in this cluster?

There are several NodePools, but they are mutually exclusive (xxx.com/pool). I would like to understand why Karpenter wanted to launch only the following instance categories and sizes: "m, i, r, x" and "32xl, metal".

engedaam commented 4 months ago

The NodePool you have provided does not match the NodeClaim that was provisioned. Can you provide the default-useast1bprod NodePool?

ilkinmammadzada commented 4 months ago

Unrelated to this issue, but you can see that the ephemeral-storage information is also wrong (ephemeral-storage: 89Gi); I would expect 225Gi (volumeSize: 225Gi)

There is an eventual consistency issue with how we compute ephemeral storage. We are tracking it in a Github Issue - #5756

I think "eventual consistency" doesn't work here. Because karpenter needs to know correct ephemeral storage size of the host before launching. Otherwise it will not launch any instance if your unschedulable pod requests 150Gi ephemeral-storage.

ilkinmammadzada commented 4 months ago

You may notice a region difference (us-west-2 vs. us-east-1); their configurations are identical.

@engedaam as I mentioned above, they are identical; the only difference is the region.

engedaam commented 4 months ago

There are several NodePools, but they are mutually exclusive (xxx.com/pool). I would like to understand why Karpenter wanted to launch only the following instance categories and sizes: "m, i, r, x" and "32xl, metal".

Can you also share your deployment? Since Karpenter is responding to the pending pods, it would be good to know which pods Karpenter was trying to provision this nodeClaim for.

ilkinmammadzada commented 4 months ago

FWIW, we had ~700 unschedulable pods during that time, but I found all of them and confirmed that none of them has any specific requirement (like instance-type, instance-category, etc.).

jonathan-innis commented 4 months ago

@ilkinmammadzada I want to take a step back here to understand a bit better why you want spot over on-demand here, and why not just constrain the NodePool to only support spot capacity. My assumption, based on the information that you reported, is that we are scheduling too many pods to a single node, which is causing the only capacity that's available to be on-demand.

I'm assuming that this is a concern to you because it would be cheaper to have two smaller spot nodes that schedule all of the capacity vs. a single on-demand node that schedules all of the capacity.

ilkinmammadzada commented 4 months ago

@jonathan-innis we have included on-demand to prevent an outage due to spot instance capacity problems. We were thinking that Karpenter would launch on-demand only if there is no spot instance available. (If your Karpenter NodePool allows both Spot and On-Demand capacity, Karpenter will fallback to provision On-Demand capacity if there is no Spot capacity available. - https://karpenter.sh/docs/faq/#what-if-there-is-no-spot-capacity-will-karpenter-use-on-demand)

Yes, you are right, our concern is only cost. We think it was possible to launch some spot EC2s to respond to those unschedulable pods.

jigisha620 commented 4 months ago

There are several NodePools, but they are mutually exclusive (xxx.com/pool). I would like to understand why Karpenter wanted to launch only the following instance categories and sizes: "m, i, r, x" and "32xl, metal".

From the nodeClaim that you shared earlier, the workload needs - allocatable: cpu: 125610m ephemeral-storage: 89Gi memory: 484033846Ki pods: "200" vpc.amazonaws.com/pod-eni: "107"

Upon filtering the instance families provided in the NodePool against the available offerings that satisfy these constraints, we are left with instance types in the "m, i, r, x" families and the "32xl, metal" sizes. Karpenter further filters these instance types to remove metal if more appropriate instance types are available.

jigisha620 commented 4 months ago

Did you also see any logs mentioning anything about "UnfulfillableCapacity"? Such instance types would also be dropped.

jonathan-innis commented 4 months ago

Upon filtering the instance families provided in the NodePool against the available offerings that satisfy these constraints, we are left with instance types in the "m, i, r, x" families and the "32xl, metal" sizes.

Looking at what happened here, I think you may be hitting an issue where Karpenter packs you so tightly that we end up using such big instance types that there is no spot availability left for them. If your NodePool is flexible to both spot and on-demand, our scheduling algorithm doesn't know that landing on on-demand-only instance types isn't what you wanted here. The typical way that we tell users to configure fallback is through "weight", since that best communicates to the scheduler that it should stop packing pods onto a node once doing so would fall outside of the capacity type that you want.
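
For illustration, here is a minimal sketch of that weighted-NodePool fallback pattern. The NodePool names, weight values, and trimmed-down requirements are placeholders rather than anything from this issue; a real setup would carry over the full requirements and labels from the existing NodePool.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-spot           # hypothetical name
spec:
  weight: 100                  # higher weight: the scheduler tries this pool first
  template:
    spec:
      nodeClassRef:
        name: default-uswest2bprod
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-on-demand      # hypothetical name
spec:
  weight: 10                   # lower weight: only used when the spot pool cannot satisfy the pods
  template:
    spec:
      nodeClassRef:
        name: default-uswest2bprod
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand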

Another option here (if you didn't want to strictly break up NodePools into different weights) would be to use the new minValues field that we shipped recently with our requirements block. You could add this field to your node.kubernetes.io/instance-type and make it something like 30 or 40 so that Karpenter never overly constrains your instance types when it is scheduling out your pods.
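
As a rough sketch (the exact placement of minValues may differ by Karpenter version, and 30 is just an illustrative number), the requirement could look something like this:

requirements:
- key: node.kubernetes.io/instance-type
  operator: Exists
  minValues: 30   # fail the scheduling attempt rather than launch with fewer than ~30 instance-type options

With something like this in place, Karpenter should refuse to launch a node whose provisioning request has been constrained down to only a handful of (possibly on-demand-only) instance types.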

ilkinmammadzada commented 4 months ago

From the nodeClaim that you shared earlier, the workload needs - allocatable: cpu: 125610m ephemeral-storage: 89Gi memory: 484033846Ki pods: "200" vpc.amazonaws.com/pod-eni: "107"

I don't think that exactly means our workload needed those resources. IIRC, launched nodeclaim -> allocatable shows how many resources the recently launched NodeClaim will have (from Karpenter's view).

Did you also see any logs mentioning anything about "UnfulfillableCapacity"? Such instance types would also be dropped.

Is it okay to check karpenter_nodeclaims_terminated{reason="insufficient_capacity"} metric or would checking logs be more accurate? Thank you!

ilkinmammadzada commented 4 months ago

The typical way that we tell users to configure fallback is through "weight", since that best communicates to the scheduler that it should stop packing pods onto a node once doing so would fall outside of the capacity type that you want.

Yeah, I was also thinking about that, but haven't tried it yet, because I'm seeing that the default-usXXXprod nodepool fails quite often to provision a new EC2 instance (https://github.com/aws/karpenter-provider-aws/issues/5756). I suspected "spot-default" would fail, "on-demand-default" would succeed, and we would end up having more on-demand than we do now. Worth trying, though; I will.

Another option here (if you didn't want to strictly break up NodePools into different weights) would be to use the new minValues field that we shipped recently with our requirements block. You could add this field to your node.kubernetes.io/instance-type and make it something like 30 or 40 so that Karpenter never overly constrains your instance types when it is scheduling out your pods.

Hmm, we started to upgrade Karpenter but haven't experimented with the new feature yet. I will try this one as well.

Thank you!

jonathan-innis commented 4 months ago

I suspected "spot-default" would fail and "on-demand-default" would be success and we would ended-up having more on-demand than now.

I don't think this would be the behavior here. It would actually give the scheduler more information to work with if you used weighting. We would try the spot NodePool, and if we got Insufficient Capacity on the launch, we would go and try to reschedule the pods again; we would re-try the spot NodePool all over again, but this time with smaller instance types, since we know we couldn't get the bigger ones. We would only fall back to on-demand once we had no more options for spot on that first NodePool.

If you are using a single NodePool with fallback, the behavior is a bit different and a little more unexpected. In that case, you get the behavior that you are seeing, where we keep packing pods and constraining the instance types down without considering that we may remove instance types in a way that leaves only on-demand to launch.

ilkinmammadzada commented 4 months ago

@jonathan-innis thanks a lot for the great explanation! I will try it this week.

jonathan-innis commented 4 months ago

Is it okay to check karpenter_nodeclaims_terminated{reason="insufficient_capacity"} metric or would checking logs be more accurate?

I think the logs are probably the best way. Can you confirm that you saw InsufficientCapacity errors around this time? We should log the offering that we are removing inside of the DEBUG logging. If you are just running INFO, you won't see it, but if you have DEBUG, you should look for a log like "removing offering from offerings".
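
If it helps, debug logging can usually be turned on through the Helm chart; the exact value name depends on the chart version you are running, so treat the field below as an assumption and check your chart's values.yaml before applying:

# Helm values snippet (assumed field name: logLevel; verify against your chart version)
# With debug enabled, lines like "removing offering from offerings" should become visible.
logLevel: debug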

jonathan-innis commented 3 months ago

After discussing a bit more offline, I want to summarize the issues that we think we are hitting here:

  1. NodePools that support both on-demand and spot capacity types are subject to situations where Karpenter over-packs nodes so that only a few spot instance types remain available, or packs pods so that only on-demand instance types remain. In these cases, Karpenter will go to the CloudProvider and Create only with on-demand instance types, resulting in the unwanted fallback behavior (in reality, you would have much preferred two smaller spot nodes here). The solve for this is the suggestion shown above: create two different NodePools, one that's spot with a higher weight, the other that's on-demand with a lower weight.
  2. Karpenter is currently caching incorrectly on any value that uses resource.Quantity. This includes the volumeSize field in the blockDeviceMappings as well as the value fields in the kubeReserved and systemReserved portions of the KubeletConfiguration. The practical impact is that NodeClasses that are similar but not identical may get the same entry returned from the instance type cache (which we use to limit the number of instance type structs held in memory at any given time). This was fixed in #5816 and will be back-ported in patch releases all the way back to the v0.32.x minor version.
  3. The instanceStorePolicy caching issue (#5756) may also have been a culprit here, since it could have returned inconsistent information across NodePools/NodeClasses.

ilkinmammadzada commented 3 months ago

Thanks @jonathan-innis! We upgraded to the new version, 0.35.2, and it fixed the problem.