I have a similar problem: even using topology spread, all pods are allocated on the same machine:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-t6
  namespace: test
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx
```
Karpenter: v0.6.1
Kubernetes: v1.20.11
> Karpenter should place the nodes in multiple subnets
Thank you Rolind. If I understand this correctly, the subnet selector instructs Karpenter which subnets are available (in this case it detected all three subnets correctly). Based on the available subnets, Karpenter computes which instance types can be used and how to override the launch template. However, the scheduling logic does not automatically spread the pods across available zones. Instead, Karpenter relies on topology spread constraints to achieve that.
> I have a similar problem: even using topology spread, all pods are allocated on the same machine
Hi Fabiano, this is indeed strange. Can you also share the Karpenter logs and your provisioner config?
> I have a similar problem: even using topology spread, all pods are allocated on the same machine
We've seen issues with the default scheduler where this can happen if capacity is already available. If you have a single node in a cluster, and deploy 3 pods with hostname/topology spread, the kube scheduler will "spread" across existing nodes if there's room. If the node has room for all 3 pods, they'll happily all schedule there. Karpenter knows the possible zones and will force spread them during provisioning, but we can't control the kube scheduler.
> All the nodes were placed in the same subnet and same AZ (us-west-1a).
When using spot, Karpenter will choose the cheapest instance type. In this case, it looks like us-west-1a was the cheapest.
> I have a similar problem: even using topology spread, all pods are allocated on the same machine
>
> Hi Fabiano, this is indeed strange. Can you also share the Karpenter logs and your provisioner config?
Sure. Some details: there was a single node on the cluster, provisioned by EKS managed node groups.
And here is how the pods were allocated:
As noted by @ellistarn, my issue is not related to Karpenter, but to the way the Kubernetes scheduler works. The documentation about topology constraints points out this limitation: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#known-limitations.
Will Karpenter be able to circumvent this limitation in the future?
I don't know of a path forward to circumvent the kube scheduler, beyond becoming a custom scheduler.
PodAntiAffinity will help with this, since its spec doesn't allow for multiple pods per topology key, whereas topology spread does.
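As an illustration, a minimal sketch of such a required anti-affinity (the `app: nginx` label is a placeholder for your own pod labels):

```yaml
# Pod spec fragment: a hard anti-affinity that permits at most one
# matching pod per zone, so e.g. 3 replicas force 3 distinct zones.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: nginx  # placeholder; select your own pods
      topologyKey: topology.kubernetes.io/zone
```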
If it were possible to add a "max" to topology, instead of "max skew", this would solve the problem as well, but it would need to happen upstream.
If you change the scheduler name in the pod spec to something other than the default, it will skip the kube-scheduler and Karpenter will keep working. However, Karpenter doesn't reuse existing capacity (it relies on the kube-scheduler for that), so you will always get new nodes for these pods in this configuration. It's definitely a bit of a hack.
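A rough sketch of that hack, assuming any scheduler name other than `default-scheduler` works as described (the name below is made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inflate-0  # hypothetical pod
spec:
  # The kube-scheduler ignores pods with a foreign schedulerName,
  # leaving Karpenter to provision capacity for them.
  schedulerName: not-default-scheduler
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2
```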
Is Karpenter aware of faulty AZs? One of the big challenges with topology constraints and faulty AZs is that both ASGs and the kube-scheduler will want to launch instances/workloads in faulty AZs.
I don't think you would want to temporarily avoid spread due to a faulty AZ. You sacrifice static stability if you try to detect outages and evacuate. If we can't get capacity for a pod that wants to run in an AZ, we need to keep retrying until it succeeds. This is how the kube-scheduler works as well.
Sort of. When there is an outage, the nodes in the faulty AZ go away and the scheduler can carry on. If Karpenter still insists on using a faulty AZ, it'll make the situation worse.
Labeled for closure due to inactivity in 10 days.
Re-opening this because I think this might be unexpected. I have a brand new cluster set up with Karpenter running on Fargate. All nodes are spun up in a single AZ, all on the same instance type (which is not great for spot diversification).
This is my infra:
```ts
import { Stack, IResource, StackProps, aws_eks as eks } from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { Construct } from 'constructs';
import { IVpc } from 'aws-cdk-lib/aws-ec2';

export interface EksLabStackProps extends StackProps {
  vpc: IVpc;
}

export class EksLabStack extends Stack {
  constructor(scope: Construct, id: string, props: EksLabStackProps) {
    super(scope, id, props);

    const clusterProvider = new blueprints.GenericClusterProvider({
      version: eks.KubernetesVersion.V1_21,
      fargateProfiles: {
        karpenter: {
          fargateProfileName: 'karpenter',
          selectors: [{ namespace: 'karpenter' }],
        },
      },
    });

    const addOns: Array<blueprints.ClusterAddOn> = [
      new blueprints.addons.AwsLoadBalancerControllerAddOn(),
      new blueprints.addons.CalicoOperatorAddOn(),
      new blueprints.addons.CoreDnsAddOn(),
      new blueprints.addons.KubeProxyAddOn(),
      new blueprints.addons.MetricsServerAddOn(),
      new blueprints.addons.VpcCniAddOn(),
      new blueprints.addons.KarpenterAddOn({
        amiFamily: 'AL2',
        provisionerSpecs: {
          'karpenter.sh/capacity-type': ['spot'],
          'kubernetes.io/arch': ['amd64', 'arm64'],
          'topology.kubernetes.io/zone': ['eu-west-1a', 'eu-west-1b'],
        },
        subnetTags: {
          'karpenter.sh/discovery': 'eks-lab',
        },
        securityGroupTags: {
          'karpenter.sh/discovery': 'eks-lab',
        },
      }),
    ];

    const resourceProviders = new Map<string, blueprints.ResourceProvider<IResource>>([
      [
        blueprints.GlobalResources.Vpc,
        new blueprints.DirectVpcProvider(props.vpc),
      ],
    ]);

    new blueprints.EksBlueprint(
      this,
      {
        addOns,
        clusterProvider,
        resourceProviders,
        id: 'eks-lab',
      },
      props,
    );
  }
}
```
Result:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
fargate-ip-10-100-36-0.eu-west-1.compute.internal Ready <none> 21h v1.21.9-eks-14c7a48 10.100.36.0 <none> Amazon Linux 2 4.14.281-212.502.amzn2.x86_64 containerd://1.4.13
ip-10-100-17-193.eu-west-1.compute.internal Ready <none> 24h v1.21.12-eks-5308cf7 10.100.17.193 <none> Amazon Linux 2 5.4.204-113.362.amzn2.x86_64 containerd://1.4.13
ip-10-100-23-20.eu-west-1.compute.internal Ready <none> 24h v1.21.12-eks-5308cf7 10.100.23.20 <none> Amazon Linux 2 5.4.204-113.362.amzn2.x86_64 containerd://1.4.13
ip-10-100-23-21.eu-west-1.compute.internal Ready <none> 24h v1.21.12-eks-5308cf7 10.100.23.21 <none> Amazon Linux 2 5.4.204-113.362.amzn2.x86_64 containerd://1.4.13
kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-apiserver calico-apiserver-5d4577557c-66tm7 1/1 Running 0 24h 10.100.18.219 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
calico-apiserver calico-apiserver-5d4577557c-jjxb6 1/1 Running 0 24h 10.100.22.55 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-operator tigera-operator-57b5454687-z7jm7 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
calico-system calico-kube-controllers-57f88bc9fd-g2pjm 1/1 Running 0 24h 10.100.16.132 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-system calico-node-d5kfv 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
calico-system calico-node-vs6bc 1/1 Running 0 24h 10.100.23.20 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
calico-system calico-node-wmdrr 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-system calico-typha-5857f899bd-v94bz 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-system calico-typha-5857f899bd-zfd2p 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
karpenter blueprints-addon-karpenter-7bb874498-n8kmj 2/2 Running 0 21h 10.100.36.0 fargate-ip-10-100-36-0.eu-west-1.compute.internal <none> <none>
kube-system aws-load-balancer-controller-7cb845b549-t79qw 1/1 Running 0 22h 10.100.30.97 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system aws-load-balancer-controller-7cb845b549-xcm7b 1/1 Running 0 22h 10.100.29.92 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system aws-node-5d2bw 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system aws-node-gjftc 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
kube-system aws-node-ngtzj 1/1 Running 0 24h 10.100.23.20 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system blueprints-addon-metrics-server-c758cc974-dwr58 1/1 Running 0 22h 10.100.29.4 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system coredns-7cc879f8db-2hjl5 1/1 Running 0 21h 10.100.23.66 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system coredns-7cc879f8db-fx2wh 1/1 Running 0 21h 10.100.25.232 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-fz2gg 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-mmtzq 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-twgjr 1/1 Running 0 24h 10.100.23.20 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
All instances are of type m5.large and landed in the same AZ, although 2 AZs were discovered:
controller.node-state Discovered subnets: [subnet-0058d68fa0e6c93fa (eu-west-1a) subnet-0ed7d9dc5dac8e058 (eu-west-1b)]
```
aws ec2 describe-instances | jq '.Reservations[].Instances[] | {Type: .InstanceType, DNS: .PrivateDnsName, AZ: .Placement.AvailabilityZone}'
{
  "Type": "m5.large",
  "DNS": "ip-10-100-23-21.eu-west-1.compute.internal",
  "AZ": "eu-west-1a"
}
{
  "Type": "m5.large",
  "DNS": "ip-10-100-17-193.eu-west-1.compute.internal",
  "AZ": "eu-west-1a"
}
{
  "Type": "m5.large",
  "DNS": "ip-10-100-23-20.eu-west-1.compute.internal",
  "AZ": "eu-west-1a"
}
```
This is similar to https://github.com/aws/karpenter/issues/1810 . Kubernetes provides native methods to indicate which workloads you want to spread across AZs. You'll need to add topology spread constraints to your workloads. See https://karpenter.sh/v0.13.2/tasks/scheduling/#topology-spread and https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/ .
Hello guys, I've tried several combinations of PodAffinity/AntiAffinity and TopologySpread as @tzneal mentioned, trying to keep pods away from each other across nodes and AZs. Karpenter discovered my 3 subnets, BTW.
But in every single test round (blasting 20 "inflate" pods at the cluster), I got the same behavior described by @rolindroy: Karpenter provisions enough nodes to cope with the demand, but they all land in the same subnet/zone. In every test round a different subnet was chosen, but then all 4~5 nodes were created in it (in only one subnet/AZ). I think, for the sake of availability, these new nodes should be spread over the subnets/AZs.
Regarding Affinity and Topology Spread, does anyone have a working configuration example to share?
Is this actual behavior of Karpenter the expected one? I think we should be able to choose how it spreads the nodes. The Kubernetes scheduler will use whatever nodes are available, respecting options like Affinity and Topology, so it is Karpenter's job to provision them over different subnets/AZs if we tell it to do so.
Regards!
Guys, new findings. Got help from a colleague digging through the docs (thanks, Alex!).
I've tested the deployment with Topology Spread Constraints, using the topologyKey `topology.kubernetes.io/zone` but with the `whenUnsatisfiable` parameter set to `DoNotSchedule`, and this did the trick for me. Using the `ScheduleAnyway` value there doesn't drive Karpenter to spread nodes. For the `kubernetes.io/hostname` topologyKey I left it as `ScheduleAnyway`.
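Sketching the combination described above, for a hypothetical workload labeled `app: inflate` (labels are placeholders):

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # hard constraint: drives Karpenter to provision across zones
  labelSelector:
    matchLabels:
      app: inflate
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway  # soft preference for spreading across individual nodes
  labelSelector:
    matchLabels:
      app: inflate
```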
In the Karpenter docs, this section (https://karpenter.sh/v0.16.3/tasks/scheduling/#topology-spread) should be more specific on this topic. The definitions in the Kubernetes docs do not make this clear either (https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#spread-constraint-definition).
> I've tested the deployment with Topology Spread Constraints, using the topologyKey `topology.kubernetes.io/zone` but with the `whenUnsatisfiable` parameter set to `DoNotSchedule`, and this did the trick for me. Using the `ScheduleAnyway` value there doesn't drive Karpenter to spread nodes. For the `kubernetes.io/hostname` topologyKey I left it as `ScheduleAnyway`.

I expect Karpenter to balance the number of nodes across AZs without changing pod specs, though, both during node provisioning and consolidation. Can this be included in the roadmap?
Please reopen this issue?
> All the nodes were placed in the same subnet and same AZ (us-west-1a).
>
> When using spot, Karpenter will choose the cheapest instance type. In this case, it looks like us-west-1a was the cheapest.
Is it possible to use different provisioning logic, just like expanders with Cluster Autoscaler? CA even allows users to list the expanders to use in descending order of importance.
At least, this should be stated clearly in the doc: that Karpenter chooses the cheapest instance type when using spot instances, and that Karpenter does not consider AZ balancing; users have to configure pod topology spread with the `whenUnsatisfiable: DoNotSchedule` property to ensure AZ spread.
> At least, this should be stated clearly in the doc: that Karpenter chooses the cheapest instance type when using spot instances

It isn't referenced in many places, but this bit of the FAQ covers it: https://karpenter.sh/v0.19.2/faq/#how-does-karpenter-dynamically-select-instance-types
Karpenter recently changed from `capacity-optimized-prioritized` to `price-capacity-optimized`.
Re: topology, I agree a better job could be done here. I too was surprised, when I started using Karpenter, that it wasn't scheduling pods in different AZs. But Karpenter doesn't know your workloads; it doesn't know if they are sensitive to AZ costs, for example. So it's up to the operator to spec the jobs to be topology aware or not.
And once that is coded, the kube-scheduler will place them accordingly.
/reopen please - I get the argument on pricing; however, the spot price will often be identical across zones, so in that circumstance I'd expect Karpenter to bias towards a zone spread as a default.
Thanks for continuing to explore this issue. I could see an argument for providing a configuration knob at the provisioner level. e.g.
```yaml
kind: Provisioner
spec:
  requirements: ...
  topologySpreadConstraints: # spread nodes
```
We'd need to think through the implications of it, though.
We definitely can't enable spread for spot by default, as it goes against Karpenter's cost optimization goals.
I would agree; I still want to be able to have the provisioner spread my nodes across AZs even when using spot instances. I still want the cost benefits of spot instances, but also to have them evenly provisioned across AZs.
I agree with @nparfait.
@rolindroy @junowong0114 @nparfait Have you tried looking at https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints to define cluster-wide topologySpreadConstraints that could spread the pods across the nodes?
The concern with doing implied `topologySpreadConstraints` without the workload requirement is that there is no hard requirement for the kube-scheduler to schedule those pods across topologies, which means that we could launch nodes across topologies (say 3 nodes across 3 domains) and then the kube-scheduler binds all pods to two nodes, meaning that one is empty and we would deprovision one of the nodes in one of those domains.
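For anyone considering the cluster-level defaults linked above, a rough sketch of the kube-scheduler configuration they require (assumption: you can pass a config file to the kube-scheduler, which managed control planes such as EKS do not expose):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          # Applied only to pods that define no topologySpreadConstraints themselves.
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
```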
> @rolindroy @junowong0114 @nparfait Have you tried looking at https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints to define cluster-wide topologySpreadConstraints that could spread the pods across the nodes?
>
> The concern with doing implied `topologySpreadConstraints` without the workload requirement is that there is no hard requirement for the kube-scheduler to schedule those pods across topologies, which means that we could launch nodes across topologies (say 3 nodes across 3 domains) and then the kube-scheduler binds all pods to two nodes, meaning that one is empty and we would deprovision one of the nodes in one of those domains.
This isn't to do with pod scheduling. This is to do with nodes being placed in different AZs when multiple spot instances are provisioned.
> in different AZs when multiple spot instances are provisioned
What's the reason you want to spread them this way? Is it to reduce the blast radius of an AZ outage? To reduce the chance of being reclaimed? Why do you want this at the node level and not at the application level (which is generally where I assume the resiliency requirement needs to lie)?
> in different AZs when multiple spot instances are provisioned
>
> What's the reason you want to spread them this way? Is it to reduce the blast radius of an AZ outage? To reduce the chance of being reclaimed? Why do you want this at the node level and not at the application level (which is generally where I assume the resiliency requirement needs to lie)?
I do have my pods spread across different nodes in different AZs. The issue here was I had Karpenter provision 2 spot nodes in the same AZ instead of across 2 AZs.
> I do have my pods spread across different nodes in different AZs
I'm confused then. Do your workloads have `topologySpreadConstraints` with the `topologyKey` set to `topology.kubernetes.io/zone` and a `maxSkew` of 1? If so, Karpenter should spread these pods across zones and start provisioning nodes evenly across zones.
> @rolindroy @junowong0114 @nparfait Have you tried looking at https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints to define cluster-wide topologySpreadConstraints that could spread the pods across the nodes?
>
> The concern with doing implied `topologySpreadConstraints` without the workload requirement is that there is no hard requirement for the kube-scheduler to schedule those pods across topologies, which means that we could launch nodes across topologies (say 3 nodes across 3 domains) and then the kube-scheduler binds all pods to two nodes, meaning that one is empty and we would deprovision one of the nodes in one of those domains.
I have a set of test-environment workloads in different namespaces that I'd like to have spread across multiple AZs so that I can avoid resource exhaustion. See #2921.
The kube topologySpreadConstraints API was enhanced with minDomains so that it can spread pods across multiple AZs even when no nodes exist in an AZ yet. However, it only does this within a specific namespace, and I don't have enough pods within each namespace to get an effective spread.
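For illustration, a sketch of such a constraint; note that `minDomains` only takes effect with `whenUnsatisfiable: DoNotSchedule`, and the `app: test-env` label is a placeholder:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 3  # fewer than 3 zones with matching pods counts as unsatisfiable, even if some zones have no nodes yet
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # required for minDomains to apply
  labelSelector:
    matchLabels:
      app: test-env
```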
Since this is a test environment, I'm not concerned about high availability of any one test env. At most, I am hoping for quick failover in the event of a zone failure. Having some existing capacity in other zones would help. I'm also hoping that some AZ spread would help mitigate the IP exhaustion that can occur when all nodes are in a single zone/subnet.
Improvements towards IP address exhaustion in #2921 would reduce my desire for this feature.
You can apply the same label to multiple different deployments and use a topology spread across that instead:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        spread: myspread
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-deployment
  labels:
    app: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
        spread: myspread
```

with each pod spec carrying the shared constraint:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      spread: myspread
```
I don't expect we'll start intentionally putting nodes in different AZs unless requested via scheduling constraints on the workloads themselves.
@jonathan-innis It is a known limitation of topology spread constraints that the set of available values of a target topology, e.g. `topology.kubernetes.io/zone`, is not known to the scheduler; cf. https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#known-limitations.
Thus, if nodes are provisioned in one availability zone only, the scheduler will happily schedule a new pod there: zones with no nodes are not counted as domains, so the skew (the number of pods in each topology domain minus the minimum number of pods over all topology domains) is zero in this case.
To overcome the problem, you must somehow control how nodes are provisioned across AZs before the scheduler starts its job of scheduling particular pods. More precisely, you need a strategy that provides all required values for the label `topology.kubernetes.io/zone` before requesting pod scheduling from the scheduler.
Putting it all together, I'm not sure Karpenter can solve this more general problem, though I agree that Karpenter should allow provisioning across different AZs if requested, even when requesting spot instances.
Thanks @midu-git. What do you use then, Cluster Autoscaler? Anyway, why is this issue closed then? Shouldn't it at least be documented as a limitation?
Version
Karpenter: v0.5.6
Kubernetes: v1.21.4
Expected Behavior
Karpenter should place the nodes in multiple subnets
Actual Behavior
Karpenter creates all the nodes in the same subnet even though it was able to discover all the available subnets using the subnet selector.
Resource Specs and Logs
All the nodes were placed in the same subnet and same AZ (us-west-1a).