[Bug]: Bootstrap: several steps fail due to pod antiAffinity rules

mschnee commented 6 months ago

Prior Search

[X] I have already searched this project's issues to determine if a bug report has already been made.

What happened?

In the latest edge release edge.24-05-23, antiAffinity rules were added to "ensure that pods in the same deployment are not scheduled on the same instance type (not just the same instance) in order to prevent disruption caused by spot instance scale-in."

Unfortunately, this breaks the bootstrapping guide: many services fail to apply because the desired number of pods cannot be scheduled. Affected so far:

cilium-operator
core-dns
vault
cert-manager
(still working through the bootstrapping guide)

I recommend making this configurable instead, potentially at the region level, so that the ideal topology and affinity rules can be set once the cluster is bootstrapped.

The scheduler reports:

0/3 nodes are available: 1 node(s) didn't match pod topology spread constraints, 2 node(s) didn't match pod anti-affinity rules.

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: panfactum.com/class
            operator: In
            values:
            - controller
        weight: 100
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            id: vault-09a3b99a8451a460
        topologyKey: node.kubernetes.io/instance-type
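
For comparison, here is a sketch of a soft version of the same rule, which is one way the constraint could be relaxed during bootstrapping. It assumes the same `id` label and topology key as the block above; the weight of 100 is an arbitrary illustrative value, and this is not necessarily what the stack should ship. A required anti-affinity term keyed on `node.kubernetes.io/instance-type` needs one distinct instance type per replica, while the preferred form below still tries to spread replicas across instance types but lets them schedule when the cluster has too few distinct types.

```yaml
# Illustrative only: soft pod anti-affinity across instance types.
# Label value and topology key are copied from the required rule above;
# the weight is an assumed value, not taken from the Panfactum modules.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            id: vault-09a3b99a8451a460
        topologyKey: node.kubernetes.io/instance-type
```

The trade-off is that a preferred rule gives no hard guarantee, so two replicas can land on the same instance type and be affected together by a spot scale-in.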

Steps to Reproduce

Follow the bootstrapping guide on a net-new VPC & Cluster.

Version

main (development branch)

Relevant log output

0/3 nodes are available: 1 node(s) didn't match pod topology spread constraints, 2 node(s) didn't match pod anti-affinity rules.

The affinity in question: https://github.com/Panfactum/stack/compare/edge.24-05-15...edge.24-05-23#diff-399e98d14072c9446c6e0ad873ab113356418a363f13830321a05be500ebdbbcR95

An example of its usage: https://github.com/Panfactum/stack/compare/edge.24-05-15...edge.24-05-23#diff-1c228ec7df95544f30340762e30922fdc8b0ac227e3f13c1ff1e01b29905ac61R343

mschnee commented 6 months ago

Maybe unrelated, but there's an "unsatisfiable topology constraint for pod anti-affinity" error affecting Karpenter (while attempting to scale up cert-manager/cert-manager per the docs):

{
  "level": "ERROR",
  "time": "2024-05-29T20:50:13.266Z",
  "logger": "controller.provisioner",
  "message": "Could not schedule pod, incompatible with nodepool \"burstable\", daemonset overhead={\"cpu\":\"435m\",\"memory\":\"958815207\",\"pods\":\"5\"}, unsatisfiable topology constraint for pod anti-affinity, key=node.kubernetes.io/instance-type (counts = r5d.xlarge: 1 c7a.xlarge: 1 and 3 other(s), podDomains = node.kubernetes.io/instance-type Exists, nodeDomains = node.kubernetes.io/instance-type Exists; incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"435m\",\"memory\":\"958815207\",\"pods\":\"5\"}, unsatisfiable topology constraint for pod anti-affinity, key=node.kubernetes.io/instance-type (counts = r6idn.xlarge: 1 c6g.2xlarge: 1 and 6 other(s), podDomains = node.kubernetes.io/instance-type Exists, nodeDomains = node.kubernetes.io/instance-type Exists; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"435m\",\"memory\":\"958815207\",\"pods\":\"5\"}, unsatisfiable topology constraint for pod anti-affinity, key=node.kubernetes.io/instance-type (counts = c6gn.medium: 1 c3.2xlarge: 1 and 4 other(s), podDomains = node.kubernetes.io/instance-type Exists, nodeDomains = node.kubernetes.io/instance-type Exists",
  "commit": "8b2d1d7",
  "pod": "core-dns/core-dns-576668f9c9-ps75b"
}

{
  "level": "ERROR",
  "time": "2024-05-29T20:50:13.273Z",
  "logger": "controller.provisioner",
  "message": "creating node claim, NodeClaim.karpenter.sh \"burstable-lqnfn\" is invalid: [spec.requirements: Too many: 33: must have at most 30 items, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]",
  "commit": "8b2d1d7"
}

mschnee commented 6 months ago

Fixed in 2d2dd57c3445cf234a9519317eefcea011ff74bb

Panfactum / stack