Panfactum / stack

The Panfactum Stack
https://panfactum.com

[Bug]: several modules do not deploy/run due to arm64 taint #73

Closed mschnee closed 3 months ago

mschnee commented 4 months ago

Prior Search

What happened?

Several modules fail to deploy or schedule pods due to an arm64 taint. The earliest and most noticeable failure in the bootstrapping documentation is the cilium connectivity test command, which fails to start because its pods do not schedule.

Other modules that fail include:

resource "aws_eks_node_group" "controllers" {
  ...
  taint {
    effect = "NO_SCHEDULE"
    key    = "arm64"
    value  = "true"
  }
}

We've been able to work around this by manually removing the taint from one of the on-demand nodes, but karpenter-provisioned nodes also start up with this taint.

Steps to Reproduce

Start a net new cluster with bootstrap_mode_enabled = true. The three nodes in the cluster are labeled beta.kubernetes.io/arch=arm64 and carry the arm64 taint, so the cilium test pods cannot be scheduled on them.

Version

main (development branch)

Relevant log output

0/3 nodes are available: 3 node(s) had untolerated taint {arm64: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

uptownhr commented 4 months ago

I am also running into this right now.

I tried setting bootstrapping_mode_enabled to false, but terragrunt apply didn't detect a change. I'm a new user working through the setup guide, and I'm blocked here.

Also, just adding that my cluster only has 2 nodes vs 3.

uptownhr commented 4 months ago

I don't know how taints work, but maybe the tolerations created by the connectivity test don't match the node taints?

Taints:             arm64=true:NoSchedule
                    burstable=true:NoSchedule
                    spot=true:NoSchedule

uptownhr commented 4 months ago

In my case I think the issue is that I was using bootstrapping_mode_enabled instead of bootstrap_mode_enabled. I will open a PR to fix the guide: https://panfactum.com/docs/edge/guides/bootstrapping/kubernetes-cluster#enable-bootstrapping-mode

mschnee commented 4 months ago

I decided that instead of forging ahead and assuming the networking would work, I should scrap the cluster and restart. Now the tests run, but I have test failures (though these may be related to my bespoke VPC).

It may also be worth updating the documentation with something along the lines of "if the pods don't schedule, something has gone terribly wrong and you should start again". So, no longer a bug.

uptownhr commented 4 months ago

@mschnee that is surprising, because I believe your suspicion about the taint is correct. When I modified the cilium client to include the toleration for arm64, the pod did start up.

Given my test results here, I wonder how the cilium clients are being scheduled for you? Can you share the taints from the node and the tolerations from your cilium client?

I also tried manually removing the taints from the node and running the test. The test pods are now running and I'm awaiting results.

uptownhr commented 4 months ago

Reporting back that tests are successful after

  1. utilizing bootstrap_mode_enabled
  2. manually removing the arm64 taint from the nodes

fullykubed commented 4 months ago

In addition to the typo for bootstrap_mode_enabled, the core issue is that the last release changed the EKS cluster to run arm64 nodes (as they are cheaper). Because not every utility is guaranteed to be arm64-compatible, we add the arm64 taint to those nodes.

All the Panfactum IaC modules have the appropriate arm64 tolerations. However, the manifests deployed by cilium connectivity test do not. Additionally, since these run before karpenter is deployed, no amd64 nodes are provisioned.
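
For readers unfamiliar with taints, the scheduling behavior described above can be modeled with a minimal Python sketch. It is not the Kubernetes scheduler or any Panfactum code; it only approximates the standard toleration matching rules (key, operator, value, effect). The node names and the helper names (`tolerates`, `schedulable`, `count_available`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" tolerates every taint
    operator: str = "Equal"  # "Equal" (the default) or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(taints, tolerations) -> bool:
    """A pod fits a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


ARM64 = Taint("arm64", "true", "NoSchedule")

# Hypothetical bootstrap cluster: three nodes, each carrying the arm64 taint.
nodes = {f"node-{i}": [ARM64] for i in range(3)}

# The connectivity test pods carry no arm64 toleration.
test_pod = []
# A pod with the toleration, as the Panfactum modules are described to have.
module_pod = [Toleration(key="arm64", operator="Equal", value="true", effect="NoSchedule")]


def count_available(tolerations) -> int:
    """Count the nodes on which a pod with these tolerations could schedule."""
    return sum(schedulable(taints, tolerations) for taints in nodes.values())


print(count_available(test_pod))    # 0, matching "0/3 nodes are available"
print(count_available(module_pod))  # 3
```

Under the same rules, a node that also carries burstable=true:NoSchedule and spot=true:NoSchedule (as in the node output posted earlier) would need a matching toleration for each of those taints as well.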

As a result, we will revert the change so that EKS uses amd64 nodes when bootstrap_mode_enabled is true and arm64 otherwise. That should resolve this issue.
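
The decision described above can be sketched in Python. The actual change would live in the Terraform module, which is not shown in this thread. The function name and return shape are hypothetical, and the sketch assumes the amd64 bootstrap nodes carry no arm64 taint.

```python
def bootstrap_node_group(bootstrap_mode_enabled: bool) -> dict:
    """Hypothetical model of the node group settings after the revert."""
    if bootstrap_mode_enabled:
        # amd64 nodes so that manifests without arm64 tolerations can schedule.
        # Assumption: no arm64 taint is applied in this mode.
        return {"arch": "amd64", "taints": []}
    # Outside bootstrap mode: cheaper arm64 nodes, guarded by the arm64 taint.
    return {
        "arch": "arm64",
        "taints": [{"key": "arm64", "value": "true", "effect": "NO_SCHEDULE"}],
    }


print(bootstrap_node_group(True)["arch"])   # amd64
print(bootstrap_node_group(False)["arch"])  # arm64
```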

fullykubed commented 3 months ago

Resolved.