equinix / terraform-equinix-metal-anthos-on-baremetal

Terraform module for quick deployment of baremetal Anthos on Equinix Metal
https://registry.terraform.io/modules/equinix/anthos-on-baremetal
Apache License 2.0

Unable to progress past BGP peering step on Anthos 1.11.2 #84

Open cmluciano opened 2 years ago

cmluciano commented 2 years ago

I am trying to test out Anthos 1.11.2 so that I can leverage some newer features that take advantage of Equinix Metal's SRIOV support in the baremetal hardware. My preference is to use centos_8 as the backend, and I patched the script to get past some errors I was having (I can send the patch in a PR), but the issue appears to happen on the default ubuntu_20_04 release as well, so it doesn't appear to be OS related.

terraform.tfvars

// this should be your personal token, not the project token
metal_auth_token = "sanitized"
metal_organization_id = "sanitized"
metal_project_id = "sanitized"
// don't create a new project, use an existing
metal_create_project = false
gcp_project_id = "sanitized"
cluster_name = "anthos-metal-1"
// 1.11.X is necessary to get the latest multi-nic pieces for sriov
anthos_version = "1.11.2"
// ideally we want rhel_7 here but saw a couple bugs for rhel
// operating_system = "rhel_8"
// operating_system = "centos_8"
operating_system = "ubuntu_20_04"
facility = "dc13"

I get through the null_resource.kube_vip_install_first_cp step, but the subsequent null_resource.deploy_anthos_cluster never completes. I've even let it run overnight, and it still hadn't finished after 15 hours.

null_resource.kube_vip_install_first_cp (remote-exec): /root/bootstrap/vip.yaml FOUND!
null_resource.kube_vip_install_first_cp (remote-exec): BGP peering initiated! Cluster should be completed in about 5 minutes.
null_resource.kube_vip_install_first_cp: Creation complete after 9m23s [id=7216402651719392522]
***
***
***
null_resource.deploy_anthos_cluster: Still creating... [15h22m26s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m36s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m46s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m56s elapsed]
^CStopping operation...
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
╷
│ Error: execution halted
│ 
│ Error: remote-exec provisioner error
│ 
│   with null_resource.deploy_anthos_cluster,
│   on main.tf line 239, in resource "null_resource" "deploy_anthos_cluster":
│  239:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_925104650.sh": wait: remote command exited without exit status or exit signal
╵

Since my values are fairly standard apart from the newer Anthos version, I presume the issue is a change in the BGP peering behavior that hasn't been accounted for.
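
In case it helps anyone reproduce, this is how I've been watching the hang from the first control plane node while the provisioner spins (the node address placeholder is mine; the log path is an assumption based on the module's bootstrap scripts):

# SSH to the first control plane node created by the module
ssh root@<first-cp-node-ip>

# Follow the cluster creation log written by /root/bootstrap/create_cluster.sh
tail -f /root/baremetal/cluster_create.log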

displague commented 2 years ago

I was able to confirm this problem with the latest main after ~2h of waiting on null_resource.deploy_anthos_cluster.

displague commented 2 years ago

The script was stuck looping on the following error, found in /root/baremetal/cluster_create.log (the log of the /root/bootstrap/create_cluster.sh run):

Waiting for cluster to become ready: Internal error occurred: failed calling webhook "vvmruntime.kb.io": failed to call webhook: Post "https://vmruntime-webhook-service.vm-system.svc:443/validate-vm-cluster-gke-io-v1-vmruntime?timeout=10s": dial tcp 172.31.79.252:443: connect: connection refused
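
That connection refused means the webhook Service has no ready endpoints behind it. A quick way to confirm (plain kubectl; the Service name comes from the error above):

# An empty ENDPOINTS column confirms no ready pod is backing the webhook
kubectl -n vm-system get endpoints vmruntime-webhook-service

# Then work backwards to the deployment that should supply that pod
kubectl -n vm-system get deployment vmruntime-controller-manager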

root@eqnx-metal-gke-g5klw-cp-01:~# kubectl  get ValidatingWebhookConfiguration -A
NAME                                                    WEBHOOKS   AGE
capi-validating-webhook-configuration                   5          91m
cert-manager-webhook                                    1          92m
clientconfig-admission-webhook                          1          92m
clusterdns-webhook                                      1          92m
net-attach-def-admission-controller-validating-config   1          92m
validating-webhook-configuration                        10         91m
validation-webhook.snapshot.storage.k8s.io              1          92m
vmruntime-validating-webhook-configuration              1          92m
root@eqnx-metal-gke-g5klw-cp-01:~# kubectl  get all  -n vm-system
NAME                                                READY   STATUS    RESTARTS   AGE
pod/vmruntime-controller-manager-67775946fb-f9rlt   0/2     Pending   0          93m

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/vmruntime-webhook-service   ClusterIP   172.31.79.252   <none>        443/TCP   93m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vmruntime-controller-manager   0/1     1            0           93m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/vmruntime-controller-manager-67775946fb   1         1         0       93m

The pod is failing to start with:

Warning FailedScheduling 3m27s (x90 over 93m) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
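
To confirm the taint is what's blocking scheduling, the node itself can be inspected (node name is from this deployment; yours will differ):

# Show the taints on the lone control plane node
kubectl describe node eqnx-metal-gke-g5klw-cp-01 | grep -A 3 Taints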

cmluciano commented 2 years ago

Thanks for taking a look @displague. Do you think it might be sufficient to just patch vmruntime-controller-manager to allow scheduling on nodes that have this taint, or does it need to be responding for some BGP-related step?

displague commented 2 years ago

@cmluciano Yes, I think patching should be sufficient to get things started. I'm not sure about the dependencies of vmruntime-controller-manager, but this sounds like a good thing to try first.
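
Something along these lines is what I'd try (an untested sketch, assuming the usual NoSchedule effect for that taint; note that a merge patch replaces any existing tolerations list, and an Anthos lifecycle controller may reconcile the change away):

# Let the webhook pod schedule despite the uninitialized taint
kubectl -n vm-system patch deployment vmruntime-controller-manager --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"node.cloudprovider.kubernetes.io/uninitialized","operator":"Exists","effect":"NoSchedule"}]}}}}'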

I wonder if the upstream Anthos project might consider adding a toleration for node.cloudprovider.kubernetes.io/uninitialized: true to vmruntime-controller-manager. Thoughts, @c0dyhi11?

In this case, vmruntime-controller-manager is waiting for the cloud provider taint to be cleared, which will not happen until after vmruntime-controller-manager itself succeeds. 🐔 🥚
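
If the deployment patch gets reconciled away, the other way to break the cycle would be to clear the taint by hand, which is what the cloud provider integration would eventually do on a healthy node:

# Trailing '-' removes every taint with this key, whatever its effect
kubectl taint nodes --all node.cloudprovider.kubernetes.io/uninitialized-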