cmluciano opened this issue 2 years ago
I was able to confirm this problem with the latest main after ~2h of waiting on `null_resource.deploy_anthos_cluster`. The script was stuck looping on the following error, found in `/root/baremetal/cluster_create.log`, the log of the `/root/bootstrap/create_cluster.sh` execution:

```
Waiting for cluster to become ready: Internal error occurred: failed calling webhook "vvmruntime.kb.io": failed to call webhook: Post "https://vmruntime-webhook-service.vm-system.svc:443/validate-vm-cluster-gke-io-v1-vmruntime?timeout=10s": dial tcp 172.31.79.252:443: connect: connection refused
```
```
root@eqnx-metal-gke-g5klw-cp-01:~# kubectl get ValidatingWebhookConfiguration -A
NAME                                                    WEBHOOKS   AGE
capi-validating-webhook-configuration                   5          91m
cert-manager-webhook                                    1          92m
clientconfig-admission-webhook                          1          92m
clusterdns-webhook                                      1          92m
net-attach-def-admission-controller-validating-config   1          92m
validating-webhook-configuration                        10         91m
validation-webhook.snapshot.storage.k8s.io              1          92m
vmruntime-validating-webhook-configuration              1          92m
```
```
root@eqnx-metal-gke-g5klw-cp-01:~# kubectl get all -n vm-system
NAME                                                READY   STATUS    RESTARTS   AGE
pod/vmruntime-controller-manager-67775946fb-f9rlt   0/2     Pending   0          93m

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/vmruntime-webhook-service   ClusterIP   172.31.79.252   <none>        443/TCP   93m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vmruntime-controller-manager   0/1     1            0           93m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/vmruntime-controller-manager-67775946fb   1         1         0       93m
```
The pod is failing to start with:

```
Warning  FailedScheduling  3m27s (x90 over 93m)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
```
Thanks for taking a look @displague. Do you think it would be sufficient to just patch `vmruntime-controller-manager` to allow scheduling on nodes that have this taint, or does it need to be up and responding for some of the BGP setup?
@cmluciano Yes, I think patching should be sufficient to get things started. I'm not sure about the dependencies of vmruntime-controller-manager, but this sounds like a good thing to try first.
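As a sketch of that workaround, the deployment's pod template could be given a toleration like the one below. The key and effect are taken from the `FailedScheduling` event above; using `operator: Exists` (rather than matching the exact value) is an assumption to keep the patch simple:

```yaml
# Hypothetical toleration for the vmruntime-controller-manager
# deployment's pod template (spec.template.spec.tolerations):
tolerations:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Exists
    effect: NoSchedule
```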
I wonder if the upstream Anthos project might consider adding a toleration for `node.cloudprovider.kubernetes.io/uninitialized: true` to `vmruntime-controller-manager`. Thoughts, @c0dyhi11?
In this case, the vmruntime-controller-manager is awaiting the CloudProvider taint to be cleared, which will not happen until after vmruntime-controller-manager succeeds. 🐔 🥚 .
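One hedged way to break that cycle manually is a strategic-merge patch on the deployment. The deployment and namespace names come from the `kubectl get all -n vm-system` output above; the patch shape is a sketch, not the project's documented procedure:

```shell
# Strategic-merge patch adding a toleration for the uninitialized taint.
# Deployment/namespace names are from the kubectl output above; verify
# against your cluster before applying.
PATCH='{"spec":{"template":{"spec":{"tolerations":[{"key":"node.cloudprovider.kubernetes.io/uninitialized","operator":"Exists","effect":"NoSchedule"}]}}}}'

# Sanity-check that the patch is well-formed JSON before touching the cluster:
echo "$PATCH" | python3 -m json.tool > /dev/null && echo "patch is valid JSON"

# Apply it (requires kubectl access to the affected cluster):
# kubectl -n vm-system patch deployment vmruntime-controller-manager -p "$PATCH"
```

Once the pod schedules and the webhook starts answering, the cloud provider should clear the taint and the cluster create loop can proceed.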
I am trying to test out Anthos 1.11.2 so that I can leverage some newer features that take advantage of Equinix Metal's SR-IOV support in the bare metal hardware. My preference is to use `centos_8` as the node OS, and I patched the script to get past some errors I was having (I can send the patch in a PR), but the issue appears to happen on the default `ubuntu_20_04` release as well, so it doesn't appear to be OS-related.
terraform.tfvars
I get up to the `null_resource.kube_vip_install_first_cp` step and it never completes. I've even let it run overnight, and it still hadn't completed after 15 hours. Since my values are fairly standard except for the newer Anthos version, I presume the issue is likely a change in the BGP peering that perhaps hasn't been accounted for.