equinix / terraform-equinix-metal-anthos-on-vsphere

[Deprecated] Automated Anthos Installation via Terraform for Equinix Metal with vSphere
https://registry.terraform.io/modules/equinix/anthos-on-vsphere/metal/latest
Apache License 2.0

anthos 1.5.0-gke.27 errors in terraform apply #108

Closed dfong closed 2 weeks ago

dfong commented 3 years ago

i have been unable to get anthos 1.5.0-gke.27 to pass "terraform apply" without errors.

here are some of the error messages from the log.

        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec):     - [FATAL] Hosts for AntiAffinityGroups: Anti-affinity groups enabled with available
        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Some validation results were FATAL. Check report above.
        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Failed to create root cluster: unable to create node Machine Deployments: creating or updating machine deployment "gke-admin-node" in namespace "default": timed out waiting for the condition
        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): error: stat /home/ubuntu/cluster/kpresubmit-500-kubeconfig: no such file or directory
        null_resource.anthos_deploy_cluster[0] (remote-exec): Error: error executing "/tmp/terraform_522095460.sh": Process exited with status 1

PsychoSid commented 3 years ago

I too have the same problem with 1.5.0

Looking at what I can from the admin workstation logs they show:-

I1021 06:53:15.927719    3118 spinner.go:125] Creating node Machines in internal cluster
I1021 06:53:15.932395    3118 clusterclient.go:886] Waiting for machine deployment "default/gke-admin-node" to to be ready, with retry interval "30s" and timeout "45m0s"
I1021 06:53:15.933959    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: it hasn't yet been seen by controller (observed generation 0 < generation 1)
I1021 06:53:45.935671    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:54:15.935808    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:54:45.935620    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:55:15.935902    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
I1021 06:55:45.935870    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready

The second machine doesn't become ready.
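
For anyone comparing runs, the progress lines above can be summarized mechanically. This is a minimal sketch, assuming the admin workstation log has been saved as text in the same klog-style format shown above; `replica_progress` and `SAMPLE` are hypothetical names for illustration, not part of gkectl.

```python
import re

# Matches klog-style lines such as:
# I1021 06:53:45.935671    3118 clusterclient.go:891] Machine deployment
# "default/gke-admin-node" is not ready: only 0/2 replicas are ready
READY_RE = re.compile(
    r'^I\d{4} (?P<time>\d{2}:\d{2}:\d{2})\.\d+\s+\d+ \S+\] '
    r'Machine deployment "(?P<name>[^"]+)" is not ready: '
    r'only (?P<ready>\d+)/(?P<want>\d+) replicas are ready'
)

def replica_progress(log_text):
    """Return (time, ready, wanted) for every 'only N/M replicas' line."""
    progress = []
    for line in log_text.splitlines():
        match = READY_RE.match(line.strip())
        if match:
            progress.append(
                (match["time"], int(match["ready"]), int(match["want"]))
            )
    return progress

# A few lines copied from the log above.
SAMPLE = '''
I1021 06:53:15.933959    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: it hasn't yet been seen by controller (observed generation 0 < generation 1)
I1021 06:53:45.935671    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:55:15.935902    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
I1021 06:55:45.935870    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
'''

for time, ready, wanted in replica_progress(SAMPLE):
    print(f"{time}: {ready}/{wanted} replicas ready")
```

If the ready count stops changing well before the 45m timeout, as it does here at 1/2, the stuck replica is the one to describe.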

From a describe on the machine objects on the admin cluster I see 1 of the 3 objects not ready:-

ubuntu@admin-workstation:~/cluster$ kubectl --kubeconfig kubeconfig get machine
NAME
gke-admin-master-kn477
gke-admin-node-87d6b48b6-5jcvr
gke-admin-node-87d6b48b6-j294z

Describing the machine that is not ready (gke-admin-node-87d6b48b6-5jcvr) shows:

API Version:  cluster.k8s.io/v1alpha1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-10-21T06:53:20Z
  Finalizers:
    machine.cluster.k8s.io
  Generate Name:  gke-admin-node-87d6b48b6-
  Generation:     1
  Owner References:
    API Version:           cluster.k8s.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  gke-admin-node-87d6b48b6
    UID:                   34d6e535-248d-4a1c-b7f5-d805e5007098
  Resource Version:        104304
  Self Link:               /apis/cluster.k8s.io/v1alpha1/namespaces/default/machines/gke-admin-node-87d6b48b6-5jcvr
  UID:                     a0cbae48-7233-402c-9e19-1b2a9b753658
Spec:
  Anti Affinity Group:  .gke-admin-node-87d6b48b6-4hzxxp
  Metadata:
    Creation Timestamp:  <nil>
  Provider Spec:
    Value:
      API Version:  vsphereproviderconfig.k8s.io/v1alpha1
      Kind:         VsphereMachineProviderConfig
      Machine Variables:
        Datacenter:     Packet
        Datastore:      datastore1
        disk_label:     disk0
        disk_size:      40
        Folder:
        Memory:         16384
        Network:        VM Private Net
        num_cpus:       4
        resource_pool:  Packet-1/Resources/Anthos
        vm_template:    gke-on-prem-ubuntu-1.5.0-gke.27
      Metadata:
        Creation Timestamp:  <nil>
      Network Spec:
        Address:        <nil>
        Dns:            <nil>
        Ntp:
        Use IPAM:       false
      Vsphere Machine:  standard-node
  Versions:
    Kubelet:  1.17.9-gke.4400
Status:
  Failure Domain:  host-10
  Last Updated:    2020-10-21T08:27:29Z
  Phase:           Creating
  Provider Status:
  State:  Unavailable
Events:
  Type    Reason               Age                     From                Message
  ----    ------               ----                    ----                -------
  Normal  Powering on machine  4m7s (x13394 over 93m)  vsphere-controller  Powering on machine gke-admin-node-87d6b48b6-5jcvr

Nodes are:-

ubuntu@admin-workstation:~/cluster$ kubectl --kubeconfig kubeconfig get nodes
NAME                             STATUS   ROLES    AGE    VERSION
gke-admin-master-kn477           Ready    master   108m   v1.17.9-gke.4400
gke-admin-node-87d6b48b6-j294z   Ready    <none>   106m   v1.17.9-gke.4400
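
Comparing the machine list with the node list is what points at the stuck machine. A minimal sketch of that comparison, assuming the plain-text output of `kubectl get machine` and `kubectl get nodes` has been captured as strings; the function names are hypothetical helpers, not kubectl or gkectl features.

```python
def data_rows(table_text):
    """Split kubectl table output into rows of columns, skipping the header."""
    lines = table_text.strip().splitlines()
    return [line.split() for line in lines[1:] if line.strip()]

def ready_nodes(nodes_text):
    """Names of nodes whose STATUS column starts with 'Ready'."""
    return {
        row[0]
        for row in data_rows(nodes_text)
        # STATUS can carry suffixes such as 'Ready,SchedulingDisabled'.
        if len(row) > 1 and row[1].split(",")[0] == "Ready"
    }

def machines_without_ready_node(machines_text, nodes_text):
    """Machines that have no matching Ready node, in listing order."""
    ready = ready_nodes(nodes_text)
    return [row[0] for row in data_rows(machines_text) if row[0] not in ready]

# Output copied from the two kubectl commands above.
MACHINES = """\
NAME
gke-admin-master-kn477
gke-admin-node-87d6b48b6-5jcvr
gke-admin-node-87d6b48b6-j294z
"""

NODES = """\
NAME                             STATUS   ROLES    AGE    VERSION
gke-admin-master-kn477           Ready    master   108m   v1.17.9-gke.4400
gke-admin-node-87d6b48b6-j294z   Ready    <none>   106m   v1.17.9-gke.4400
"""

print(machines_without_ready_node(MACHINES, NODES))
# prints ['gke-admin-node-87d6b48b6-5jcvr']
```

This relies on machine and node names matching, which holds in the output above but is an assumption in general.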

displague commented 3 years ago

We may need to open a new issue to update the installation scripts to follow the new gkectl-based instructions offered in https://cloud.google.com/anthos/gke/docs/on-prem/1.5/how-to/install-landing

dfong commented 3 years ago

@displague, are you able to get "terraform apply" to work with anthos 1.5.0? has anyone gotten it to work?

gfthybridlabs commented 3 years ago

I didn't get 1.5.0 to work. That said, 1.5.1 works OK most of the time. Sometimes it gets stuck, but as I tend to apply/destroy daily I catch transient issues more than most.

dfong commented 3 years ago

@gfthybridlabs, thanks for the tip! i will give 1.5.1-gke.8 a try.

parkitibabu commented 3 years ago

Similar issues were encountered with Anthos GKE on-prem versions 1.4.3-gke.3, 1.5.2-gke.3, and 1.5.1-gke.8. I was unable to bring up the machine nodes, which always failed to create with all of the above versions.

Seeking help here.

dfong commented 3 years ago

@parkitibabu, thanks for sharing your data. did you make any subsequent progress?

dfong commented 3 years ago

i am giving up on anthos 1.5.0-gke.27, which i never got to work.

however, i did get anthos 1.5.1-gke.8 to work, with caveats:

this was with the current latest rev of the repo, b569d4c.