civo-learn / civo-gpu-operator-tf

talos cluster never reaches Active state #2

Open NerdyShawn opened 1 month ago

NerdyShawn commented 1 month ago

terraform apply times out when using the default L40s node in LON1. The node appears healthy (🟢), but the v1.27 Talos cluster itself never becomes available, so the kubeconfig cannot be pulled to troubleshoot further.

civo_kubernetes_cluster.cluster: Still creating... [59m51s elapsed]
civo_kubernetes_cluster.cluster: Still creating... [1h0m1s elapsed]

Error: error waiting for cluster (ca9f0a53-12b8-42da-b3be-3ce049d2faef) to be created: timeout while waiting for state to become 'ACTIVE' (last state: 'BUILDING', timeout: 1h0m0s)

  on civo-cluster.tf line 1, in resource "civo_kubernetes_cluster" "cluster":
   1: resource "civo_kubernetes_cluster" "cluster" {

View of the cluster state:

# the provisioned talos cluster never reaches a ready state
civo k8s ls --region LON1
+--------------------------------------+-------------------+--------------+-------+-------+-------------------------------------+
| ID                                   | Name              | Cluster-Type | Nodes | Pools | Conditions                          |
+--------------------------------------+-------------------+--------------+-------+-------+-------------------------------------+
| ca9f0a53-12b8-42da-b3be-3ce049d2faef | gpu_operator_civo | talos        |     1 |     1 | All Workers Up: False               |
|                                      |                   |              |       |       | Cluster On Desired Version: Unknown |
|                                      |                   |              |       |       | Control Plane Accessible: Unknown   |
|                                      |                   |              |       |       |                                     |
+--------------------------------------+-------------------+--------------+-------+-------+-------------------------------------+

Unable to pull the kubeconfig:

civo k8s config gpu_operator_civo --save --region LON1
Please check if you are using the latest version of CLI and retry the command 
If you are still facing issues, please report it on our community slack or open a GitHub issue (https://github.com/civo/cli/issues) 
Error: The cluster isn't ready yet, so the KUBECONFIG isn't available.

The node looks like it's ready, but the control plane never reaches a ready status.

NerdyShawn commented 1 month ago

The cluster itself is still building, which also prevents terraform destroy from completing, so the resources have to be cleaned up manually.

Cluster stuck in BUILDING:

civo k8s show gpu_operator_civo --region LON1
          ID : ca9f0a53-12b8-42da-b3be-3ce049d2faef
        Name : gpu_operator_civo
 ClusterType : talos
      Region : LON1
       Nodes : 1
        Size : an.g1.l40s.kube.x1
      Status : BUILDING
    Firewall : gpu_operator_civo-firewall
     Version : 1.27.0
API Endpoint : https://74.220.19.126:6443
 External IP : 74.220.19.126
DNS A record : ca9f0a53-12b8-42da-b3be-3ce049d2faef.k8s.civo.com

Conditions:
+---------------------------------------+---------+
| Message                               | Status  |
+---------------------------------------+---------+
| Worker nodes from all pools are ready | False   |
+---------------------------------------+---------+
| Cluster is on desired version         | Unknown |
+---------------------------------------+---------+
| Control Plane is accessible           | Unknown |
+---------------------------------------+---------+

Pool (17b25f):
+-----------------------------------------------+----+--------+--------------------+-----------+----------+---------------+
| Name                                          | IP | Status | Size               | Cpu Cores | RAM (MB) | SSD disk (GB) |
+-----------------------------------------------+----+--------+--------------------+-----------+----------+---------------+
| gpu-operator-civo-439b-567483-pool-3286-83hg9 |    | ACTIVE | an.g1.l40s.kube.x1 |        12 |   131072 |           200 |
+-----------------------------------------------+----+--------+--------------------+-----------+----------+---------------+

Labels:
kubernetes.civo.com/node-pool=17b25fcb-1116-415a-bf25-64b9734073b5
kubernetes.civo.com/node-size=an.g1.l40s.kube.x1
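
For anyone reproducing this, a small shell helper can pull the Status field out of the civo k8s show output above, so a script can watch for the cluster leaving BUILDING instead of re-running the command by hand. This is only a sketch: the function name cluster_status is hypothetical, and it assumes the output keeps the "Status : VALUE" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the civo CLI): read `civo k8s show`
# output on stdin and print the value of the top-level "Status :" line,
# e.g. BUILDING or ACTIVE. Assumes the "Status : VALUE" layout shown above.
cluster_status() {
  awk -F' : ' '/^[ \t]*Status :/ { print $2; exit }'
}

# Offline check using a line copied from the output above.
printf '      Status : BUILDING\n' | cluster_status
```

With the CLI available, it could be fed the same command used above, for example civo k8s show gpu_operator_civo --region LON1 | cluster_status, and wrapped in a loop with a sleep between checks. The pattern only matches the top-level field, so the "Status" column header in the pool table is ignored.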