epic-gateway / dev-test-environment

Scripts and Vagrant/Ansible config to set up a local EPIC Gateway environment for development and test.
https://www.epic-gateway.org/

fatal: [gwclient]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for file gateway_v1a2_gatewayclass-gwdev.yaml"} #1

Open · moyushui opened 3 months ago

moyushui commented 3 months ago

Error when running vagrant up, following the single-node setup at https://www.epic-gateway.org/quick_start/:

TASK [epic : install cert-manager] *****
changed: [gateway]

TASK [epic : Wait for pods to be ready] ****
fatal: [gwclient]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for file gateway_v1a2_gatewayclass-gwdev.yaml"}

PLAY RECAP *****
gwclient : ok=72 changed=61 unreachable=0 failed=1 skipped=2 rescued=0 ignored=0

==> gwclient: Removing domain...
==> gwclient: Deleting the machine folder
==> gwclient: An error occurred. The error will be shown after all tasks complete.
^C==> gateway: Waiting for cleanup before exiting...
[ERROR]: User interrupted execution
==> gateway: Removing domain...
==> gateway: Deleting the machine folder
==> gateway: An error occurred. The error will be shown after all tasks complete.

An error occurred while executing multiple actions in parallel. Any errors that occurred are shown below.

An error occurred while executing the action on the 'gateway' machine. Please handle this error then try again:

Ansible failed to complete successfully. Any error output should be visible above. Please fix these errors and try again.

An error occurred while executing the action on the 'gwclient' machine. Please handle this error then try again:

Ansible failed to complete successfully. Any error output should be visible above. Please fix these errors and try again.

caboteria commented 3 months ago

Sorry about that, let's try some debugging. Just FYI, when you run vagrant up, Vagrant builds two clusters: an EPIC Gateway cluster and a client cluster. It looks like the error is that the client is timing out waiting for the server to finish.

Let's try breaking the process down into two steps: bring up the server first, then the client.

$ vagrant up --no-destroy-on-error gateway

The --no-destroy-on-error flag tells Vagrant not to clean up if anything goes wrong, so we can figure out what did.

If that succeeds then we can bring up the client:

$ vagrant up --no-destroy-on-error gwclient
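
If you want to sanity-check the gateway machine before starting the client, the standard Vagrant commands work here (a small sketch; the kubectl call assumes a successful provision has put kubectl on the PATH inside the VM):

$ vagrant status gateway                       # should report "running"
$ vagrant ssh gateway -c "kubectl get nodes"   # the cluster should answer
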
moyushui commented 3 months ago

Thank you for your answer, but when I try

$ vagrant up --no-destroy-on-error gateway

it ends with

TASK [epic : install cert-manager] *********************************************
changed: [gateway]

TASK [epic : Wait for pods to be ready] ****************************************
fatal: [gateway]: FAILED! => {"changed": true, "cmd": ["kubectl", "wait", "-n", "cert-manager", "--for=condition=Ready", "--timeout=30m", "pods", "--all"], "delta": "0:10:00.047542", "end": "2024-05-16 06:04:47.703681", "msg": "non-zero return code", "rc": 1, "start": "2024-05-16 05:54:47.656139", "stderr": "timed out waiting for the condition on pods/cert-manager-557c8c599f-wdrrs\ntimed out waiting for the condition on pods/cert-manager-cainjector-587fd9f888-jzxht\ntimed out waiting for the condition on pods/cert-manager-webhook-6784457987-jbnbp", "stderr_lines": ["timed out waiting for the condition on pods/cert-manager-557c8c599f-wdrrs", "timed out waiting for the condition on pods/cert-manager-cainjector-587fd9f888-jzxht", "timed out waiting for the condition on pods/cert-manager-webhook-6784457987-jbnbp"], "stdout": "", "stdout_lines": []}

right after TASK [epic : install cert-manager]. I do not know why this happens; I have already changed --timeout to 30m. Should I use a longer timeout?

caboteria commented 3 months ago

EPIC uses cert-manager to issue certificates, so installing cert-manager is part of the installation process. In your case something has gone wrong with the cert-manager installation so the pods aren't reaching a ready state.

Starting from your dev-test-environment directory, you can ssh to the gateway by running:

$ vagrant ssh gateway
Last login: Thu May 16 14:06:58 2024 from 192.168.121.1
vagrant@epic-gateway:~$ 

Once you've done that, you can examine the cert-manager pods to see what's gone wrong:

vagrant@epic-gateway:~$ kubectl get pods --namespace=cert-manager 
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-557c8c599f-qhgl6              1/1     Running   0          10m
cert-manager-cainjector-587fd9f888-rtjr5   1/1     Running   0          10m
cert-manager-webhook-6784457987-p4nqk      1/1     Running   0          10m

That's what you should see, but you'll probably see something different. It will probably be useful to "describe" the cert-manager pod, or any pod that's not Running.

vagrant@epic-gateway:~$ kubectl describe pods --namespace=cert-manager cert-manager-557c8c599f-qhgl6
Name:             cert-manager-557c8c599f-qhgl6
Namespace:        cert-manager    
Priority:         0              
Service Account:  cert-manager         
Node:             epic-gateway/192.168.121.152
Start Time:       Thu, 16 May 2024 14:00:36 +0000                                                                      
Labels:           app=cert-manager                                                                                     
                  app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=cert-manager
                  app.kubernetes.io/name=cert-manager
                  app.kubernetes.io/version=v1.7.1                                                                                                                                                                                            
                  pod-template-hash=557c8c599f                                                                         
Annotations:      prometheus.io/path: /metrics                                                                                                                                                                                                
                  prometheus.io/port: 9402                                                                             
                  prometheus.io/scrape: true                                                                           
Status:           Running
... I snipped a bunch of output, but I'd be curious to see what your system looks like.
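
If describe doesn't make the cause obvious, recent events and the controller logs usually do (standard kubectl commands; "deploy/cert-manager" assumes the default deployment name, which may differ on your system):

vagrant@epic-gateway:~$ kubectl -n cert-manager get events --sort-by=.lastTimestamp
vagrant@epic-gateway:~$ kubectl -n cert-manager logs deploy/cert-manager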

Thanks for your persistence!

moyushui commented 3 months ago

Alright, I found the problem and have fixed it. flannel was not working: cert-manager could not find "/run/flannel/subnet.env", and the kubelet events showed:

Warning DNSConfigForming 6d2h (x74 over 6d4h) kubelet Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 4.2.2.1 4.2.2.2 208.67.220.220
Normal SandboxChanged 9m4s kubelet Pod sandbox changed, it will be killed and re-created.
Warning Failed 9m2s (x2 over 9m3s) kubelet Error: services have not yet been read at least once, cannot construct envvars

I found that kubelet was reading the config file "/run/systemd/resolve/resolv.conf", which contained 4 nameserver lines. I deleted one, restarted flannel and cert-manager, and now the gateway works well and all pods are ready, but gwclient is still not working because:
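
For reference, the recovery amounts to something like this (a sketch; the flannel namespace and DaemonSet name are assumptions that depend on how flannel was installed, so adjust them to match your cluster):

# Trim /run/systemd/resolve/resolv.conf down to at most 3 nameserver
# lines (kubelet's limit), then restart the affected workloads:
$ sudo vi /run/systemd/resolve/resolv.conf
$ kubectl -n kube-flannel rollout restart daemonset kube-flannel-ds
$ kubectl -n cert-manager rollout restart deployment
$ kubectl -n cert-manager get pods -w    # watch until all pods are Ready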