ChameleonCloud / chi-in-a-box

Packaging the systems and operations of the Chameleon testbed
Apache License 2.0

k3s playbook not working #225

Open samiemostafavi opened 1 year ago

samiemostafavi commented 1 year ago

Hi,

I am trying to run chi-in-a-box on Ubuntu Server 20.04. I have only one controller node, and I run cc-ansible locally. I use the following site-config/default.yml file (for the first pass):

openstack_region_name: CHI@ExPECA
chameleon_site_name: expeca
kolla_internal_vip_address: "10.20.111.254"
kolla_external_vip_address: "10.0.87.254"
neutron_networks:
- name: public
  bridge_name: br-ex
  external_interface: veth-publicb
  cidr: 10.0.87.0/24
  gateway_ip: 10.0.87.1
  allocation_pools:
    - start: 10.0.87.20
      end: 10.0.87.250
- name: physnet1
  bridge_name: br-internal
  external_interface: veth-privateb
  on_demand_vlan_ranges:
    - 200:250
  reservable_vlan_ranges:
    - 251:300
ironic_provisioning_network_vlan: 200
ironic_provisioning_network_cidr: 10.51.0.0/24
enable_k3s: no
enable_zun: yes

As you mention in the Edge v2 implementation PR, I have to:

  1. Run a full deploy with enable_k3s set to false.
  2. Set enable_k3s to true.
  3. Run cc-ansible --playbook ./playbooks/k3s.yaml
  4. Run cc-ansible deploy again.
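In shell form, that sequence is roughly the following (a sketch only; it assumes cc-ansible is invoked from the chi-in-a-box checkout, as in the commands later in this thread):

./cc-ansible deploy                            # 1. full deploy, enable_k3s: no
# 2. set enable_k3s: yes in site-config/default.yml
./cc-ansible --playbook ./playbooks/k3s.yaml   # 3. run the k3s playbook
./cc-ansible deploy                            # 4. full deploy again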

In the first step I get an error:

TASK [zun : Copying over kubeconfig for k8s agent] 
No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found

Since it is the first pass, I make Ansible ignore this error and continue (by explicitly setting ignore_errors: yes on this task).
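For reference, the workaround amounts to something like this; the task name comes from the error above, but the module, paths, and arguments here are hypothetical placeholders, not the actual role code (the error message also suggests errors='ignore' on the lookup as an alternative):

- name: Copying over kubeconfig for k8s agent
  # module and arguments are placeholders; the real task lives in the zun role
  copy:
    src: "{{ lookup('first_found', ['/etc/kolla/config/zun/kubeconfig.yml']) }}"
    dest: /etc/kolla/zun-compute/kubeconfig.yml
  ignore_errors: yes   # skip the failure on the first pass, before k3s exists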

Then I set enable_k3s: yes in site-config/default.yml and run cc-ansible --playbook ./playbooks/k3s.yaml. I get an error indicating that the Python kubernetes library is not installed:

TASK [k3s : Apply Calico operator]
Failed to import the required Python library (kubernetes) on client's Python /etc/ansible/venv/bin/python

Is there anything I am missing?

Best, Samie

msherman64 commented 1 year ago

Hey, I probably missed a step when putting this together; good notes on fixing the "with_first_found" issue.

You should be able to get unstuck by running:

/etc/ansible/venv/bin/pip install kubernetes

samiemostafavi commented 1 year ago

I tried that, but then in roles/k3s/tasks/config-calico.yml the task Apply Calico global network policies fails with the following error:

kubernetes.core.k8s: 'ansible_managed' is undefined

I think something is wrong with the Ansible environment.

msherman64 commented 1 year ago

The branch in the linked PR should get you further.

samiemostafavi commented 1 year ago

Thanks for the effort; however, it is unable to create the Neutron network due to an authentication issue:

TASK [k3s : Create calico network] *************************************************************************************************************************************************************************
fatal: [edge]: FAILED! => {"action": "os_network", "changed": false, "extra_data": {"data": null, "details": "Running without keystone AuthN requires that tenant_id is specified", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Running without keystone AuthN requires that tenant_id is specified\", \"detail\": \"\"}}"}, "msg": "BadRequestException: 400: Client Error for url: http://10.0.87.254:9696/v2.0/networks, Running without keystone AuthN requires that tenant_id is specified"}

samiemostafavi commented 1 year ago

Hi,

After cleaning up the prior state, I gave this PR another try. It goes much further but stops at the following task:

k3s : Wait till the Tigera Operator has fully applied
fatal: [edge -> localhost]: FAILED! => {"changed": false, "msg": "Failed to gather information about TigeraStatus(s) even after waiting for 123 seconds"}

Any idea?

msherman64 commented 1 year ago

The installation of Calico can take a while, and it may have encountered some errors. Take a look at the Calico debugging guide: https://projectcalico.docs.tigera.io/maintenance/troubleshoot/commands
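A few starting points from that guide, assuming the Tigera operator CRDs were applied by the playbook:

kubectl get tigerastatus                                      # overall Calico install progress
kubectl get pods -n calico-system -o wide                     # per-pod state
kubectl logs -n tigera-operator deployment/tigera-operator    # operator logs
kubectl describe pod -n calico-system -l k8s-app=calico-node  # events for calico-node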

samiemostafavi commented 1 year ago

Hey,

It seems the Calico installation fails. Here are the logs I could collect:

$ kubectl get pods -A -o wide
NAMESPACE         NAME                                       READY   STATUS             RESTARTS        AGE     IP            NODE      NOMINATED NODE   READINESS GATES
kube-system       local-path-provisioner-64ffb68fd-g6gwf     1/1     Running            0               9m50s   12.48.124.2   edge-mv   <none>           <none>
kube-system       metrics-server-9cf544f65-vxvhl             1/1     Running            0               9m50s   12.48.124.4   edge-mv   <none>           <none>
calico-system     calico-node-9qvtk                          0/1     Running            0               6m13s   192.168.9.3   edge-mv   <none>           <none>
tigera-operator   tigera-operator-6f669b6c4f-nfp59           0/1     CrashLoopBackOff   5 (2m14s ago)   6m28s   192.168.9.3   edge-mv   <none>           <none>
calico-system     calico-typha-645f95cb48-hxpxw              0/1     CrashLoopBackOff   5 (2m1s ago)    6m13s   192.168.9.3   edge-mv   <none>           <none>
kube-system       coredns-85cb69466-wxj5j                    0/1     CrashLoopBackOff   5 (72s ago)     9m50s   12.48.124.5   edge-mv   <none>           <none>
calico-system     calico-kube-controllers-77cf47555c-jwdwr   0/1     CrashLoopBackOff   5 (57s ago)     6m13s   12.48.124.3   edge-mv   <none>           <none>
calico-system     csi-node-driver-rs9cs                      0/2     CrashLoopBackOff   10 (57s ago)    5m45s   12.48.124.1   edge-mv   <none>           <none>
$ kubectl logs -n tigera-operator tigera-operator-6f669b6c4f-nfp59
2022/10/07 15:34:09 [INFO] Version: v1.28.1
2022/10/07 15:34:09 [INFO] Go Version: go1.17.9b7
2022/10/07 15:34:09 [INFO] Go OS/Arch: linux/amd64
2022/10/07 15:34:10 [INFO] Active operator: proceeding
{"level":"info","ts":1665156850.8377635,"logger":"setup","msg":"Checking type of cluster","provider":""}
{"level":"info","ts":1665156850.8396735,"logger":"setup","msg":"Checking if PodSecurityPolicies are supported by the cluster","supported":true}
{"level":"info","ts":1665156850.8414762,"logger":"setup","msg":"Checking if TSEE controllers are required","required":false}
{"level":"info","ts":1665156850.9486334,"logger":"typha_autoscaler","msg":"Starting typha autoscaler","syncPeriod":10}
{"level":"info","ts":1665156850.9487607,"logger":"setup","msg":"starting manager"}
I1007 15:34:10.949466       1 leaderelection.go:248] attempting to acquire leader lease tigera-operator/operator-lock...
$ kubectl logs -n calico-system     csi-node-driver-rs9cs
error: a container name must be specified for pod csi-node-driver-rs9cs, choose one of: [calico-csi csi-node-driver-registrar]
$ kubectl logs -n calico-system     calico-kube-controllers-77cf47555c-jwdwr
2022-10-07 15:35:21.948 [INFO][1] main.go 103: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W1007 15:35:21.953519       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2022-10-07 15:35:21.953 [INFO][1] main.go 127: Ensuring Calico datastore is initialized
2022-10-07 15:35:21.971 [INFO][1] main.go 153: Calico datastore is initialized
2022-10-07 15:35:21.972 [INFO][1] main.go 190: Getting initial config snapshot from datastore
2022-10-07 15:35:21.995 [INFO][1] main.go 193: Got initial config snapshot
2022-10-07 15:35:21.995 [INFO][1] watchersyncer.go 89: Start called
2022-10-07 15:35:21.996 [INFO][1] main.go 207: Starting status report routine
2022-10-07 15:35:21.996 [INFO][1] main.go 216: Starting Prometheus metrics server on port 9094
2022-10-07 15:35:21.996 [INFO][1] main.go 493: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc00060a468), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc00071ca80), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc00060a450), objectType:(*v1.Pod)(0xc000516800), resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x300a280), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil), transform:(cache.TransformFunc)(nil)}
2022-10-07 15:35:21.996 [INFO][1] main.go 493: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc00060a4b0), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc00071caf0), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc00060a498), objectType:(*v1.Node)(0xc000219800), resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x300a280), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil), transform:(cache.TransformFunc)(nil)}
2022-10-07 15:35:21.996 [INFO][1] watchersyncer.go 130: Sending status update Status=wait-for-ready
2022-10-07 15:35:21.996 [INFO][1] syncer.go 86: Node controller syncer status updated: wait-for-ready
2022-10-07 15:35:21.996 [INFO][1] watchersyncer.go 149: Starting main event processing loop
2022-10-07 15:35:21.996 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2022-10-07 15:35:21.996 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-10-07 15:35:21.996 [INFO][1] main.go 499: Starting controller ControllerType="Node"
2022-10-07 15:35:21.997 [INFO][1] controller.go 193: Starting Node controller
I1007 15:35:21.997683       1 shared_informer.go:255] Waiting for caches to sync for nodes
2022-10-07 15:35:21.997 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2022-10-07 15:35:21.997 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2022-10-07 15:35:22.003 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2022-10-07 15:35:22.003 [INFO][1] watchersyncer.go 130: Sending status update Status=resync
2022-10-07 15:35:22.004 [INFO][1] syncer.go 86: Node controller syncer status updated: resync
2022-10-07 15:35:22.004 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2022-10-07 15:35:22.004 [WARNING][1] hostendpoints.go 96: Unexpected kind received over syncer: ClusterInformation(default)
2022-10-07 15:35:22.007 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2022-10-07 15:35:22.007 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2022-10-07 15:35:22.007 [WARNING][1] hostendpoints.go 96: Unexpected kind received over syncer: IPPool(default-ipv4-ippool)
2022-10-07 15:35:22.010 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2022-10-07 15:35:22.010 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2022-10-07 15:35:22.010 [INFO][1] resources.go 350: Main client watcher loop
2022-10-07 15:35:22.011 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-10-07 15:35:22.012 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2022-10-07 15:35:22.012 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2022-10-07 15:35:22.012 [INFO][1] watchersyncer.go 130: Sending status update Status=in-sync
2022-10-07 15:35:22.012 [INFO][1] syncer.go 86: Node controller syncer status updated: in-sync
2022-10-07 15:35:22.021 [INFO][1] hostendpoints.go 177: successfully synced all hostendpoints
I1007 15:35:22.097977       1 shared_informer.go:262] Caches are synced for nodes
I1007 15:35:22.098020       1 shared_informer.go:255] Waiting for caches to sync for pods
I1007 15:35:22.098050       1 shared_informer.go:262] Caches are synced for pods
2022-10-07 15:35:22.098 [INFO][1] ipam.go 253: Will run periodic IPAM sync every 7m30s
$ kubectl logs -n calico-system calico-node-9qvtk
...
2022-10-07 15:43:07.866 [INFO][16995] confd/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-10-07 15:43:07.867 [INFO][16995] confd/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-10-07 15:43:07.867 [INFO][16995] confd/config.go 60: Found FELIX_TYPHAKEYFILE=/node-certs/tls.key
2022-10-07 15:43:07.867 [INFO][16995] confd/config.go 60: Found FELIX_TYPHACERTFILE=/node-certs/tls.crt
2022-10-07 15:43:07.867 [INFO][16995] confd/config.go 60: Found FELIX_TYPHACAFILE=/etc/pki/tls/certs/tigera-ca-bundle.crt
2022-10-07 15:43:07.867 [INFO][16995] confd/config.go 60: Found FELIX_TYPHACN=typha-server
2022-10-07 15:43:07.867 [INFO][16995] confd/config.go 82: Skipping confd config file.
2022-10-07 15:43:07.867 [INFO][16995] confd/run.go 18: Starting calico-confd
2022-10-07 15:43:07.868 [INFO][16994] status-reporter/startup.go 425: Early log level set to info
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhak8sservicename"="calico-typha"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhacafile"="/etc/pki/tls/certs/tigera-ca-bundle.crt"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "wireguardmtu"="1400"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "vxlanmtu"="1400"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "healthenabled"="true"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhacertfile"="/node-certs/tls.crt"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhacn"="typha-server"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhakeyfile"="/node-certs/tls.key"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "defaultendpointtohostaction"="ACCEPT"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "ipinipmtu"="1400"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhak8snamespace"="calico-system"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "healthport"="9099"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "ipv6support"="false"
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/config_params.go 435: Merging in config from environment variable: map[defaultendpointtohostaction:ACCEPT healthenabled:true healthport:9099 ipinipmtu:1400 ipv6support:false typhacafile:/etc/pki/tls/certs/tigera-ca-bundle.crt typhacertfile:/node-certs/tls.crt typhacn:typha-server typhak8snamespace:calico-system typhak8sservicename:calico-typha typhakeyfile:/node-certs/tls.key vxlanmtu:1400 wireguardmtu:1400]
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaK8sServiceName: calico-typha (from environment variable)
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaK8sServiceName: calico-typha (from environment variable)
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaCAFile: /etc/pki/tls/certs/tigera-ca-bundle.crt (from environment variable)
2022-10-07 15:43:07.872 [INFO][16996] tunnel-ip-allocator/param_types.go 305: Looking for required file path="/etc/pki/tls/certs/tigera-ca-bundle.crt"
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaCAFile: /etc/pki/tls/certs/tigera-ca-bundle.crt (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaK8sNamespace: calico-system (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaK8sNamespace: calico-system (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for Ipv6Support: false (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for Ipv6Support: false (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaCertFile: /node-certs/tls.crt (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/param_types.go 305: Looking for required file path="/node-certs/tls.crt"
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaCertFile: /node-certs/tls.crt (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for WireguardMTU: 1400 (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for WireguardMTU: 1400 (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for VXLANMTU: 1400 (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for VXLANMTU: 1400 (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaCN: typha-server (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaCN: typha-server (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaKeyFile: /node-certs/tls.key (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/param_types.go 305: Looking for required file path="/node-certs/tls.key"
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaKeyFile: /node-certs/tls.key (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for HealthPort: 9099 (from environment variable)
2022-10-07 15:43:07.873 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for HealthPort: 9099 (from environment variable)
2022-10-07 15:43:07.874 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for HealthEnabled: true (from environment variable)
2022-10-07 15:43:07.874 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for HealthEnabled: true (from environment variable)
2022-10-07 15:43:07.874 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for DefaultEndpointToHostAction: ACCEPT (from environment variable)
2022-10-07 15:43:07.874 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for DefaultEndpointToHostAction: ACCEPT (from environment variable)
2022-10-07 15:43:07.874 [INFO][16996] tunnel-ip-allocator/config_params.go 542: Parsing value for IpInIpMtu: 1400 (from environment variable)
2022-10-07 15:43:07.874 [INFO][16996] tunnel-ip-allocator/config_params.go 578: Parsed value for IpInIpMtu: 1400 (from environment variable)
2022-10-07 15:43:07.880 [INFO][16994] status-reporter/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-10-07 15:43:07.880 [INFO][16994] status-reporter/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-10-07 15:43:07.880 [INFO][16994] status-reporter/config.go 60: Found FELIX_TYPHAKEYFILE=/node-certs/tls.key
2022-10-07 15:43:07.880 [INFO][16994] status-reporter/config.go 60: Found FELIX_TYPHACERTFILE=/node-certs/tls.crt
2022-10-07 15:43:07.880 [INFO][16994] status-reporter/config.go 60: Found FELIX_TYPHACAFILE=/etc/pki/tls/certs/tigera-ca-bundle.crt
2022-10-07 15:43:07.880 [INFO][16994] status-reporter/config.go 60: Found FELIX_TYPHACN=typha-server
2022-10-07 15:43:07.881 [INFO][16996] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-10-07 15:43:07.881 [INFO][16996] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-10-07 15:43:07.881 [INFO][16996] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHAKEYFILE=/node-certs/tls.key
2022-10-07 15:43:07.881 [INFO][16996] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHACERTFILE=/node-certs/tls.crt
2022-10-07 15:43:07.881 [INFO][16996] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHACAFILE=/etc/pki/tls/certs/tigera-ca-bundle.crt
2022-10-07 15:43:07.881 [INFO][16996] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHACN=typha-server
W1007 15:43:08.141232   16995 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2022-10-07 15:43:08.142 [INFO][16995] confd/client.go 1419: Advertise global service ranges from this node
2022-10-07 15:43:08.142 [INFO][16995] confd/client.go 1364: Updated with new cluster IP CIDRs: []
2022-10-07 15:43:08.143 [INFO][16995] confd/client.go 1419: Advertise global service ranges from this node
2022-10-07 15:43:08.143 [INFO][16995] confd/client.go 1355: Updated with new external IP CIDRs: []
2022-10-07 15:43:08.143 [INFO][16995] confd/client.go 1419: Advertise global service ranges from this node
2022-10-07 15:43:08.143 [INFO][16995] confd/client.go 1374: Updated with new Loadbalancer IP CIDRs: []
2022-10-07 15:43:08.161 [ERROR][16996] tunnel-ip-allocator/discovery.go 182: Didn't find any ready Typha instances.
2022-10-07 15:43:08.162 [FATAL][16996] tunnel-ip-allocator/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-10-07 15:43:08.164 [ERROR][16994] status-reporter/discovery.go 182: Didn't find any ready Typha instances.
2022-10-07 15:43:08.165 [FATAL][16994] status-reporter/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-10-07 15:43:08.188 [ERROR][16995] confd/discovery.go 182: Didn't find any ready Typha instances.
2022-10-07 15:43:08.188 [FATAL][16995] confd/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-10-07 15:43:08.300 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhak8sservicename"="calico-typha"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhacafile"="/etc/pki/tls/certs/tigera-ca-bundle.crt"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "wireguardmtu"="1400"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "vxlanmtu"="1400"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "healthenabled"="true"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhacertfile"="/node-certs/tls.crt"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhacn"="typha-server"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhakeyfile"="/node-certs/tls.key"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "defaultendpointtohostaction"="ACCEPT"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "ipinipmtu"="1400"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "typhak8snamespace"="calico-system"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "healthport"="9099"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/env_var_loader.go 40: Found felix environment variable: "ipv6support"="false"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 435: Merging in config from environment variable: map[defaultendpointtohostaction:ACCEPT healthenabled:true healthport:9099 ipinipmtu:1400 ipv6support:false typhacafile:/etc/pki/tls/certs/tigera-ca-bundle.crt typhacertfile:/node-certs/tls.crt typhacn:typha-server typhak8snamespace:calico-system typhak8sservicename:calico-typha typhakeyfile:/node-certs/tls.key vxlanmtu:1400 wireguardmtu:1400]
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for WireguardMTU: 1400 (from environment variable)
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for WireguardMTU: 1400 (from environment variable)
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for HealthEnabled: true (from environment variable)
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for HealthEnabled: true (from environment variable)
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaKeyFile: /node-certs/tls.key (from environment variable)
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/param_types.go 305: Looking for required file path="/node-certs/tls.key"
2022-10-07 15:43:08.301 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaKeyFile: /node-certs/tls.key (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for DefaultEndpointToHostAction: ACCEPT (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for DefaultEndpointToHostAction: ACCEPT (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaCAFile: /etc/pki/tls/certs/tigera-ca-bundle.crt (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/param_types.go 305: Looking for required file path="/etc/pki/tls/certs/tigera-ca-bundle.crt"
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaCAFile: /etc/pki/tls/certs/tigera-ca-bundle.crt (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for VXLANMTU: 1400 (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for VXLANMTU: 1400 (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaCertFile: /node-certs/tls.crt (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/param_types.go 305: Looking for required file path="/node-certs/tls.crt"
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaCertFile: /node-certs/tls.crt (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for HealthPort: 9099 (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for HealthPort: 9099 (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaK8sNamespace: calico-system (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaK8sNamespace: calico-system (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for IpInIpMtu: 1400 (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for IpInIpMtu: 1400 (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaK8sServiceName: calico-typha (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaK8sServiceName: calico-typha (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for Ipv6Support: false (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for Ipv6Support: false (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 542: Parsing value for TyphaCN: typha-server (from environment variable)
2022-10-07 15:43:08.302 [INFO][17027] tunnel-ip-allocator/config_params.go 578: Parsed value for TyphaCN: typha-server (from environment variable)
2022-10-07 15:43:08.304 [INFO][17027] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-10-07 15:43:08.304 [INFO][17027] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-10-07 15:43:08.304 [INFO][17027] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHAKEYFILE=/node-certs/tls.key
2022-10-07 15:43:08.304 [INFO][17027] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHACERTFILE=/node-certs/tls.crt
2022-10-07 15:43:08.304 [INFO][17027] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHACAFILE=/etc/pki/tls/certs/tigera-ca-bundle.crt
2022-10-07 15:43:08.304 [INFO][17027] tunnel-ip-allocator/config.go 60: Found FELIX_TYPHACN=typha-server
2022-10-07 15:43:08.312 [ERROR][17027] tunnel-ip-allocator/discovery.go 182: Didn't find any ready Typha instances.
2022-10-07 15:43:08.312 [FATAL][17027] tunnel-ip-allocator/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-10-07 15:43:08.313 [WARNING][16609] felix/health.go 211: Reporter is not ready. name="felix-startup"
2022-10-07 15:43:08.313 [WARNING][16609] felix/health.go 211: Reporter is not ready. name="int_dataplane"
2022-10-07 15:43:08.314 [WARNING][16609] felix/health.go 173: Health: not ready
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 60: Found FELIX_TYPHAKEYFILE=/node-certs/tls.key
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 60: Found FELIX_TYPHACERTFILE=/node-certs/tls.crt
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 60: Found FELIX_TYPHACAFILE=/etc/pki/tls/certs/tigera-ca-bundle.crt
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 60: Found FELIX_TYPHACN=typha-server
2022-10-07 15:43:08.318 [INFO][17031] confd/config.go 82: Skipping confd config file.
2022-10-07 15:43:08.318 [INFO][17031] confd/run.go 18: Starting calico-confd
2022-10-07 15:43:08.325 [INFO][17028] status-reporter/startup.go 425: Early log level set to info
2022-10-07 15:43:08.326 [INFO][17028] status-reporter/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-10-07 15:43:08.327 [INFO][17028] status-reporter/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-10-07 15:43:08.327 [INFO][17028] status-reporter/config.go 60: Found FELIX_TYPHAKEYFILE=/node-certs/tls.key
2022-10-07 15:43:08.327 [INFO][17028] status-reporter/config.go 60: Found FELIX_TYPHACERTFILE=/node-certs/tls.crt
2022-10-07 15:43:08.327 [INFO][17028] status-reporter/config.go 60: Found FELIX_TYPHACAFILE=/etc/pki/tls/certs/tigera-ca-bundle.crt
2022-10-07 15:43:08.327 [INFO][17028] status-reporter/config.go 60: Found FELIX_TYPHACN=typha-server
W1007 15:43:08.330501   17031 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2022-10-07 15:43:08.330 [INFO][17031] confd/client.go 1419: Advertise global service ranges from this node
2022-10-07 15:43:08.330 [INFO][17031] confd/client.go 1364: Updated with new cluster IP CIDRs: []
2022-10-07 15:43:08.330 [INFO][17031] confd/client.go 1419: Advertise global service ranges from this node
2022-10-07 15:43:08.330 [INFO][17031] confd/client.go 1355: Updated with new external IP CIDRs: []
2022-10-07 15:43:08.331 [INFO][17031] confd/client.go 1419: Advertise global service ranges from this node
2022-10-07 15:43:08.331 [INFO][17031] confd/client.go 1374: Updated with new Loadbalancer IP CIDRs: []
2022-10-07 15:43:08.335 [ERROR][17031] confd/discovery.go 182: Didn't find any ready Typha instances.
2022-10-07 15:43:08.335 [FATAL][17031] confd/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-10-07 15:43:08.335 [ERROR][17028] status-reporter/discovery.go 182: Didn't find any ready Typha instances.
2022-10-07 15:43:08.335 [FATAL][17028] status-reporter/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-10-07 15:43:08.820 [INFO][16609] felix/sync_client.go 151: Syncer started addresses=[]discovery.Typha{discovery.Typha{Addr:"192.168.9.3:5473", IP:"192.168.9.3", NodeName:(*string)(0xc000841fd0)}} connID=0x0 type=""
2022-10-07 15:43:08.822 [INFO][16609] felix/sync_client.go 155: connecting to typha endpoint 192.168.9.3:5473 (1 of 1) connID=0x0 type=""
2022-10-07 15:43:08.822 [INFO][16609] felix/sync_client.go 214: Starting Typha client
2022-10-07 15:43:08.822 [INFO][16609] felix/sync_client.go 72:  requiringTLS=true
2022-10-07 15:43:08.823 [INFO][16609] felix/tlsutils.go 39: Make certificate verifier requiredCN="typha-server" requiredURISAN="" roots=&x509.CertPool{byName:map[string][]int{"0!1\x1f0\x1d\x06\x03U\x04\x03\x13\x16tigera-operator-signer":[]int{0}}, lazyCerts:[]x509.lazyCert{x509.lazyCert{rawSubject:[]uint8{0x30, 0x21, 0x31, 0x1f, 0x30, 0x1d, 0x6, 0x3, 0x55, 0x4, 0x3, 0x13, 0x16, 0x74, 0x69, 0x67, 0x65, 0x72, 0x61, 0x2d, 0x6f, 0x70, 0x65, 0x72, 0x61, 0x74, 0x6f, 0x72, 0x2d, 0x73, 0x69, 0x67, 0x6e, 0x65, 0x72}, getCert:(func() (*x509.Certificate, error))(0x71ffc0)}}, haveSum:map[x509.sum224]bool{x509.sum224{0xdd, 0x5f, 0x3e, 0xfd, 0x9d, 0x1c, 0xbf, 0xe7, 0xb4, 0x41, 0x77, 0x9, 0x24, 0x43, 0x7e, 0x72, 0xa1, 0x86, 0x12, 0x53, 0xc0, 0x25, 0xa6, 0xba, 0x69, 0xe8, 0x1a, 0x59}:true}, systemPool:false}
2022-10-07 15:43:08.823 [INFO][16609] felix/sync_client.go 266: Connecting to Typha. address=discovery.Typha{Addr:"192.168.9.3:5473", IP:"192.168.9.3", NodeName:(*string)(0xc000841fd0)} connID=0x0 type=""
2022-10-07 15:43:08.824 [WARNING][16609] felix/sync_client.go 158: error connecting to typha endpoint (1 of 1) 192.168.9.3:5473 connID=0x0 error=dial tcp 192.168.9.3:5473: connect: connection refused type=""
...

I cannot really see what's going on, but I am suspicious of the DNS pod, which is not in READY state. Also, I changed the cluster CIDR since it conflicted with my SSH IP subnet. Can you tell what is wrong?

Best, Samie

samiemostafavi commented 1 year ago

Hi,

I managed to resolve the issues, and now all the pods are running and healthy:

kubectl get nodes -o wide
NAME       STATUS   ROLES                  AGE     VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
edge-vm1   Ready    control-plane,master   4h19m   v1.22.5+k3s1   10.10.2.31    10.0.87.20    Ubuntu 20.04.5 LTS   5.4.0-128-generic   containerd://1.5.8-k3s1
edge-vm2   Ready    <none>                 69m     v1.24.6+k3s1   10.10.2.32    <none>        Ubuntu 20.04.5 LTS   5.4.0-125-generic   containerd://1.6.8-k3s1
kubectl get pods -A -o wide
NAMESPACE          NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
tigera-operator    tigera-operator-6f669b6c4f-jqh44           1/1     Running   0          4h15m   10.10.2.31       edge-vm1   <none>           <none>
calico-system      calico-typha-699c7b566f-dkxf9              1/1     Running   0          4h15m   10.10.2.31       edge-vm1   <none>           <none>
calico-system      calico-node-fjcdl                          1/1     Running   0          4h15m   10.10.2.31       edge-vm1   <none>           <none>
kube-system        coredns-85cb69466-nlkx5                    1/1     Running   0          4h19m   192.168.126.81   edge-vm1   <none>           <none>
kube-system        local-path-provisioner-64ffb68fd-qc4sc     1/1     Running   0          4h19m   192.168.126.82   edge-vm1   <none>           <none>
calico-system      calico-kube-controllers-77cf47555c-cwsvw   1/1     Running   0          4h15m   192.168.126.84   edge-vm1   <none>           <none>
kube-system        metrics-server-9cf544f65-2dphg             1/1     Running   0          4h19m   192.168.126.85   edge-vm1   <none>           <none>
calico-system      csi-node-driver-ddfd8                      2/2     Running   0          4h14m   192.168.126.83   edge-vm1   <none>           <none>
calico-apiserver   calico-apiserver-5c854576cc-vfh7p          1/1     Running   0          4h14m   192.168.126.87   edge-vm1   <none>           <none>
calico-apiserver   calico-apiserver-5c854576cc-2jjsm          1/1     Running   0          4h14m   192.168.126.86   edge-vm1   <none>           <none>
calico-system      calico-node-l2rf5                          1/1     Running   0          69m     10.10.2.32       edge-vm2   <none>           <none>
calico-system      csi-node-driver-2gvrl                      2/2     Running   0          69m     192.168.119.1    edge-vm2   <none>           <none>

I added a worker node, as you can see, with allow_without_reservation = True for Zun.

However, when I choose a container from Docker Hub in Horizon to create (e.g. ubuntu:latest), the container ends up in ERROR state, and the reason is There are not enough hosts available.

Could you clarify how to proceed when the k3s cluster and zun_compute_k8s are healthy?

Best, Samie

msherman64 commented 1 year ago

Great! What was the issue with the cluster health?

To get a container launched: we're still working on the Blazar integration.

You can either deploy with enable_blazar: false, or ensure that this section is present in your node_custom_config/zun.conf:

[scheduler]
available_filters = zun.scheduler.filters.all_filters
enabled_filters = ComputeFilter,RuntimeFilter

Once we get the steps in place to add your worker nodes to Blazar, the "allow_without_reservation" flag changes the behavior of the BlazarFilter scheduler for Zun.

samiemostafavi commented 1 year ago

Well, there was a set of steps I had to take until it came up clean. I have documented them here: https://kth-expeca.gitbook.io/testbedconfig/deployment/chi/controller-k3s

Apart from that, I added the new scheduler config, but no luck.

I dug deeper into the Zun logs and found this:

[req-0b710c21-e650-47e3-8d0b-c0be5432c77d - - - - -] Error during Manager.inventory_host: zun.common.exception.ComputeHostNotFound: Compute host edge-vm1-k8s could not be found.
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/zun/compute/compute_node_tracker.py", line 112, in _get_node_rp_uuid
    self.rp_uuid = self.reportclient.get_provider_by_name(
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/zun/scheduler/client/report.py", line 2135, in get_provider_by_name
    raise exception.ResourceProviderNotFound(name_or_uuid=name)
zun.common.exception.ResourceProviderNotFound: No such resource provider edge-vm1-k8s.

It seems that Zun is looking for edge-vm1-k8s as the worker node. My master node, where I run OpenStack, is edge-vm1. Do you know what a ResourceProvider is? Should it refer to the k3s master or the worker?
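For anyone debugging the same thing: judging from the traceback, a ResourceProvider is a record in the Placement service that the Zun scheduler matches against, and the registered providers can be listed directly (this assumes the osc-placement client plugin is installed):

pip install osc-placement          # Placement CLI plugin for the openstack client
openstack resource provider list   # shows which provider names actually exist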

Best, Samie

samiemostafavi commented 1 year ago

Hi,

Finally it worked.

The problem was that I enabled Nova, which automatically sets host_shared_with_nova = true in the Zun config. So I added the following line to my node_custom_config/zun.conf:

host_shared_with_nova = false

Then I did another ./cc-ansible deploy --tags zun, and there were no more errors in the Zun logs. I could successfully run the first ubuntu:latest container on my worker node.
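For completeness, the relevant parts of node_custom_config/zun.conf after both changes from this thread would look roughly like this (placing host_shared_with_nova under [compute] is my assumption; verify the section against the Zun configuration reference for your release):

[compute]
# section placement assumed, see note above
host_shared_with_nova = false

[scheduler]
available_filters = zun.scheduler.filters.all_filters
enabled_filters = ComputeFilter,RuntimeFilter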

Thanks a lot for your help!

samiemostafavi commented 1 year ago

> Great! What was the issue with the cluster health?
>
> To get a container launched: we're still working on the Blazar integration.
>
> You can either deploy with enable_blazar: false, or ensure that this section is present in your node_custom_config/zun.conf:
>
> [scheduler]
> available_filters = zun.scheduler.filters.all_filters
> enabled_filters = ComputeFilter,RuntimeFilter
>
> Once we get the steps in place to add your worker nodes to Blazar, the "allow_without_reservation" flag changes the behavior of the BlazarFilter scheduler for Zun.

We really need this feature. If there is any clue on where to start, I can give it a try. Assuming we create the worker node in Blazar and the user reserves it, can Zun find the worker using the reservation_id, the way it is done in edge v1?

msherman64 commented 1 year ago

Take a look at the source here: https://github.com/ChameleonCloud/doni/blob/chameleoncloud/xena/doni/driver/worker/blazar/device.py

Using our fork of the blazar client (https://chameleoncloud.readthedocs.io/en/latest/technical/cli.html#openstack-client-installation):

pip install git+https://github.com/chameleoncloud/python-blazarclient@chameleoncloud/xena

You should be able to run openstack reservation device create. Blazar/Zun/k8s should match on the worker node's hostname.
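In sketch form (the exact flags of the create subcommand are not shown in this thread, so list them first):

pip install git+https://github.com/chameleoncloud/python-blazarclient@chameleoncloud/xena
openstack reservation device create --help   # inspect the supported options
# whatever device name is used should match the k3s worker node's hostname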

samiemostafavi commented 1 year ago

Hi,

I would like to connect the k3s containers to the public network and assign floating IPs to them. I realized that the k3s playbook creates a neutron-calico-connect.sh script but does not run it. Do I have to run it manually on the controller? If yes, do I need to change anything inside? For example, I see two parameters hardcoded there:

host_addr="192.168.150.1/24"
ns_addr="192.168.150.2/24"

Are these supposed to be internal VIP addresses? How can I change them so this works for my case?

Best, Samie