clincha-org / clincha

Configuration and monitoring of clinch-home infrastructure

https://clinch-home.com

Site playbook does not run through successfully on the first run #107

Closed clincha closed 10 months ago

clincha commented 1 year ago

Running the Ansible site.yml playbook on a fresh cluster produces an error. The run should complete successfully on a fresh cluster install.

clincha commented 1 year ago

This Stack Overflow post might be useful for restarting the CoreDNS pods.

clincha commented 11 months ago

Velero sometimes fails to install on the first attempt. I might need to give it some time after the Ceph components finish installing.

clincha commented 11 months ago

I'm getting this error, which is stopping containers from starting up:

Warning FailedCreatePodSandBox 31s (x17 over 4m12s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_node-agent-m2846_velero_e5d05d49-bc80-47f8-8f26-2182c5c332e9_0(d93ada920227f10bfecc65aedbd6bac0e629abb8223e85e254a96c5a3678a86b): error adding pod velero_node-agent-m2846 to CNI network "cbr0": failed to delegate add: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
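One way to confirm the mismatch this error describes is to compare the address currently on cni0 with the subnet flannel has leased to the node. The following is a minimal sketch, not part of the playbook: the helper names are hypothetical, and it assumes flannel writes its lease to /run/flannel/subnet.env (its default location).

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the cni0 / flannel mismatch.

# Read one line of `ip -4 -o addr show dev <dev>` output on stdin and print
# the address in CIDR form (the fourth whitespace-separated field).
addr_from_ip_output() {
    awk '{ print $4; exit }'
}

# Read a flannel subnet.env file on stdin and print the FLANNEL_SUBNET value.
subnet_from_env() {
    sed -n 's/^FLANNEL_SUBNET=//p'
}

# Report whether the bridge address matches the subnet flannel expects.
# Returns non-zero on a mismatch.
compare_bridge() {
    actual=$1
    expected=$2
    if [ "$actual" = "$expected" ]; then
        echo "cni0 matches flannel subnet ($expected)"
    else
        echo "cni0 has $actual but flannel expects $expected"
        return 1
    fi
}

# Usage on a node (assumes the default flannel lease file location):
#   compare_bridge "$(ip -4 -o addr show dev cni0 | addr_from_ip_output)" \
#                  "$(subnet_from_env < /run/flannel/subnet.env)"
```

A mismatch here would be consistent with cni0 having been created before flannel assigned the node its current subnet.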

clincha commented 11 months ago

https://stackoverflow.com/questions/61373366/networkplugin-cni-failed-to-set-up-pod-xxxxx-network-failed-to-set-bridge-add

I ran the following commands on all the nodes and they fixed the issue:

ip link set cni0 down && ip link set flannel.1 down 
ip link delete cni0 && ip link delete flannel.1
systemctl restart podman && systemctl restart kubelet

After making the changes, it looks like the cni0 interface disappears and the flannel.1 interface remains. I guess the issue is that the cni0 interface is a leftover from before flannel was installed. I'm going to try using the nmcli Ansible module to sort out the issue by removing the interfaces and then restarting the services.

clincha commented 11 months ago

nmcli doesn't know about the connections

TASK [flannel : Delete the connections] ****************************************
failed: [bri-master-1] (item=cni0) => {"ansible_loop_var": "item", "changed": false, "item": "cni0", "msg": "Error: unknown connection 'cni0'.\nError: cannot delete unknown connection(s): 'cni0'.\n", "name": "No Connection named cni0 exists", "rc": 10}
failed: [bri-master-1] (item=flannel.1) => {"ansible_loop_var": "item", "changed": false, "item": "flannel.1", "msg": "Error: unknown connection 'flannel.1'.\nError: cannot delete unknown connection(s): 'flannel.1'.\n", "name": "No Connection named flannel.1 exists", "rc": 10}

nmcli shows this output

[kubernetes@bri-master-1 ~]$ nmcli 
eth0: connected to System eth0
        "Red Hat Virtio"
        ethernet (virtio_net), C6:0F:A4:F4:8E:C9, hw, mtu 1500
        ip4 default
        inet4 10.1.2.100/24
        route4 10.1.2.0/24 metric 100
        route4 default via 10.1.2.1 metric 100
        inet6 fe80::c40f:a4ff:fef4:8ec9/64
        route6 fe80::/64 metric 256

cni0: unmanaged
        "cni0"
        bridge, A2:F4:C7:D1:D0:9A, sw, mtu 1500

...

flannel.1: unmanaged
        "flannel.1"
        vxlan, BE:EF:A5:2C:A1:A8, sw, mtu 1450
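The output above shows both interfaces as unmanaged, which is why NetworkManager has no connection profile to delete for them. A small sketch for listing such devices in a script-friendly way follows; the function name is hypothetical, and it relies on nmcli's terse output mode (one DEVICE:STATE pair per line).

```shell
#!/bin/sh
# Read `nmcli -t -f DEVICE,STATE device status` output on stdin and print
# the names of devices that NetworkManager reports as unmanaged.
unmanaged_devices() {
    awk -F: '$2 == "unmanaged" { print $1 }'
}

# Usage on a node:
#   nmcli -t -f DEVICE,STATE device status | unmanaged_devices
```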

clincha commented 11 months ago

I've given it a conn_name but I should be giving it a device instead.

clincha commented 11 months ago

Needs to happen at a lower level...

The nmcli module in Ansible is primarily used for creating, deleting, and managing connection profiles through NetworkManager, the dynamic network configuration service on Linux. Deleting a kernel network interface that NetworkManager does not manage (cni0 and flannel.1 both show as unmanaged in the output above) is outside what the module does, so it cannot be done with the nmcli module.

nmcli is used to control the NetworkManager application and the connections and devices it manages but not to delete a device from the operating system.

Therefore, for such tasks, we resort to using system-level commands like ip link delete, which can be issued through the command or shell modules in Ansible.

If you try to delete a device through the nmcli module, you might get a "not found" error because the module is trying to remove a network connection, not an actual device.
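As a rough sketch of the lower-level approach, the helper below deletes an interface only if it exists, so a repeated run does not fail on a node that has already been cleaned up. The function name is hypothetical and this is not the task that ended up in the playbook; it uses the same ip link commands as the manual fix earlier in the thread and has to run as root.

```shell
#!/bin/sh
# Hypothetical helper: bring a kernel network interface down and delete it,
# but only if it currently exists. Prints what it did.
delete_link_if_present() {
    dev=$1
    if ip link show "$dev" >/dev/null 2>&1; then
        ip link set "$dev" down &&
            ip link delete "$dev" &&
            echo "deleted $dev"
    else
        echo "skipped $dev (not present)"
    fi
}

# Usage on a node, as root (interface names taken from this thread):
#   for dev in cni0 flannel.1; do delete_link_if_present "$dev"; done
#   systemctl restart podman && systemctl restart kubelet
```

If this were wrapped in an Ansible shell task, the printed "deleted" message could be matched in a changed_when condition so the task only reports a change when something was actually removed.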

clincha commented 10 months ago

Welp, it turns out that someone has already thought through all this deploying-Kubernetes-with-Ansible stuff. They also did a much better job than me, and the project is maintained under the kubernetes-sigs organization.

https://github.com/kubernetes-sigs/kubespray

clincha commented 10 months ago

I implemented the kubespray playbook and rejigged some things around. It's working on the first go now!

clincha commented 10 months ago

I need to run a couple of tests before I can merge this:

DNS resolution works
Creating persistent volume claims works
Backups using Velero work
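The three checks above could be scripted roughly as follows. This is only a sketch under stated assumptions: the pod, claim, and backup names are placeholders, the claim is assumed to have been created beforehand, and kubectl wait with a jsonpath condition needs a reasonably recent kubectl.

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and report PASS or FAIL.
check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $name"
    else
        echo "FAIL: $name"
        return 1
    fi
}

# Possible checks for the list above. All resource names are placeholders.
# The exit status of `velero backup create --wait` may not reflect the
# backup's final phase, so `velero backup describe` is worth a look as well.
run_smoke_tests() {
    status=0
    check "DNS resolution" \
        kubectl run dns-check --rm -i --restart=Never --image=busybox:1.28 \
        -- nslookup kubernetes.default || status=1
    check "Persistent volume claim is bound" \
        kubectl wait --for=jsonpath='{.status.phase}'=Bound \
        pvc/smoke-test-claim --timeout=120s || status=1
    check "Velero backup" \
        velero backup create smoke-test --wait || status=1
    return "$status"
}

# Usage, from a machine with cluster access:
#   run_smoke_tests
```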