alta3 / kubernetes-the-alta3-way

The greatest k8s installer on the planet!

teardown.yaml playbook fails #2

Closed · sgriffith3 closed this issue 4 years ago

sgriffith3 commented 4 years ago

Chad and I just ran the teardown playbook on two separate k8s environments. His was untainted (he had only run the install prior to running the teardown), and mine had been used to build deployments, pods, CRDs, etc.

His worked perfectly fine, as expected.

My attempt to run the teardown resulted in this error, for only one of my nodes (under the remove directories for the kubernetes nodes section):

failed: [node3] (item=/var/lib/kubelet) => {"ansible_loop_var": "item", "changed": false, "item": "/var/lib/kubelet", "msg": "rmtree failed: [Errno 16] Device or resource busy: 'default-token-8qqbt'"}  

As far as I am aware, 'default-token-8qqbt' is just the default service account's token secret, which gets mounted into each pod. (Of course, the rest of the teardown worked, so I cannot verify what information it held from inside the cluster.)

I next attempted to ssh to node3 and manually remove the directory /var/lib/kubelet.

student@k8s-1946-node-03:/var/lib$ sudo rm kubelet/ -r
rm: cannot remove 'kubelet/pods/a8d62734-c9d5-4c68-ac26-4a1ec0d1ab51/volumes/kubernetes.io~secret/default-token-8qqbt': Device or resource busy
rm: cannot remove 'kubelet/pods/72ab2116-0cc1-4a32-ac48-175d8cb0e85f/volumes/kubernetes.io~secret/default-token-8qqbt': Device or resource busy
rm: cannot remove 'kubelet/pods/fa7429fb-22fa-42c9-89cd-3d2ef5356c25/volumes/kubernetes.io~secret/default-token-9z7r5': Device or resource busy
rm: cannot remove 'kubelet/pods/9d24e38e-a023-4fa2-a79f-9aa5d58027c9/volumes/kubernetes.io~secret/default-token-8qqbt': Device or resource busy

I then verified that the kubelet was not running, and then disabled it. I tried to remove the directory again and was still getting the same error.

I was going to stop there. But I didn't.

After some googling, I saw many people saying that this usually means something is still mounted. Running mount | grep pods, I noticed the following:

tmpfs on /var/lib/kubelet/pods/72ab2116-0cc1-4a32-ac48-175d8cb0e85f/volumes/kubernetes.io~secret/default-token-8qqbt type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/fa7429fb-22fa-42c9-89cd-3d2ef5356c25/volumes/kubernetes.io~secret/default-token-9z7r5 type tmpfs (rw,relatime)

Therefore, I went ahead and unmounted each of them (there were actually 4), with sudo umount /var/lib/kubelet/pods/....

Then, I was able to remove the /var/lib/kubelet dir, and everything was coming up roses.

I ran the playbook again, and it just worked.

@bryfry I am not super familiar with mounting/umounting things. Is this something that we could add into the playbook with a wildcard?

- name: unmount pesky tmpfs
  # wildcards need shell expansion (the command module won't expand them),
  # and the tmpfs mounts sit a few levels below the pod dir
  shell: 'umount /var/lib/kubelet/pods/*/volumes/kubernetes.io~secret/*'
  become: yes

?
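If a single wildcard turns out to be too blunt, another option (just a sketch, not something I have tested against the playbook) would be to register whatever is still mounted under /var/lib/kubelet and unmount each mount point individually before the rmtree; the task names and the kubelet_pod_mounts variable below are only illustrative:

- name: find leftover tmpfs mounts under /var/lib/kubelet
  # mount prints "tmpfs on <mountpoint> type tmpfs (...)", so field 3 is the path;
  # grep exits non-zero when nothing is mounted, hence failed_when: false
  shell: "mount | grep /var/lib/kubelet/pods/ | awk '{print $3}'"
  register: kubelet_pod_mounts
  changed_when: false
  failed_when: false
  become: yes

- name: unmount each leftover pod volume
  command: "umount {{ item }}"
  loop: "{{ kubelet_pod_mounts.stdout_lines }}"
  become: yes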

sgriffith3 commented 4 years ago

I have attempted a fix for this here: https://github.com/alta3/kubernetes-the-alta3-way/blob/77b7527b0838d49c95222cde9c8e76e7ba4f95be/teardown.yml#L92

It has been working for me on basic clusters. It still needs to be tested after a full 10+ labs to ensure that it works.

sgriffith3 commented 4 years ago

The teardown playbook is working on version 1.17, but NOT on 1.15!!! Closing this issue.

sgriffith3 commented 4 years ago

This playbook failed for Chad during his last k8s teach, as well as after I ran the conformance testing this week. It is not production ready. This needs to be investigated further.

sgriffith3 commented 4 years ago

My initial run through gave me these errors:

TASK [Remove calico service] ***************************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubectl", "delete", "-f", "/home/student/k8s-config/calico-etcd.yaml"], "delta": "0:00:00.730632", "end": "2020-04-17 15:56:02.053304", "msg": "non-zero return code",
 "rc": 1, "start": "2020-04-17 15:56:01.322672", "stderr": "Error from server (NotFound): error when deleting \"/home/student/k8s-config/calico-etcd.yaml\": daemonsets.apps \"calico-node\" not found", "stderr_lines": ["Error
 from server (NotFound): error when deleting \"/home/student/k8s-config/calico-etcd.yaml\": daemonsets.apps \"calico-node\" not found"], "stdout": "secret \"calico-etcd-secrets\" deleted\nconfigmap \"calico-config\" deleted\
nclusterrole.rbac.authorization.k8s.io \"calico-kube-controllers\" deleted\nclusterrolebinding.rbac.authorization.k8s.io \"calico-kube-controllers\" deleted\nclusterrole.rbac.authorization.k8s.io \"calico-node\" deleted\nclu
sterrolebinding.rbac.authorization.k8s.io \"calico-node\" deleted\nserviceaccount \"calico-node\" deleted\ndeployment.apps \"calico-kube-controllers\" deleted\nserviceaccount \"calico-kube-controllers\" deleted", "stdout_lin
es": ["secret \"calico-etcd-secrets\" deleted", "configmap \"calico-config\" deleted", "clusterrole.rbac.authorization.k8s.io \"calico-kube-controllers\" deleted", "clusterrolebinding.rbac.authorization.k8s.io \"calico-kube-
controllers\" deleted", "clusterrole.rbac.authorization.k8s.io \"calico-node\" deleted", "clusterrolebinding.rbac.authorization.k8s.io \"calico-node\" deleted", "serviceaccount \"calico-node\" deleted", "deployment.apps \"ca
lico-kube-controllers\" deleted", "serviceaccount \"calico-kube-controllers\" deleted"]}                                                                                                                                        
...ignoring                                                                                                                                                                                                                     

TASK [Remove kube-dns service] *************************************************************************************************************************************************************************************************
changed: [localhost]                                                                                                                                                                                                            

TASK [Remove all service resources] ********************************************************************************************************************************************************************************************
changed: [localhost]                                                                                                                                                                                                            

TASK [Drain all nodes] *********************************************************************************************************************************************************************************************************
failed: [localhost] (item=k8s-1885-node-01) => {"ansible_loop_var": "item", "changed": true, "cmd": ["kubectl", "drain", "k8s-1885-node-01"], "delta": "0:00:00.913761", "end": "2020-04-17 15:56:19.150400", "item": "k8s-1885-
node-01", "msg": "non-zero return code", "rc": 1, "start": "2020-04-17 15:56:18.236639", "stderr": "error: unable to drain node \"k8s-1885-node-01\", aborting command...\n\nThere are pending nodes to be drained:\n k8s-1885-n
ode-01\nerror: cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): test/client02", "stderr_lines": ["error: unable to drain node \"k8s-1885-node-01\",
 aborting command...", "", "There are pending nodes to be drained:", " k8s-1885-node-01", "error: cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): 
test/client02"], "stdout": "node/k8s-1885-node-01 cordoned", "stdout_lines": ["node/k8s-1885-node-01 cordoned"]}                                                                                                                
changed: [localhost] => (item=k8s-1885-node-02)                                                                                                                                                                                 
failed: [localhost] (item=k8s-1885-node-03) => {"ansible_loop_var": "item", "changed": true, "cmd": ["kubectl", "drain", "k8s-1885-node-03"], "delta": "0:00:01.062244", "end": "2020-04-17 15:56:21.927044", "item": "k8s-1885-
node-03", "msg": "non-zero return code", "rc": 1, "start": "2020-04-17 15:56:20.864800", "stderr": "error: unable to drain node \"k8s-1885-node-03\", aborting command...\n\nThere are pending nodes to be drained:\n k8s-1885-n
ode-03\nerror: cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): test/nginx", "stderr_lines": ["error: unable to drain node \"k8s-1885-node-03\", ab
orting command...", "", "There are pending nodes to be drained:", " k8s-1885-node-03", "error: cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): tes
t/nginx"], "stdout": "node/k8s-1885-node-03 cordoned", "stdout_lines": ["node/k8s-1885-node-03 cordoned"]}                                                                                                                      
...ignoring                                                                                                                                                                                                                     

TASK [Remove kubelet from bchd] ************************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Could not find the requested service kubelet: host"}                                                                                                                  
...ignoring                                                                                                        

I tried running through again with the same result.

I then read the errors and decided to use the --force option when draining the nodes, so that even the pods not managed by a controller would be evicted. This worked (though it showed other errors not worth pasting here).
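If we want the playbook to do that itself, the drain task could pass the force flags directly. A rough sketch (assuming the node names come from an inventory group called nodes, which is only a guess at how the current task loops, and using the kubectl drain flags that exist in the 1.15/1.17 kubectl we ship):

- name: Drain all nodes (force-evict pods not managed by a controller)
  command: >
    kubectl drain {{ item }}
    --force --ignore-daemonsets --delete-local-data
  loop: "{{ groups['nodes'] }}"
  ignore_errors: yes

Newer kubectl releases rename --delete-local-data to --delete-emptydir-data, so that flag would need revisiting if we ever bump versions.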

I then installed k8s with the playbook again, and it was performing as expected. I ran the teardown script (on an empty cluster) and it worked without these errors.
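Side note on the calico failure earlier in that output: it is already covered by ignore_errors, but kubectl delete has an --ignore-not-found flag that would let that task pass cleanly even when some of the objects are already gone. Roughly (path taken straight from the task output above):

- name: Remove calico service
  command: kubectl delete -f /home/student/k8s-config/calico-etcd.yaml --ignore-not-found=true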

I intend to steal @seaneon's system and attempt this test again.

sgriffith3 commented 4 years ago

@seaneon We need to remember to run the playbook against all of Chad and Zach's machines from last week prior to tearing down Cloud B on Thursday.

seaneon commented 4 years ago

I can do one class Wednesday evening and the other on Thursday morning. Did we ever get @csfeeser's last results from the teardown attempt? I recall him saying it was successful?

sgriffith3 commented 4 years ago

He told us last night that his was successful. But he hadn't done many of the labs. Therefore, we need to verify that it works for all the machines. So we should probably do one at a time until we see at least 5 in a row that just work, so we have a bunch of backup machines to tear down after we have made some changes. Let's just plan on doing it Thursday morning. I will get started around 8 and hopefully have them all done by 9 ish.

seaneon commented 4 years ago

Okie dokie sounds like a plan!


sgriffith3 commented 4 years ago

This playbook ran successfully on 4 machines that students used, and allowed the main.yaml playbook to be used again. Any future errors on this playbook should be turned into a new issue.