coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0

BUG Report: Tectonic Console and API fail after a few days #1199

Open nkrgovic opened 7 years ago

nkrgovic commented 7 years ago

Versions

$ kubectl get pods
error: error fetching provider config: invalid character '<' looking for beginning of value

When SSHing to the master I get:

Container Linux by CoreOS stable (1409.5.0)
Update Strategy: No Reboots
Failed Units: 12
  bootkube.service
  sshd@3013-10.0.49.161:22-221.229.166.44:4133.service
  sshd@3015-10.0.49.161:22-221.229.166.44:3850.service
  sshd@3017-10.0.49.161:22-221.229.166.44:1468.service
  sshd@3019-10.0.49.161:22-221.229.166.44:1038.service
  sshd@3020-10.0.49.161:22-221.229.166.44:1361.service
  sshd@3023-10.0.49.161:22-221.229.166.44:3774.service
  sshd@3029-10.0.49.161:22-221.229.166.44:4005.service
  sshd@3030-10.0.49.161:22-221.229.166.44:1339.service
  sshd@3033-10.0.49.161:22-221.229.166.44:1427.service
  sshd@3034-10.0.49.161:22-221.229.166.44:1949.service
  sshd@3037-10.0.49.161:22-221.229.166.44:3433.service

No errors in the logs, which looks strange:

bootkube.service - Bootstrap a Kubernetes cluster
   Loaded: loaded (/etc/systemd/system/bootkube.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2017-06-27 10:07:19 UTC; 8s ago
  Process: 14767 ExecStart=/usr/bin/bash /opt/tectonic/bootkube.sh (code=exited, status=127)
 Main PID: 14767 (code=exited, status=127)
      CPU: 1ms

Jun 27 10:07:19 ip-10-0-49-161 systemd[1]: Starting Bootstrap a Kubernetes cluster...
Jun 27 10:07:19 ip-10-0-49-161 bash[14767]: /usr/bin/bash: /opt/tectonic/bootkube.sh: No such file or directory
Jun 27 10:07:19 ip-10-0-49-161 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
Jun 27 10:07:19 ip-10-0-49-161 systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
Jun 27 10:07:19 ip-10-0-49-161 systemd[1]: bootkube.service: Unit entered failed state.
Jun 27 10:07:19 ip-10-0-49-161 systemd[1]: bootkube.service: Failed with result 'exit-code'.

And indeed, there was no such file in /opt/tectonic.

I tried to restart the service:

ip-10-0-49-161 tectonic # journalctl -xe
-- Subject: Unit sshd@3796-10.0.49.161:22-116.31.116.52:61377.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit sshd@3796-10.0.49.161:22-116.31.116.52:61377.service has finished starting up.
--
-- The start-up result is done.
Jun 27 10:07:43 ip-10-0-49-161 sshd[14847]: pam_tally2(sshd:auth): Tally overflowed for user root
Jun 27 10:07:43 ip-10-0-49-161 sshd[14847]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=116.31.116.52 user=root
Jun 27 10:07:45 ip-10-0-49-161 sshd[14847]: Failed password for root from 116.31.116.52 port 61377 ssh2
Jun 27 10:07:45 ip-10-0-49-161 sshd[14847]: pam_tally2(sshd:auth): Tally overflowed for user root
Jun 27 10:07:47 ip-10-0-49-161 sshd[14847]: Failed password for root from 116.31.116.52 port 61377 ssh2
Jun 27 10:07:48 ip-10-0-49-161 sshd[14847]: pam_tally2(sshd:auth): Tally overflowed for user root
Jun 27 10:07:50 ip-10-0-49-161 sshd[14847]: Failed password for root from 116.31.116.52 port 61377 ssh2
Jun 27 10:07:50 ip-10-0-49-161 sshd[14847]: Received disconnect from 116.31.116.52 port 61377:11: [preauth]
Jun 27 10:07:50 ip-10-0-49-161 sshd[14847]: Disconnected from 116.31.116.52 port 61377 [preauth]
Jun 27 10:07:50 ip-10-0-49-161 sshd[14847]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=116.31.116.52 user=root
Jun 27 10:07:54 ip-10-0-49-161 kubelet-wrapper[999]: I0627 10:07:54.384575 999 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/55430af2-49c2-11e7-b0c6-06c1161fe869-default-token-
Jun 27 10:07:56 ip-10-0-49-161 kubelet-wrapper[999]: I0627 10:07:56.390079 999 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/5542b484-49c2-11e7-b0c6-06c1161fe869-default-token-
Jun 27 10:07:56 ip-10-0-49-161 kubelet-wrapper[999]: I0627 10:07:56.390868 999 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/configmap/5542b484-49c2-11e7-b0c6-06c1161fe869-flannel-cfg
Jun 27 10:07:57 ip-10-0-49-161 kubelet-wrapper[999]: I0627 10:07:57.392566 999 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/55409bc4-49c2-11e7-b0c6-06c1161fe869-default-token-
Jun 27 10:07:57 ip-10-0-49-161 kubelet-wrapper[999]: I0627 10:07:57.392603 999 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/55409bc4-49c2-11e7-b0c6-06c1161fe869-secrets" (spec
Jun 27 10:08:00 ip-10-0-49-161 kubelet-wrapper[999]: I0627 10:08:00.399164 999 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/a6536990-49c2-11e7-bdcc-029ffa679809-default-token-
Jun 27 10:08:02 ip-10-0-49-161 systemd[1]: Starting Bootstrap a Kubernetes cluster...
-- Subject: Unit bootkube.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit bootkube.service has begun starting up.
Jun 27 10:08:02 ip-10-0-49-161 bash[14955]: /usr/bin/bash: /opt/tectonic/bootkube.sh: No such file or directory
Jun 27 10:08:02 ip-10-0-49-161 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
Jun 27 10:08:02 ip-10-0-49-161 systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
-- Subject: Unit bootkube.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit bootkube.service has failed.
--
-- The result is failed.
Jun 27 10:08:02 ip-10-0-49-161 systemd[1]: bootkube.service: Unit entered failed state.
Jun 27 10:08:02 ip-10-0-49-161 systemd[1]: bootkube.service: Failed with result 'exit-code'.

I tried rebooting the machine, since it looked stuck and I wanted to avoid a race condition.

After the reboot, I still can't get it to work:

Container Linux by CoreOS stable (1409.5.0)
Update Strategy: No Reboots
Failed Units: 1
  bootkube.service
core@ip-10-0-49-161 ~ $
core@ip-10-0-49-161 ~ $ sudo -i
Update Strategy: No Reboots
Failed Units: 1
  bootkube.service
ip-10-0-49-161 ~ # systemctl restart bootkube
Job for bootkube.service failed because the control process exited with error code.
See "systemctl status bootkube.service" and "journalctl -xe" for details.
ip-10-0-49-161 ~ # journalctl -xe
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.729086177Z" level=error msg="Handler for POST /v1.24/containers/dd06d99ed1837f16bb347793b7e8de46d40695e94dd3384aeacecdc8b42c9339/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.733673 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.737150388Z" level=error msg="Handler for POST /v1.24/containers/3c3054cfc2bc8064534134a937fcc22b04ea7a3985996e4a329019a03889fb43/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.742216 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.745815588Z" level=error msg="Handler for POST /v1.24/containers/8a6185d3f452285874558c0da3fbcd3413c137f7ca38766a3d24c6c173a4e826/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.751198 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.754568827Z" level=error msg="Handler for POST /v1.24/containers/92d04ba51d5b88056146a77c4bc505327bcae4473989325fcf0308a71ab1f753/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.759356 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.762743451Z" level=error msg="Handler for POST /v1.24/containers/9d5bf41c73f4b065ff2948dab76a7e0528df763d18793835551ef4dd3c23071e/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.768141 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.771432766Z" level=error msg="Handler for POST /v1.24/containers/4218907cb3e98466dc6d689fdac344435e963a44c57faf1658eb5a8d0bcf5c61/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.776668 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.779932982Z" level=error msg="Handler for POST /v1.24/containers/6cc3193fe16bff23b3ba9edaaf71415fdbc101f4c6fc39025ab1b3ae66507d72/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.784838 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.788088140Z" level=error msg="Handler for POST /v1.24/containers/fc932f6cf7586a72f3da190f7b53e47333c2f1390145a72da178934d6bd2812d/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.793076 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.796372123Z" level=error msg="Handler for POST /v1.24/containers/e7db7899ce1b4975908a5255ca97ed41cc0996ce39edea7ccd32655e7f39e07e/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: W0627 10:14:07.801319 972 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "container-linux-update-agent-ds-7tn86_tectonic-system": Unexpect
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: with error: exit status 1
Jun 27 10:14:07 ip-10-0-49-161 dockerd[988]: time="2017-06-27T10:14:07.804660493Z" level=error msg="Handler for POST /v1.24/containers/830e981117e82f34b406db837368e64fc542fbeadc00ff6d6562cc92a639a439/stop returned error: Co
Jun 27 10:14:07 ip-10-0-49-161 kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: E0627 10:14:07.954091 972 cni.go:257] Error adding network: open /run/flannel/subnet.env: no such file or directory
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: E0627 10:14:07.954121 972 cni.go:211] Error while adding to cni network: open /run/flannel/subnet.env: no such file or directory
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: I0627 10:14:07.963367 972 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/5542b484-49c2-11e7-b0c6-06c1161fe869-default-token-
Jun 27 10:14:07 ip-10-0-49-161 kubelet-wrapper[972]: I0627 10:14:07.963482 972 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/configmap/5542b484-49c2-11e7-b0c6-06c1161fe869-flannel-cfg
Jun 27 10:14:08 ip-10-0-49-161 kubelet-wrapper[972]: E0627 10:14:08.017668 972 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "containe
Jun 27 10:14:08 ip-10-0-49-161 kubelet-wrapper[972]: E0627 10:14:08.017713 972 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "container-linux-update-agent-ds-7tn86_tectonic-system(a6536990-49c2-11e7-bdcc-029ffa679
Jun 27 10:14:08 ip-10-0-49-161 kubelet-wrapper[972]: E0627 10:14:08.017729 972 kuberuntime_manager.go:619] createPodSandbox for pod "container-linux-update-agent-ds-7tn86_tectonic-system(a6536990-49c2-11e7-bdcc-029ffa67
Jun 27 10:14:08 ip-10-0-49-161 kubelet-wrapper[972]: E0627 10:14:08.017759 972 pod_workers.go:182] Error syncing pod a6536990-49c2-11e7-bdcc-029ffa679809 ("container-linux-update-agent-ds-7tn86_tectonic-system(a6536990-
Jun 27 10:14:08 ip-10-0-49-161 kubelet-wrapper[972]: I0627 10:14:08.256276 972 kuberuntime_manager.go:458] Container {Name:kube-flannel Image:quay.io/coreos/flannel:v0.7.1-amd64 Command:[/opt/bin/flanneld --ip-masq --ku
Jun 27 10:14:08 ip-10-0-49-161 kubelet-wrapper[972]: I0627 10:14:08.256527 972 kuberuntime_manager.go:742] checking backoff for container "kube-flannel" in pod "kube-flannel-sxj3t_kube-system(5542b484-49c2-11e7-b0c6-06c
Jun 27 10:14:08 ip-10-0-49-161 kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)

What you expected to happen?

I didn't expect anything to change. We made no changes to the config maps or to the infrastructure - we just added the apps, and then the console and the API stopped responding. I guess it's due to an automated update? Or something changed in AWS... In any case it stopped working, and I have no idea where to start looking for issues.

How to reproduce it (as minimally and precisely as possible)?

Sadly, no idea. That's the issue - it failed on its own. :(

Anything else we need to know?

nkrgovic commented 7 years ago

The issue is that both the Tectonic console and the API are non-responsive. The web interface gives a 503, and kubectl acts like it's getting an XML/HTML response:

kubectl get po
error: error fetching provider config: invalid character '<' looking for beginning of value

Tried running an update from the shell on the masters:

update_engine_client -update

I0629 13:32:41.905858 6770 update_engine_client.cc:247] Initiating update check and install.
I0629 13:32:41.907613 6770 update_engine_client.cc:252] Waiting for update to complete.
LAST_CHECKED_TIME=1498743162 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_IDLE NEW_VERSION=0.0.0 NEW_SIZE=0
E0629 13:32:47.041983 6770 update_engine_client.cc:190] Update failed.

but it fails.

Tried setting the MTU of all the interfaces to 1500, but that didn't help either. I'd really appreciate at least a hint on where to start...

nkrgovic commented 7 years ago

Also seeing very strange reports in dmesg, including out-of-memory kills - on an m4.xlarge machine with 16 GB:

nodemask=(null), order=0, oom_score_adj=999
[ 2275.744749] node_exporter cpuset=872d9b1d13a2ac606db6668e660db171a701bac5207f6d023e5732f66fda4c26 mems_allowed=0
[ 2275.749023] CPU: 2 PID: 7842 Comm: node_exporter Not tainted 4.11.6-coreos-r1 #1
[ 2275.752107] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 2275.754679] Call Trace:
[ 2275.755736]  dump_stack+0x63/0x90
[ 2275.757160]  dump_header+0x9f/0x227
[ 2275.759022]  oom_kill_process+0x21c/0x3f0
[ 2275.760885]  out_of_memory+0x11a/0x4b0
[ 2275.762617]  mem_cgroup_out_of_memory+0x4b/0x80
[ 2275.764727]  mem_cgroup_oom_synchronize+0x2f9/0x320
[ 2275.767076]  ? high_work_func+0x20/0x20
[ 2275.768856]  pagefault_out_of_memory+0x36/0x80
[ 2275.772241]  mm_fault_error+0x8c/0x190
[ 2275.774227]  ? handle_mm_fault+0xd1/0x240
[ 2275.776160]  __do_page_fault+0x44f/0x4b0
[ 2275.777902]  do_page_fault+0x22/0x30
[ 2275.779483]  page_fault+0x28/0x30
[ 2275.780951] RIP: 0033:0x40907a
[ 2275.782312] RSP: 002b:000000c422c44c90 EFLAGS: 00010206
[ 2275.784552] RAX: 00000000008643a0 RBX: 00000000008b5820 RCX: 000000c422b59040
[ 2275.787656] RDX: 00000000008643a0 RSI: 000000c422bbf5e0 RDI: 0000000000000002
[ 2275.790725] RBP: 000000c422c44ce0 R08: 0000000000000008 R09: 0000000000000000
[ 2275.793788] R10: 00007f0d919ea420 R11: 000000c422bbe000 R12: 0000000000000010
[ 2275.796856] R13: 00000000000015e0 R14: 000000000000015f R15: 0000000000000003
[ 2275.799919] Task in /kubepods/burstable/pod828f97ee-49c2-11e7-bdcc-029ffa679809/872d9b1d13a2ac606db6668e660db171a701bac5207f6d023e5732f66fda4c26 killed as a result of limit of /kubepods/burstable/pod828f97ee-49c2-11e7-bdcc-029ffa679809
[ 2275.809162] memory: usage 51200kB, limit 51200kB, failcnt 3023950
[ 2275.811839] memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
[ 2275.814862] kmem: usage 2784kB, limit 9007199254740988kB, failcnt 0
[ 2275.817700] Memory cgroup stats for /kubepods/burstable/pod828f97ee-49c2-11e7-bdcc-029ffa679809: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 2275.828292] Memory cgroup stats for /kubepods/burstable/pod828f97ee-49c2-11e7-bdcc-029ffa679809/c98f003dd34e0d17c175d13c1defbd5fd7d9ab3497a44124b3eed0f44d812e32: cache:0KB rss:40KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:40KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 2275.841046] Memory cgroup stats for /kubepods/burstable/pod828f97ee-49c2-11e7-bdcc-029ffa679809/872d9b1d13a2ac606db6668e660db171a701bac5207f6d023e5732f66fda4c26: cache:204KB rss:47996KB rss_huge:24576KB mapped_file:76KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:47996KB inactive_file:204KB active_file:0KB unevictable:0KB
[ 2275.855121] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2275.858832] [ 1602]     0  1602      256        1       5       2        0          -998 pause
[ 2275.862627] [ 1637]     0  1637    14819     9901      33       5        0           999 node_exporter
[ 2275.866738] Memory cgroup out of memory: Kill process 1637 (node_exporter) score 1713 or sacrifice child
[ 2275.871171] Killed process 1637 (node_exporter) total-vm:59276kB, anon-rss:39604kB, file-rss:0kB, shmem-rss:0kB
[ 2276.540958] oom_reaper: reaped process 1637 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

sym3tri commented 7 years ago

A couple things I would check:

Validate that the etcd cluster is up and healthy.

Have you tried running docker ps on the machines? If the expected kubernetes components are not running you may need to read through the disaster recovery steps for the scheduler (we have some tooling to help automate this coming soon).

https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html
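
For reference, a minimal sketch of those checks, assuming SSH access to the etcd and master nodes (the endpoint is a placeholder):

# on an etcd node: check cluster health (v2 etcdctl, as used elsewhere in this thread)
etcdctl --endpoints=http://127.0.0.1:2379 cluster-health

# on a master: confirm the core control-plane containers are present
docker ps | grep -E 'apiserver|scheduler|controller-manager'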

nkrgovic commented 7 years ago

The issue I have is that even kubectl isn't responding - it acts as if it's receiving something like HTML/XML:

$ kubectl get deployment kube-scheduler -o yaml
error: error fetching provider config: invalid character '<' looking for beginning of value

I've tried rebooting the masters; it didn't help.

How would you validate the etcd cluster status?

sym3tri commented 7 years ago

How would you validate the etcd cluster status?

SSH into the etcd nodes. Check the status with systemctl status ...

Use journalctl to view the logs.

Use etcdctl to see if you can read/write values to etcd.
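
A minimal read/write check along those lines (the key name is just an example):

etcdctl set /healthcheck-test ok    # write via the v2 API
etcdctl get /healthcheck-test       # should print: ok
etcdctl rm /healthcheck-test        # clean up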

nkrgovic commented 7 years ago

Good call! Though no effect...

core@ip-10-0-53-106 ~ $ ncat 10.0.90.80 2380
Ncat: Connection timed out.
core@ip-10-0-53-106 ~ $ ncat 10.0.112.77 2380
Ncat: Connection timed out.
core@ip-10-0-53-106 ~ $ ncat 10.0.67.36 2380
Ncat: Connection timed out.

Port 2379 works, so I tried to read from the daemon. Found only one thing:

core@production-etcd-0 ~ $ etcdctl ls /coreos.com
/coreos.com/updateengine
core@production-etcd-0 ~ $ etcdctl ls /coreos.com/updateengine
/coreos.com/updateengine/rebootlock
core@production-etcd-0 ~ $ etcdctl ls /coreos.com/updateengine/rebootlock
/coreos.com/updateengine/rebootlock/semaphore

Guessing it's stuck in an update. On machine 1 (etcd-0) I get this:

production-etcd-0 ~ # ps ax | grep etc
  829 ?        Ssl  156:54 /usr/local/bin/etcd --name=etcd --discovery-srv=piratetech.io --advertise-client-urls=http://production-etcd-0.piratetech.io:2379 --initial-advertise-peer-urls=http://production-etcd-0.piratetech.io:2380 --listen-client-urls=http://0.0.0.0:2379 --listen-peer-urls=http://0.0.0.0:2380

Try to get status:

production-etcd-0 ~ # systemctl status etcd
Unit etcd.service could not be found.

Moving on to the other machines (etcd-1 and etcd-2), I find it stuck:

production-etcd-1 ~ # systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; static; vendor prese
  Drop-In: /run/systemd/system/etcd.service.d
           └─10-oem.conf
   Active: inactive (dead)

OK, I try to restart a service:

production-etcd-1 ~ # systemctl restart etcd

Broadcast message from locksmithd at 2017-06-30 13:37:41.272140822 +0000 UTC: System reboot in 5 minutes!

After the reboot, I SSH back into both machines, and etcd is no longer visible:

production-etcd-1 ~ # ps ax | grep etc
 3151 pts/0    S+   0:00 grep --colour=auto etc
production-etcd-1 ~ # systemctl start etcd
Failed to restart etcd.service: Unit etcd.service not found.

Not running, service not registered.

Tried an update:

production-etcd-1 ~ # update_engine_client -update
I0630 13:58:56.631924 4616 update_engine_client.cc:247] Initiating update check and install.
I0630 13:58:56.634253 4616 update_engine_client.cc:252] Waiting for update to complete.
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.100423 CURRENT_OP=UPDATE_STATUS_DOWNLOADING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.251065 CURRENT_OP=UPDATE_STATUS_DOWNLOADING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.441879 CURRENT_OP=UPDATE_STATUS_DOWNLOADING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.552351 CURRENT_OP=UPDATE_STATUS_DOWNLOADING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.753207 CURRENT_OP=UPDATE_STATUS_DOWNLOADING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_FINALIZING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_FINALIZING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_FINALIZING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_FINALIZING NEW_VERSION=1409.5.0 NEW_SIZE=277339901
LAST_CHECKED_TIME=1498831137 PROGRESS=0.000000 CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT NEW_VERSION=1409.5.0 NEW_SIZE=277339901
I0630 14:00:12.688184 4616 update_engine_client.cc:194] Update succeeded -- reboot needed.

So I reboot again, and log in again:

core@ip-10-0-53-106 ~/.ssh $ ssh core@10.0.112.77
Enter passphrase for key '/home/core/.ssh/id_rsa':
Last login: Fri Jun 30 13:58:42 UTC 2017 from 10.0.53.106 on pts/0
Container Linux by CoreOS stable (1409.5.0)
core@production-etcd-1 ~ $ uptime
 14:01:18 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00
core@production-etcd-1 ~ $ sudo -i
production-etcd-1 ~ # ps ax | grep etc
  873 pts/0    S+   0:00 grep --colour=auto etc
production-etcd-1 ~ # systemctl start etcd
Failed to start etcd.service: Unit etcd.service not found.

Is this a bug in the update?

The Tectonic console web client still says "Ingress Error" in the title, and just displays the Tectonic logo with "503 Service Unavailable" under it.

So the containers on the masters are running and the web interface is alive, but not connecting. With etcd down, that makes sense. kubectl, likewise, has nowhere to connect.

$ kubectl get po
error: error fetching provider config: invalid character '<' looking for beginning of value

And I'm, again, out of ideas... I'll try another update on Monday, but it looks like this is related to the etcd issue, and to the code rather than to how we used it. We've just deployed a few apps, nothing special. That's the biggest issue - we were hoping to push this into production...

I'm getting a good lesson on how the system works, but I'd love it to finally get to work again :) Any more ideas from the community (or anyone from the coreos team) are VERY appreciated at this point.

nkrgovic commented 7 years ago

Re-ran update on the etcd nodes, nothing to be updated - no effect.

Simply put, after the update the etcd nodes no longer have the etcd service installed. Of the 3 nodes, only 1 has etcd running, and it's empty other than the semaphore, which looks like a leftover notification from the update process. I don't see a way of resolving this on my own.

Is there a way to manually re-run the installer, or to add etcd back to the etcd nodes? This is a bit silly. If anyone has any ideas I'd appreciate the help - but it looks like the issue is in the installer-deployed config / update procedure.

nkrgovic commented 7 years ago

etcd is still down, but now kubectl returns a new error:

$ kubectl get po
error: error fetching provider config: Get https://production.piratetech.io/identity/.well-known/openid-configuration: x509: certificate signed by unknown authority

I actually tried re-running the installer, to see if it would connect and could do something, and got the same error in the console of the machine I was running the installer from, along with a message to "contact tectonic support".

bobhenkel commented 7 years ago

I was getting

$ kubectl get pods
error: error fetching provider config: invalid character '<' looking for beginning of value

What I did to trigger this was go into AWS and scale my worker auto scaling group up from 3 to 6. For roughly a 5-minute window I kept getting "error: error fetching provider config: invalid character '<' looking for beginning of value". Then kubectl started getting good responses back, i.e. kubectl get pods worked.
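
For illustration, the equivalent scaling operation from the AWS CLI would look roughly like this (the auto scaling group name is a placeholder):

aws autoscaling set-desired-capacity --auto-scaling-group-name <worker-asg-name> --desired-capacity 6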

hapnermw commented 7 years ago

I had the same problem with an AWS cluster installed on 7/6 using the AWS installer GUI, with 1 master and 2 workers and default configs.

After install, console and kubectl worked without issue. Added one additional static user. Came back on 7/14:

console: ingress error 503
kubectl: error: error fetching provider config: invalid character '<' looking for beginning of value

All looks good at the AWS layer.

What's up with this? A simple cluster that falls over while 'idling' for a few days is not good!

There doesn't appear to be a way to recover from this.

I'm guessing that this is some form of tectonic identity problem. None of the master troubleshooting info was any help. Nothing obvious in the logs.

There doesn't appear to be a way to use the apiserver locally from within the master.

Rebooted the master - the ELB console health check now fails, so the cluster appears to be hosed.

Destroyed and re-installed cluster.

squat commented 7 years ago

The reason kubectl complains with kubectl error: error fetching provider config: invalid character '<' looking for beginning of value is that it cannot reach the OIDC provider, i.e. dex. kubectl expects a JSON response containing the OIDC well-known configuration, but instead receives an HTML response from the ingress default backend because dex is not available at the requested address. To debug, you need to check if dex is running:

curl -k <your-console-url>/identity/.well-known/openid-configuration

If this works, i.e. returns valid OIDC JSON, then we can debug further. If it returns HTML, then either dex is not running or ingress is broken. SSH onto your nodes and run docker ps | grep identity to see if identity is running. If it's not, then there is an issue with dex dying.
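
Put together, the two checks look roughly like this (the console URL and node address are placeholders):

curl -k https://<your-console-url>/identity/.well-known/openid-configuration   # expect JSON, not HTML
ssh core@<node-ip> 'docker ps | grep identity'                                 # is the dex/identity container running?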

hapnermw commented 7 years ago

Thanks squat,

All is currently working. I'll proceed with your steps if it fails again.

If identity does die, how can it be restarted, given that all cluster administration relies on it running?

Here's the identity container running on the worker it is provisioned to.

core@ip-10-0-71-40 ~ $ docker ps | grep identity
4177e1e2ef52   quay.io/coreos/dex@sha256:ceee787f11b20e3a5b9e562a7d60afc51b1bb13fb8ca576046a24d0ebdd25d88   "/usr/local/bin/dex s"   4 hours ago   Up 4 hours   k8s_tectonic-identity_tectonic-identity-3807190471-90zhb_tectonic-system_dcda0bf4-6976-11e7-9221-0e8a57fba57e_0
8155c7cabe21   gcr.io/google_containers/pause-amd64:3.0   "/pause"   4 hours ago   Up 4 hours   k8s_POD_tectonic-identity-3807190471-90zhb_tectonic-system_dcda0bf4-6976-11e7-9221-0e8a57fba57e_0

squat commented 7 years ago

@hapnermw if identity does die, then our goal would be to inspect the logs to see what caused the failure: docker logs 4177e1e2ef52 > identity-logs.txt. If the cluster fails the same way but identity is still running, then the problem is likely an issue with ingress or something involved in routing the request to the identity pod, e.g. the ingress ELB, ingress DNS, etc.
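
To look at the routing pieces mentioned here, a rough sketch from any machine with network access (hostnames are placeholders):

nslookup <your-console-domain>          # does the console record still resolve to the ingress ELB?
curl -kv https://<your-console-url>/    # does the ELB/ingress answer at all, and with which backend?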

hapnermw commented 7 years ago

OK, thanks.

Since the identity pod's restart policy is 'Always', I'm assuming that either some condition got the identity pod into a restart-failure loop, or some Kubernetes problem took down identity along with other things.

Since rebooting the master resulted in a master that failed its ELB health check, the latter may be the case. I'm assuming the cluster is supposed to survive a master reboot. I'll give this a try now to see what happens.

I will not be trusting Tectonic for real work until I understand what happened and how to recover from it.

When this problem occurred earlier, I did read the 'Failure domains of Tectonic Identity'. It asserts that

'If Tectonic Identity is down and the user is unable to login via the Tectonic console they can make use of the kubeconfig generated by the installer which can be found in the assets folder. This will allow the user to access the kubernetes API directly.'

This appears to refer to the use of kubectl to recover from a bad identity configuration as described at https://coreos.com/tectonic/docs/latest/admin/assets-zip.html

If the identity service itself has failed and won't restart, this doesn't help. Perhaps someone in Tectonic docs should clarify the 'This will allow the user to access the kubernetes API directly.' statement.
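
For what it's worth, a sketch of that direct access (the exact kubeconfig path inside the downloaded assets may differ):

kubectl --kubeconfig=./assets/auth/kubeconfig get nodes
kubectl --kubeconfig=./assets/auth/kubeconfig -n tectonic-system get pods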

miracle2k commented 7 years ago

When I had this problem - twice, after setting up a new cluster - my impression was that etcd went down. Set up a cluster with three masters (which would then give me clustered etcd), and since then, it's up.

hapnermw commented 7 years ago

Thanks miracle2k,

I understand that multiple master nodes (at least 3) are required for practical cluster availability, and that partitioning etcd onto separate nodes is useful for scaling.

On the other hand, if single-master clusters have corruption issues (as opposed to availability/scaling issues), that is not good. Effectively, multiple masters may be, in some way, masking master corruption issues.

fassmus commented 7 years ago

Had the exact same issue. Set up a new Tectonic 1.6.7-tectonic.2 cluster with 1 etcd node, 1 master, and 3 workers on AWS using the graphical installer, and deployed some custom services. A couple of days later the Tectonic console stopped responding and kubectl came back with:

error: error fetching provider config: Get https://.../identity/.well-known/openid-configuration: EOF

Checking each node's docker processes showed that only the bare minimum of k8s services were running. No identity, no Tectonic console, and none of the custom services.

Looking at the etcd server showed that it was empty except for /coreos.com/updateengine.

Rebooting each machine did not help. So finally I removed /opt/tectonic/init_bootkube.done from the master, followed by:

systemctl start bootkube
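
For reference, the whole recovery sequence on the master looks roughly like this (the journalctl line is only a suggested way to watch progress, not part of the original steps):

sudo rm /opt/tectonic/init_bootkube.done   # let the bootkube unit run again
sudo systemctl start bootkube              # re-run the bootstrap
journalctl -u bootkube -f                  # optional: follow the bootstrap logs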

Now my cluster is up and running and all my custom services are back. So far so good.

But, rechecking etcd shows that still only /coreos.com/updateengine is visible. I have no clue where the cluster state is currently stored. Any ideas?
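
One possible explanation (not confirmed in this thread) is that the data simply isn't visible to the v2 etcdctl commands used above: Kubernetes 1.6+ stores its state through the etcd v3 API, whose keys don't show up under etcdctl ls. A hedged check, assuming a v3-capable etcdctl on the node:

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 get / --prefix --keys-only | head -20   # v3 keys, e.g. /registry/...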

Still not sure if this is a permanent solution.

lfittl commented 7 years ago

@fassmus Thank you for sharing how you resolved this - I ran into the exact same issue with the Tectonic console and identity being unavailable, a few hours after upgrading Tectonic (but not sure if related).

Re-running bootkube seems to have fixed things, and the cluster is again in an operating state.

I assume there must be an explanation, though, for what's happening? (I've had a really hard time debugging this, let alone finding any log files that explain what's going on.)

jolcese commented 7 years ago

Facing the exact same problem. After a few days, the cluster on AWS stops responding. Using version v1.7.5+coreos.1.