Cluster is unstable (etcd and kubelet-related issues)

megastallman commented 6 years ago

Issue Report Template

Tectonic Version

1.7.5-tectonic.1

Environment

Bare metal. The etcd cluster was deployed automatically with the Tectonic installer web GUI.

Expected Behavior

The cluster should work.

Actual Behavior

The cluster hangs once every couple of days. A reboot or systemctl restart etcd-member helps for a while.

Reproduction Steps

Installed with the Tectonic installer web GUI
Deployed some users' manifests

Other Information

The alert-manager shows alerts like [FIRING:1] K8SApiserverDown (critical), and the cluster becomes slow or stops working. The output of journalctl -f shows:

Nov 24 13:38:28 node1.ololo.com etcd-wrapper[1627]: 2017-11-24 13:38:28.466859 W | rafthttp: health check for peer dd9ac7553c520ac2 could not connect: dial tcp: lookup node3.ololo.com on [::1]:53: read udp [::1]:46900->[::1]:53: read: connection refused
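The warning quoted above reports that the lookup of node3.ololo.com was sent to a resolver at [::1]:53, which refused the connection. As a rough sketch (not part of the original report), the following Python pulls the peer ID, hostname, and resolver address out of such a line and checks whether a hostname currently resolves from the node. The function names and the regular expression are hypothetical helpers written for this log format only.

```python
import re
import socket

# Pattern for the rafthttp health-check warning quoted in the report.
# This regex is an assumption based on that single log line.
PEER_DNS_FAILURE = re.compile(
    r"health check for peer (?P<peer>[0-9a-f]+) could not connect: "
    r"dial tcp: lookup (?P<host>\S+) on (?P<resolver>\S+): "
)


def parse_peer_dns_failure(line):
    """Return peer ID, hostname, and resolver from a warning line, or None."""
    match = PEER_DNS_FAILURE.search(line)
    if match is None:
        return None
    return {
        "peer": match.group("peer"),
        "host": match.group("host"),
        "resolver": match.group("resolver"),
    }


def can_resolve(host):
    """Return True if the system resolver can currently resolve host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Running can_resolve("node3.ololo.com") on each node when the alerts fire would show whether name resolution, rather than etcd itself, is what fails.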

Further restarts help for a while. I tried tuning the etcd timeouts and snapshot intervals, but that doesn't help much.
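For the timeout tuning mentioned above, a minimal sanity check might look like the sketch below. It assumes the rules of thumb from the etcd tuning guide as I recall them (election timeout at least 5x the heartbeat interval, at least 10x the peer round-trip time, and no more than 50000 ms); these thresholds should be verified against the documentation for the etcd version in use. The function name and parameters are hypothetical.

```python
def check_etcd_timeouts(heartbeat_ms, election_ms, rtt_ms):
    """Return a list of warnings for a heartbeat/election timeout pair.

    The thresholds are assumptions taken from etcd tuning guidance,
    not values confirmed in this thread.
    """
    problems = []
    if election_ms < 5 * heartbeat_ms:
        problems.append("election timeout is less than 5x the heartbeat interval")
    if election_ms < 10 * rtt_ms:
        problems.append("election timeout is less than 10x the peer round-trip time")
    if election_ms > 50000:
        problems.append("election timeout exceeds 50000 ms")
    return problems
```

With the etcd defaults of a 100 ms heartbeat and a 1000 ms election timeout, and a peer round-trip time of about 1 ms, this check returns no warnings, which would be consistent with timeout tuning not being the root cause here.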

kbrwn commented 6 years ago

@megastallman There isn't much advice I can offer without logs from the cluster, but one possibility is that the controller nodes lack the resources to run both the control plane and etcd. You might want to review the hardware requirements.

https://coreos.com/etcd/docs/latest/op-guide/hardware.html
https://support.coreos.com/hc/en-us/articles/115004021914

megastallman commented 6 years ago

Thanks @kbrwn! As for my hardware, those are 4 Dell PowerEdge R430 servers, each with 64 (8*8) cores, 320 GB RAM, 10 Gbit/s, and 1700 GB of RAID10 HDD. They should be sufficient to handle that load.

coreos / tectonic-forum