Closed by megastallman 6 years ago
@megastallman There isn't much advice I can offer without logs from the cluster, but one possibility is that the controller nodes are not sufficient for running both the control plane and etcd. You might want to review the hardware requirements.
https://coreos.com/etcd/docs/latest/op-guide/hardware.html https://support.coreos.com/hc/en-us/articles/115004021914
Thanks @kbrwn! As for my hardware, those are 4 Dell PowerEdge R430 servers, each with 64 (8*8) cores, 320GB RAM, 10GBit/s networking and 1700GB of RAID10 HDD. They should be sufficient to handle that load.
Issue Report Template
Tectonic Version
1.7.5-tectonic.1
Environment
Baremetal. The etcd-cluster is deployed automatically with the Tectonic installer WEB-GUI.
Expected Behavior
The cluster should work.
Actual Behavior
Hangs about once every couple of days. Rebooting or running systemctl restart etcd-member helps for some time.
Reproduction Steps
Other Information
The alert-manager shows alerts like [FIRING:1] K8SApiserverDown (critical), and the cluster becomes slow or stops working. The journalctl -f output says:
Nov 24 13:38:28 node1.ololo.com etcd-wrapper[1627]: 2017-11-24 13:38:28.466859 W | rafthttp: health check for peer dd9ac7553c520ac2 could not connect: dial tcp: lookup node3.ololo.com on [::1]:53: read udp [::1]:46900->[::1]:53: read: connection refused
Further restarts help for some time. I tried tuning the etcd timeouts and snapshot intervals, but that doesn't help much.
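The journal line quoted in this report shows the peer hostname lookup being sent to a resolver on [::1]:53 and refused, which suggests the failure happens at name resolution rather than inside etcd itself. Below is a minimal Python sketch for pulling those fields out of such lines when scanning a longer journal. The function name parse_lookup_failure, the regex, and the local_resolver flag are illustrative assumptions and not part of etcd or Tectonic. The pattern is written against the single log line in this report only.

```python
import re

# Sample line copied from the report; used here only for illustration.
SAMPLE = (
    "Nov 24 13:38:28 node1.ololo.com etcd-wrapper[1627]: "
    "2017-11-24 13:38:28.466859 W | rafthttp: health check for peer "
    "dd9ac7553c520ac2 could not connect: dial tcp: lookup node3.ololo.com "
    "on [::1]:53: read udp [::1]:46900->[::1]:53: read: connection refused"
)

# Hypothetical pattern: matches the rafthttp health-check warning when the
# underlying error is a failed DNS lookup ("lookup <host> on <resolver>: ...").
LOOKUP_RE = re.compile(
    r"health check for peer (?P<peer>[0-9a-f]+) could not connect: "
    r"dial tcp: lookup (?P<host>\S+) on (?P<resolver>\S+?): "
)


def parse_lookup_failure(line):
    """Return peer ID, hostname, resolver and reason, or None if no match."""
    match = LOOKUP_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # The last ": "-separated segment is the low-level error text.
    info["reason"] = line.rsplit(": ", 1)[-1].strip()
    # Drop the port and IPv6 brackets to see which machine was asked.
    resolver_host = info["resolver"].rsplit(":", 1)[0].strip("[]")
    info["local_resolver"] = resolver_host in ("::1", "127.0.0.1", "localhost")
    return info


if __name__ == "__main__":
    print(parse_lookup_failure(SAMPLE))
```

When local_resolver is true and the reason is "connection refused", it may be worth checking whether anything is listening on port 53 of the node and what /etc/resolv.conf points at before tuning etcd further.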