coreos / tectonic-forum


3 controllers with 3 worker nodes: Tectonic failing with API errors #186

Open rushins opened 7 years ago

rushins commented 7 years ago

I am testing a deployment with 3 masters (controllers) and 3 workers on Tectonic 1.63, and the installation is failing at the API server. We are using the 10-node free license. Is there a limitation that only one master (controller) is supported? I tried 1 master with 5 workers and it worked like a charm, but 3 masters with 3 workers always fails.

Can you share some information on why this is failing?

I get this error.

Have you seen this error?

sudo tail -300 bootstrap-kube-apiserver-ld4601.wdf.sap.corp_kube-system_kube-apiserver-8c7933423beaba4d7035b90aee734b7f5369e05aabbb563a6df4b860ca1e517d.log
{"log":"Error: error waiting for etcd connection: timed out waiting for the condition\n","stream":"stderr","time":"2017-08-24T19:24:43.141656508Z"}
{"log":"Error: error waiting for etcd connection: timed out waiting for the condition\n","stream":"stderr","time":"2017-08-24T19:24:43.14171726Z"}

kbrwn commented 7 years ago

There is no limitation on the number of master nodes. We encourage more than one master to prevent clusters from needing disaster recovery in the event of a master node failure.

The issue is likely that the kube-apiserver pod on the additional master nodes is unable to connect to etcd. This could be due to network latency, or perhaps the etcd cluster is being taxed too heavily. Be sure that your etcd cluster follows the guidelines for running etcd v3: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md
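As a quick first check of basic connectivity, something along these lines from a master node can help (a sketch only; the hostnames are placeholders for your etcd members, assuming the default client port 2379 and no client TLS):

curl http://etcd-1.example.com:2379/health
curl http://etcd-2.example.com:2379/health
curl http://etcd-3.example.com:2379/health

Each healthy member should answer with {"health": "true"}.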

You can check the health of the etcd cluster via etcdctl cluster-health.
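For example (again a sketch; substitute your real client URLs for the placeholders):

etcdctl --endpoints http://etcd-1.example.com:2379,http://etcd-2.example.com:2379,http://etcd-3.example.com:2379 cluster-health

A healthy three-member cluster lists all three members as healthy; members reported as unreachable or unhealthy point back at the connectivity problem described above.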

rushins commented 7 years ago

Hi Brwn,

Yes, I did some troubleshooting of etcd and found what you described: etcdctl cluster-health does not show anything about the peer masters, so I see etcd as the bottleneck here. As for hardware, these are pretty high-end servers (512 GB RAM, 64 CPUs, and 2 TB of disk each), so I don't think there are any hardware-related issues; they are powerful bare-metal machines.

rushins commented 7 years ago

Hi Brwn.

Any updates? Can you tell from our infrastructure details why this would fail on such high-end hardware?

robszumski commented 7 years ago

Can you describe how these machines are networked together?

rushins commented 7 years ago

Hi Rob,

These machines are connected with 10 Gb network cards on a single VLAN, and all the servers have proper fully qualified domain names in DNS. Like I said, the same hosts worked with 1 master and 5 workers, but when we wipe everything (destroy the RAID) and retry with 3 masters and 3 workers, it always fails.

Are you on https://kubernetes.slack.com? If needed, I can also share my system through an https://tmate.io/ SSH console where you can see the errors and help.

Let me know.

robszumski commented 7 years ago

Are you connecting an external etcd cluster to this Tectonic install, or are you using the installer to create the etcd cluster? Are you using the self-hosted etcd option? (I would suggest not using that right now.)

It looks like there is definitely a networking issue that is preventing startup. It doesn't sound like a resource issue.
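One way to narrow down a networking issue between the masters (a sketch, assuming nc is available on the hosts and the default etcd ports, 2379 for clients and 2380 for peers; the hostname is a placeholder):

nc -vz master-2.example.com 2379
nc -vz master-2.example.com 2380

Run this from each master against the other two. If the peer port 2380 is blocked or unreachable in either direction, the three etcd members can never elect a leader, and kube-apiserver times out exactly as in the log above.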

rushins commented 7 years ago

I used the option for the installer to create the etcd cluster.

rushins commented 7 years ago

Any update on this, Rob?

rushins commented 7 years ago

Hello Rob,

Any update?

robszumski commented 7 years ago

When you SSH to those etcd machines, do you see anything that looks wrong in the journal?

https://coreos.com/tectonic/docs/latest/troubleshooting/etcd-nodes.html
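A minimal way to pull the recent etcd logs on each node (assuming the unit is named etcd-member, as it is on a standard Tectonic / Container Linux install):

journalctl -u etcd-member --no-pager -n 200

Repeated leader elections or connection errors there would suggest the members cannot reach each other.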

rushins commented 7 years ago

Hi Rob,

Yes, I checked etcd-member and it seems to be running, but I am not sure what else to check.

See the output below from the command you asked me to run.

systemctl status etcd-member
● etcd-member.service - etcd (System Application Container)
   Loaded: loaded (/usr/lib/systemd/system/etcd-member.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/etcd-member.service.d
           └─40-etcd-cluster.conf
   Active: activating (start) since Fri 2017-09-01 18:02:59 UTC; 1 weeks 4 days ago
     Docs: https://github.com/coreos/etcd
  Process: 827 ExecStartPre=/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid (code=exited, status=0/SUCCESS)
  Process: 788 ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos (code=exited, status=0/SUCCESS)
 Main PID: 1158 (etcd)
    Tasks: 12 (limit: 32768)
   Memory: 138.2M
      CPU: 11h 38min 33.574s
   CGroup: /system.slice/etcd-member.service
           └─1158 /usr/local/bin/etcd

Sep 13 15:49:16 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:16.814292 I | raft: 63b9e674aa4f3b73 is starting a new election at term 943519
Sep 13 15:49:16 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:16.814356 I | raft: 63b9e674aa4f3b73 became candidate at term 943520
Sep 13 15:49:16 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:16.814374 I | raft: 63b9e674aa4f3b73 received MsgVoteResp from 63b9e674aa4f3b73 at term 943520
Sep 13 15:49:16 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:16.814392 I | raft: 63b9e674aa4f3b73 [logterm: 1, index: 3] sent MsgVote request to b08cc664c4d49993 at term 943520
Sep 13 15:49:16 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:16.814409 I | raft: 63b9e674aa4f3b73 [logterm: 1, index: 3] sent MsgVote request to a3053e2b8b9083d2 at term 943520
Sep 13 15:49:18 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:18.014356 I | raft: 63b9e674aa4f3b73 is starting a new election at term 943520
Sep 13 15:49:18 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:18.014405 I | raft: 63b9e674aa4f3b73 became candidate at term 943521
Sep 13 15:49:18 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:18.014422 I | raft: 63b9e674aa4f3b73 received MsgVoteResp from 63b9e674aa4f3b73 at term 943521
Sep 13 15:49:18 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:18.014439 I | raft: 63b9e674aa4f3b73 [logterm: 1, index: 3] sent MsgVote request to a3053e2b8b9083d2 at term 943521
Sep 13 15:49:18 lvpal3021.pal.sap.corp etcd-wrapper[1158]: 2017-09-13 15:49:18.014457 I | raft: 63b9e674aa4f3b73 [logterm: 1, index: 3] sent MsgVote request to b08cc664c4d49993 at term 943521
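For what it's worth, the unit sitting in "activating (start)" for days and the raft term climbing into the hundreds of thousands, with vote requests sent to the other two member IDs but no responses received, suggests this member never hears back from its peers. Two things worth checking on each master (a sketch; the drop-in path comes from the status output above, the hostname is a placeholder, and the variable name assumes the usual etcd environment configuration):

cat /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf
nc -vz other-master.example.com 2380

The ETCD_INITIAL_CLUSTER entries in that drop-in should list all three members, with peer URLs that resolve and are reachable on port 2380 from every other member.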

rushins commented 7 years ago

Hi Rob,

Any updates based on the above information? I still have the issue.

rushins commented 7 years ago

Hello Rob, can you help with this? It is very critical, as we cannot deploy with 3 masters. Like I said, 1 master always works; I have tested 3 masters with 3 workers on many infrastructures and it is always the same result.

BR Rushi.

rushins commented 7 years ago

Hi Rob,

Can you provide some information, or check with someone on your team what the solution would be?

rushins commented 7 years ago

Hello,

Has anyone else experienced this?

haimari commented 6 years ago

I installed Tectonic 1.7.9 on AWS and see these errors on the newly created masters. The installation then hangs...

{"log":"Error: error waiting for etcd connection: timed out waiting for the condition\n","stream":"stderr","time":"2017-12-06T11:46:20.82365843Z"}
{"log":"Error: error waiting for etcd connection: timed out waiting for the condition\n","stream":"stderr","time":"2017-12-06T11:46:20.823685648Z"}