SeldonIO / seldon-server

Machine Learning Platform and Recommendation Engine built on Kubernetes
https://www.seldon.io/
Apache License 2.0

zookeeper cluster is not ready when deploying on kubernetes #56

Closed: yu2003w closed this issue 6 years ago

yu2003w commented 7 years ago

Hi, when I tried to set up Seldon on a k8s cluster, it seemed that the zookeeper cluster was not running as expected. I got errors like the ones below:

2017-10-20 17:47:32,812 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-2 to address: zookeeper-2/172.30.123.16
2017-10-20 17:47:35,819 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@400] - Cannot open channel to 3 at election address zookeeper-3/172.30.134.85:3888
java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:381)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:426)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:822)
2017-10-20 17:47:35,822 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-3 to address: zookeeper-3/172.30.134.85
2017-10-20 17:47:35,823 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000

It seems that the address settings are not correct. How should I fix this?

Thx, Jared
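For reference, the ports in the log above come from ZooKeeper's quorum configuration: each member lists its peers as server.N entries in zoo.cfg, where 2888 is the quorum port and 3888 is the leader-election port, so the "Cannot open channel to 3 at election address ...:3888" warning means peer 3's election port is unreachable. A minimal sketch of how to inspect that configuration is below; the pod-name placeholder and the config path are illustrative and may differ in the zookeeper image Seldon deploys.

    # Inspect the quorum configuration of one ensemble member.
    # <zookeeper-1-pod> is a placeholder for the real pod name from "kubectl get pods";
    # the config path (/conf/zoo.cfg here) varies between zookeeper images.
    kubectl exec <zookeeper-1-pod> -- cat /conf/zoo.cfg

    # A three-member ensemble is normally declared with entries such as:
    #   server.1=zookeeper-1:2888:3888
    #   server.2=zookeeper-2:2888:3888
    #   server.3=zookeeper-3:2888:3888
    # so every peer must be able to reach the others on both 2888 and 3888,
    # in addition to the client port 2181.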

ukclivecox commented 7 years ago

Hi, can you provide some more details and logs:

Have you done a "kubectl get all" to check if the zookeeper nodes are up and ready?
Which pod is the above error from?
Did you start seldon with seldon-up?
What is the size of your kubernetes cluster?
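The first of those checks could look roughly like the following; the exact resource names depend on how seldon-up created them, and the pod-name placeholder is illustrative.

    # Pod status plus the node and pod IP each zookeeper replica landed on
    kubectl get pods -o wide | grep zookeeper

    # Services fronting the ensemble members
    kubectl get svc | grep zookeeper

    # Recent log output of one member; replace the placeholder with a real pod name
    kubectl logs <zookeeper-pod> --tail=50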

yu2003w commented 7 years ago

Yes, the three zookeeper containers are running well. When the three zookeeper pods are scheduled to different nodes, I find these errors in each container's logs. Containers scheduled to the same machine can find each other.

I have 3 masters and 4 worker nodes.

zookeeper1-467704625-wvl6x    1/1   Running   0   8m   10.130.2.32   host-10-1-241-56
zookeeper2-1006738229-tm7sr   1/1   Running   0   8m   10.129.2.40   host-10-1-130-29
zookeeper3-1545771833-n9pmt   1/1   Running   0   8m   10.130.2.31   host-10-1-241-56
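One thing this placement suggests is that zookeeper1 and zookeeper3 share host-10-1-241-56 while zookeeper2 sits on host-10-1-130-29, so probing the pod IPs directly (bypassing service DNS) can help separate a pod-network problem from a DNS problem. A rough sketch; ping may be missing from the zookeeper image, and the busybox debug pod and its name are assumptions, not part of the Seldon setup.

    # If the zookeeper image contains ping, test the cross-node pod IP directly
    # from the pod on host-10-1-241-56 (10.129.2.40 is zookeeper2's pod IP above):
    kubectl exec zookeeper1-467704625-wvl6x -- ping -c 3 10.129.2.40

    # Otherwise a throwaway busybox pod can probe the same IP (it may land on any
    # node, so repeat from different nodes if needed; name and image are illustrative):
    kubectl run netcheck --rm -it --restart=Never --image=busybox -- ping -c 3 10.129.2.40

If the pod IP is unreachable across nodes but reachable from pods on the same node, the overlay/pod network rather than DNS is the likely culprit.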

ukclivecox commented 7 years ago

If you are running multi-node then you will need some form of persistent storage: see http://docs.seldon.io/install.html#storage. However, the error you show looks more like a DNS or network error. Can you exec into the failing pod and check whether you can connect to the zookeeper-3 host? Have you also checked that this error is fatal and has not been recovered from? Also, which pod is failing? Seldon-server?
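A concrete form of that exec-and-connect check might look like the following; nslookup and nc may not be present in the zookeeper image (a debug pod or installing them would then be needed), and the pod-name placeholder is illustrative.

    # Does the zookeeper-3 service name resolve from inside the failing pod?
    kubectl exec <failing-zookeeper-pod> -- nslookup zookeeper-3

    # Are the quorum (2888) and leader-election (3888) ports reachable?
    kubectl exec <failing-zookeeper-pod> -- nc -vz zookeeper-3 2888
    kubectl exec <failing-zookeeper-pod> -- nc -vz zookeeper-3 3888

    # Is the peer itself healthy on the client port? "ruok" should answer "imok".
    kubectl exec <failing-zookeeper-pod> -- sh -c 'echo ruok | nc zookeeper-3 2181'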

yu2003w commented 7 years ago

I failed to "curl -kv" services on some nodes, so it seems to be an environment problem in my cluster.

yu2003w commented 6 years ago

This was an issue with my environment. It seems that the OVS (Open vSwitch) of the PaaS conflicts with that of the IaaS. Thanks for the help.