SeldonIO / seldon-server

Machine Learning Platform and Recommendation Engine built on Kubernetes
https://www.seldon.io/
Apache License 2.0
1.47k stars 300 forks source link

Error in reuters-example in the Serve Recommendations part #21

Closed ghost closed 8 years ago

ghost commented 8 years ago

Hello, I am trying to run reuters-example following this http://docs.seldon.io/content-recommendation-example.html but when I execute run_recommendation_microservice.sh reuters-example seldonio/reuters-example 1.0 reuters I have this error:

deployment "reuters-example" configured service "reuters-example" configured WARNING:kazoo.client:Connection dropped: socket connection error: None WARNING:kazoo.client:Connection dropped: socket connection error: None connecting to zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181Traceback (most recent call last): File "/opt/conda/bin/seldon-cli", line 4, in <module> __import__('pkg_resources').run_script('seldon==2.0.1', 'seldon-cli') File "/opt/conda/lib/python2.7/site-packages/setuptools-18.5-py2.7.egg/pkg_resources/__init__.py", line 742, in run_script File "/opt/conda/lib/python2.7/site-packages/setuptools-18.5-py2.7.egg/pkg_resources/__init__.py", line 1667, in run_script File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.1-py2.7.egg/EGG-INFO/scripts/seldon-cli", line 5, in <module> seldon.cli.start_seldoncli() File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.1-py2.7.egg/seldon/cli/__init__.py", line 3, in start_seldoncli cli_main.main() File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.1-py2.7.egg/seldon/cli/cli_main.py", line 346, in main start_zk_client(opts) File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.1-py2.7.egg/seldon/cli/cli_main.py", line 301, in start_zk_client gdata["zk_client"].start() File "/opt/conda/lib/python2.7/site-packages/kazoo/client.py", line 546, in start raise self.handler.timeout_exception("Connection time-out") kazoo.handlers.threading.KazooTimeoutError: Connection time-out error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1

PS: I have installed seldon in a cloud kubernetes cluster and i have my DNS pod which is working well.

Here are my running pods: default influxdb-grafana-7mpow 2/2 Running 0 2h kubernetes-2 default kafka-controller-q5n79 1/1 Running 14 2h kubernetes-2 default memcached1-0mm1t 1/1 Running 0 2h kubernetes-2 default memcached2-xy165 1/1 Running 0 2h kubernetes-2 default reuters-example-1808766350-tq0ic 1/1 Running 0 2h kubernetes-2 default seldon-control 1/1 Running 0 2h kubernetes-3 default seldon-server-1847842434-nyksb 2/2 Running 35 2h kubernetes-2 default spark-master-controller-xuqfw 1/1 Running 0 19h kubernetes-3 default spark-worker-controller-f2aum 1/1 Running 0 1d kubernetes-3 default spark-worker-controller-l131c 1/1 Running 0 19h kubernetes-3 my-system elasticsearch-logging-v1-d13tc 1/1 Running 0 1d kubernetes-3 my-system fluentd-elasticsearch-kubernetes-2 1/1 Running 0 2h kubernetes-2 my-system fluentd-elasticsearch-kubernetes-3 1/1 Terminating 3 1d kubernetes-3 my-system heapster-f9lsf 2/2 Running 0 1d kubernetes-3 my-system infludb-grafana-m3fqi 2/2 Running 0 1d kubernetes-3 my-system kibana-logging-v1-7c8qr 1/1 Running 2 1d kubernetes-3 my-system kube-apiserver-kubernetes-1 1/1 Running 0 2h kubernetes-1 my-system kube-controller-manager-kubernetes-1 1/1 Running 55 2h kubernetes-1 my-system kube-dns-v8-fdk73 5/5 Running 0 1d kubernetes-3 my-system kube-scheduler-kubernetes-1 1/1 Running 44 2h kubernetes-1 my-system scheduler-master-kubernetes-1 2/2 Running 15 2h kubernetes-1 Could please help me or even give me an idea to explore. Thank you in advance

Nadia

ukclivecox commented 8 years ago

Hi Nadia,

I don't see zookeeper pods running and that is the error above.

Can you check it can be found by doing an nslookup inside seldon-contro pod, e.g. by doing:

kubectl exec -ti seldon-control -- /bin/bash

root@seldon-control:/home/seldon# nslookup zookeeper-1 Server: 10.0.0.10 Address: 10.0.0.10#53

Name: zookeeper-1.default.svc.cluster.local Address: 10.0.0.241

ghost commented 8 years ago

Here the result for nslookup command: `Server: 10.10.10.10 Address: 10.10.10.10#53

Name: zookeeper-1.default.svc....net Address: 10.10.10.213 `

ghost commented 8 years ago

Thank you I have created zookeepers pods and now it's working

ukclivecox commented 8 years ago

Nadia, That is good. seldon-up.sh should create the zookeeper ppds by default. Is there any reason why they were not running?

ghost commented 8 years ago

I think it's me how have deleted them but forgotten to relaunch the script. But i have an other issue which is concerning my seldon-server pod. It is always restarting because of CrashLoopBackOff status. Here is the desciption of the pod: `Events: FirstSeen LastSeen Count From SubobjectPath Type Reason Message


2h 3m 524 {kubelet kubernetes-2} spec.containers{seldon-server} Warning BackOff Back-off restarting failed docker container 2h 3m 509 {kubelet kubernetes-2} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "seldon-server" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=seldon-server pod=seldon-server-1847842434-nyksb_default(0f531258-3d28-11e6-8f46-00505601b183)"

19m 2m 5 {kubelet kubernetes-2} spec.containers{seldon-server} Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 404 3h 1m 47 {kubelet kubernetes-2} spec.containers{seldon-server} Normal Pulled Container image "seldonio/seldon-server:1.3.4" already present on machine 2h 1m 38 {kubelet kubernetes-2} spec.containers{seldon-server} Normal Killing (events with common reason combined) 2h 1m 40 {kubelet kubernetes-2} spec.containers{seldon-server} Normal Created (events with common reason combined) 2h 1m 40 {kubelet kubernetes-2} spec.containers{seldon-server} Normal Started (events with common reason combined) ` Do you have any idea?

ukclivecox commented 8 years ago

It could be a memory issue. How much memory do you have on your cluster nodes? Also, can you show us the logs, e.g. (your server will have different id below):

kubectl logs seldon-server-1847842434-dcqns seldon-server

ghost commented 8 years ago

Here is the logs: INFO: validateJarFile(/usr/local/tomcat/webapps/ROOT/WEB-INF/lib/servlet-api-2.5.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class Jun 28, 2016 3:13:44 PM org.apache.catalina.startup.TldConfig execute INFO: At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/seldon/ROOT/WEB-INF/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/home/seldon/ROOT/WEB-INF/lib/slf4j-log4j12-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. Jun 28, 2016 3:13:46 PM org.apache.catalina.core.StandardContext startInternal SEVERE: One or more listeners failed to start. Full details will be found in the appropriate container log file Jun 28, 2016 3:13:46 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [] startup failed due to previous errors Jun 28, 2016 3:13:46 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads SEVERE: The web application [] appears to have started a thread named [localhost-startStop-1-SendThread(10.10.10.213:2181)] but has failed to stop it. This is very likely to create a memory leak. Jun 28, 2016 3:13:46 PM org.apache.catalina.startup.HostConfig deployDirectory INFO: Deployment of web application directory /usr/local/tomcat/webapps/ROOT has finished in 9,573 ms Jun 28, 2016 3:13:46 PM org.apache.coyote.AbstractProtocol start INFO: Starting ProtocolHandler ["http-bio-8080"] Jun 28, 2016 3:13:46 PM org.apache.coyote.AbstractProtocol start INFO: Starting ProtocolHandler ["ajp-bio-8009"] Jun 28, 2016 3:13:46 PM org.apache.catalina.startup.Catalina start INFO: Server startup in 10941 ms Jun 28, 2016 3:13:46 PM org.apache.catalina.loader.WebappClassLoaderBase loadClass INFO: Illegal access: this web application instance has been stopped already. Could not load org.apache.zookeeper.server.ZooTrace. The eventual following stack trace is caused by an error thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access, and has no functional impact. java.lang.IllegalStateException at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1743) at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1115)

For the memory I have 4Go of RAM for each node.

ukclivecox commented 8 years ago

Can you check for any ERRORS in the seldon-server api.log before the pod is restarted, e.g.

kubectl exec -ti seldon-server-1847842434-dcqns -- /bin/bash

<dcqns:/usr/local/tomcat# cd /home/seldon/logs/base

<4-dcqns:/home/seldon/logs/base# grep ERROR api.log

(or show all lines or last few of log if too long)

ghost commented 8 years ago

Here is what I get: rg.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'actionHistoryProvider': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: io.seldon.api.caching.MemcacheActionHistory io.seldon.api.caching.ActionHistoryProvider.defaultProvider; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'memcacheActionHistory': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: io.seldon.memcache.ExceptionSwallowingMemcachedClient io.seldon.api.caching.MemcacheActionHistory.cacheClient; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'exceptionSwallowingMemcachedClient' defined in file [/home/seldon/ROOT/WEB-INF/classes/io/seldon/memcache/ExceptionSwallowingMemcachedClient.class]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [io.seldon.memcache.ExceptionSwallowingMemcachedClient]: Constructor threw exception; nested exception is java.lang.Exception: *Warning* Memcached NOT initialized! Caused by: org.springframework.beans.factory.BeanCreationException: Could not autowire field: io.seldon.api.caching.MemcacheActionHistory io.seldon.api.caching.ActionHistoryProvider.defaultProvider; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'memcacheActionHistory': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: io.seldon.memcache.ExceptionSwallowingMemcachedClient io.seldon.api.caching.MemcacheActionHistory.cacheClient; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'exceptionSwallowingMemcachedClient' defined in file [/home/seldon/ROOT/WEB-INF/classes/io/seldon/memcache/ExceptionSwallowingMemcachedClient.class]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [io.seldon.memcache.ExceptionSwallowingMemcachedClient]: Constructor threw exception; nested exception is java.lang.Exception: *Warning* Memcached NOT initialized! Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'memcacheActionHistory': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: io.seldon.memcache.ExceptionSwallowingMemcachedClient io.seldon.api.caching.MemcacheActionHistory.cacheClient; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'exceptionSwallowingMemcachedClient' defined in file [/home/seldon/ROOT/WEB-INF/classes/io/seldon/memcache/ExceptionSwallowingMemcachedClient.class]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [io.seldon.memcache.ExceptionSwallowingMemcachedClient]: Constructor threw exception; nested exception is java.lang.Exception: *Warning* Memcached NOT initialized! Caused by: org.springframework.beans.factory.BeanCreationException: Could not autowire field: io.seldon.memcache.ExceptionSwallowingMemcachedClient io.seldon.api.caching.MemcacheActionHistory.cacheClient; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'exceptionSwallowingMemcachedClient' defined in file [/home/seldon/ROOT/WEB-INF/classes/io/seldon/memcache/ExceptionSwallowingMemcachedClient.class]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [io.seldon.memcache.ExceptionSwallowingMemcachedClient]: Constructor threw exception; nested exception is java.lang.Exception: *Warning* Memcached NOT initialized! Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'exceptionSwallowingMemcachedClient' defined in file [/home/seldon/ROOT/WEB-INF/classes/io/seldon/memcache/ExceptionSwallowingMemcachedClient.class]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [io.seldon.memcache.ExceptionSwallowingMemcachedClient]: Constructor threw exception; nested exception is java.lang.Exception: *Warning* Memcached NOT initialized!

ukclivecox commented 8 years ago

It again looks like the seldon-server pod can't contact zookeeper services. Try to contact zookeeper from inside the seldon-server pod, it should look like:

/home/seldon/logs/base# telnet zookeeper-1 2181
Trying 10.0.0.241... Connected to zookeeper-1.default.svc.cluster.local. Escape character is '^]'.

ghost commented 8 years ago

No it is not zookeeper connection problem. I have got the good result: Trying 10.10.10.213... Connected to zookeeper-1.default.svc.....net. Escape character is '^]'.

ukclivecox commented 8 years ago

Strange as "org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss" imples it can't connect to Zookeeper.

Can you attach the whole log if possible?

ghost commented 8 years ago

He is the whole logs logs.txt wholeApiLogs.txt

ukclivecox commented 8 years ago

ok - that error is because the memcache pods are not running.

It sounds like you are having problems starting all the required pods on seldon-up.sh

ghost commented 8 years ago

But they are running my memcache pods: influxdb-grafana-7mpow 2/2 Running 0 4h kafka-controller-q5n79 1/1 Running 18 4h memcached1-0mm1t 1/1 Running 0 4h memcached2-xy165 1/1 Running 0 4h mysql 1/1 Running 0 1h reuters-example-1808766350-tq0ic 1/1 Running 0 4h seldon-control 1/1 Running 0 4h seldon-server-1847842434-nyksb 2/2 Running 62 4h spark-master-controller-xuqfw 1/1 Running 0 21h spark-worker-controller-f2aum 1/1 Running 0 1d spark-worker-controller-l131c 1/1 Running 0 21h td-agent-server 1/1 Running 0 1h zookeeper-1 1/1 Running 0 1h zookeeper-2 1/1 Running 0 1h zookeeper-3 1/1 Running 0 1h

ukclivecox commented 8 years ago

That is strange. Can you connect to them from the seldon-server pod?

kubectl exec -ti seldon-server-1847842434-zbzlb -- /bin/bash <34-zbzlb:/usr/local/tomcat# telnet memcached1 11211
Trying 10.0.0.229... Connected to memcached1.default.svc.cluster.local. Escape character is '^]'.

ghost commented 8 years ago

The same result Trying 10.10.10.152... Connected to memcached1.default.svc....net. Escape character is '^]'.

ukclivecox commented 8 years ago

I think its because the zookeeper servers were not running so when you restarted them they have not been updated with the correct state information. There are 2 options:

  1. do a seldon-down.sh and then a seldon-up.sh t reset the system or
  2. try seldon-cli client --action zk_push and then restart the seldon-server pod
ghost commented 8 years ago

Hi cliveseldon, I think what you have suggested has corrected the problem. Thank you very much!