apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Submit fails with "Exec Failure" when Docker params are changed or the k8s API server restarts #428

Open duyanghao opened 7 years ago

duyanghao commented 7 years ago

The submit client exits with the following errors when I change Docker params (without restarting Docker) or restart the k8s API server:

2017-07-05T11:32:42.822851404Z 2017-07-05 11:32:42 WARN WatchConnectionManager:182 - Exec Failure
2017-07-05T11:32:42.822866179Z java.io.EOFException
2017-07-05T11:32:42.822869460Z at okio.RealBufferedSource.require(RealBufferedSource.java:59)
2017-07-05T11:32:42.822872518Z at okio.RealBufferedSource.readByte(RealBufferedSource.java:72)
2017-07-05T11:32:42.822875435Z at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
2017-07-05T11:32:42.822878101Z at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
2017-07-05T11:32:42.822880754Z at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
2017-07-05T11:32:42.822883385Z at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
2017-07-05T11:32:42.822895231Z at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
2017-07-05T11:32:42.822897880Z at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
2017-07-05T11:32:42.822900257Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-07-05T11:32:42.822902649Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-07-05T11:32:42.822905015Z at java.lang.Thread.run(Thread.java:745)
2017-07-05T11:32:42.825252889Z 2017-07-05 11:32:42 INFO WatchConnectionManager:352 - Current reconnect backoff is 1000 milliseconds (T0)
2017-07-05T11:32:43.856091750Z 2017-07-05 11:32:43 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
2017-07-05T11:32:43.856106549Z 
2017-07-05T11:32:43.856109837Z 
2017-07-05T11:32:43.856112236Z Container name: spark-kubernetes-driver
2017-07-05T11:32:43.856115104Z Container image: xxx
2017-07-05T11:32:43.856117597Z Container state: Running
2017-07-05T11:32:43.856119991Z Container started at: 2017-07-05T01:55:25Z
2017-07-05T11:32:43.856586296Z 2017-07-05 11:32:43 INFO Client:54 - Application xxx finished.

Addition: the above operations (changing Docker params or restarting the k8s API server) have no effect on the driver or the executors, which keep running. Only the submit client exits.
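To illustrate the mismatch in the log above (the client reports "finished" while the driver container is still `Running`), here is a minimal sketch of the decision a watcher could make when its connection closes. It is not the project's code: `next_action` and its return values are hypothetical names, and the only fact taken from Kubernetes is that `Succeeded` and `Failed` are the terminal pod phases.

```python
# Terminal pod phases as defined by the Kubernetes API.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def next_action(watch_closed: bool, last_phase: str) -> str:
    """Hypothetical helper: decide what the submit client should do next.

    watch_closed: True if the watch connection ended (for example with an EOF).
    last_phase:   the last pod phase the watcher observed for the driver.
    """
    if last_phase in TERMINAL_PHASES:
        # The driver really is done, so reporting "finished" is correct.
        return "finished"
    if watch_closed:
        # The connection dropped while the driver was still active.
        return "reconnect"
    return "keep_watching"


# The situation from the log: the watch closed while the driver was Running.
print(next_action(watch_closed=True, last_phase="Running"))
```

Under these assumptions, a closed watch on a `Running` driver would lead to a reconnect attempt rather than to "Application finished".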

duyanghao commented 7 years ago

@erikerlandson, do you have any suggestions?

erikerlandson commented 7 years ago

It makes sense to me that restarting the kube api server could cause the watcher to fail, since the watcher would lose connection to the cluster. Can you explain what you mean about changing docker params?

duyanghao commented 7 years ago

@erikerlandson Changing Docker params means changing some parameters in the /etc/sysconfig/docker file (without restarting Docker). I do think it would be more robust if the watcher tried to reconnect.

duyanghao commented 7 years ago

@erikerlandson It may not be related to the Docker params change after all; the kubelet aborting may be the cause. Either way, I still recommend that the watcher reconnect.

erikerlandson commented 7 years ago

@duyanghao I think if it's possible to make the watcher connections robust across restarts it would be desirable. @foxish, do you have any insights on this one?
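As a rough illustration of what "robust across restarts" could look like, the sketch below retries a lost watch with exponential backoff. It is deliberately library-neutral and is not the fabric8 client or Spark implementation: `open_watch`, `watch_with_reconnect`, and `backoff_delays` are hypothetical names, and the 1000 ms starting delay is only borrowed from the "Current reconnect backoff is 1000 milliseconds" line in the log.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial_ms: int = 1000, factor: int = 2,
                   max_ms: int = 32000) -> Iterator[int]:
    """Yield exponentially growing delays in milliseconds, capped at max_ms."""
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * factor, max_ms)


def watch_with_reconnect(open_watch: Callable[[], str],
                         max_attempts: int = 5,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Run a watch, reopening it if the connection is lost.

    open_watch is a hypothetical callable that blocks until the watch ends.
    It returns the final pod phase on normal completion and raises
    ConnectionError or EOFError if the connection drops.
    """
    delays = backoff_delays()
    for attempt in range(max_attempts):
        try:
            return open_watch()
        except (ConnectionError, EOFError):
            if attempt == max_attempts - 1:
                raise  # Give up after the last allowed attempt.
            sleep(next(delays) / 1000.0)  # Wait before reconnecting.
    raise RuntimeError("max_attempts must be at least 1")
```

A usage example with a fake watch that fails twice (as it might while the API server restarts) and then completes:

```python
attempts = []

def flaky_watch() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("API server restarting")
    return "Succeeded"

print(watch_with_reconnect(flaky_watch, sleep=lambda seconds: None))
```

A real watcher would also need to resume from the last seen resource version so that events emitted during the outage are not missed; that part is left out of this sketch.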

duyanghao commented 7 years ago

@erikerlandson @foxish Taking a look at issue 465, maybe we can solve these problems together.