apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Unsupported RPCMessage and then not able to spin up worker #586

Closed leletan closed 6 years ago

leletan commented 6 years ago

I was trying to run the job on minikube v0.22.3 within VirtualBox on macOS, simulating Kubernetes v1.7.5. The master was successfully spun up, but the worker was not.

Looked into the driver log and saw the following error message:

```
2017-12-26 09:09:44 ERROR Inbox:91 - Ignoring error
org.apache.spark.SparkException: Unsupported message RpcMessage(172.17.0.10:45720,RetrieveSparkAppConfig(1),org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@39e8f5ab) from 172.17.0.10:45720
  at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:106)
  at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
  at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:155)
  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
  at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
  at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
  at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
  at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
```
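For context: RetrieveSparkAppConfig appears to be a message that this fork's Kubernetes executors send to the driver on startup, so a driver whose scheduler backend was not built with the Kubernetes back-end has no handler for it. Below is a minimal, self-contained Scala sketch (not actual Spark source; every name in it is illustrative) of the mechanism in the trace above: receiveAndReply is a PartialFunction, and any message it does not match falls through to a catch-all that throws "Unsupported message":

```scala
sealed trait RpcMsg
case object RegisterExecutor extends RpcMsg       // a message a stock driver handles
case object RetrieveSparkAppConfig extends RpcMsg // handled only by the k8s back-end

object UnsupportedMessageSketch {
  // A driver endpoint built without the Kubernetes back-end matches only
  // the vanilla messages; there is no case for RetrieveSparkAppConfig.
  val receiveAndReply: PartialFunction[RpcMsg, String] = {
    case RegisterExecutor => "registered"
  }

  // Unmatched messages fall through to the default and raise an error,
  // mirroring the "Unsupported message RpcMessage(...)" seen in the log.
  def dispatch(msg: RpcMsg): String =
    receiveAndReply.applyOrElse(
      msg,
      (m: RpcMsg) => throw new RuntimeException(s"Unsupported message $m"))

  def main(args: Array[String]): Unit = {
    println(dispatch(RegisterExecutor))            // prints: registered
    try dispatch(RetrieveSparkAppConfig)
    catch { case e: RuntimeException => println(e.getMessage) }
  }
}
```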

After this error message, things seem to go back to normal:

```
2017-12-26 09:10:10 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
2017-12-26 09:10:10 INFO SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
2017-12-26 09:10:10 INFO SharedState:54 - Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c1fca2a{/SQL,null,AVAILABLE,@Spark}
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c447c76{/SQL/json,null,AVAILABLE,@Spark}
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6107165{/SQL/execution,null,AVAILABLE,@Spark}
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@11ebb1b6{/SQL/execution/json,null,AVAILABLE,@Spark}
```

However, later, when worker tasks are launched, there are warnings in the log (as follows) indicating that there are not enough resources in the cluster, which is not true:

```
2017-12-26 09:10:29 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-12-26 09:10:44 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-12-26 09:10:59 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
```

leletan commented 6 years ago

I ran into this issue a couple of times. I tried deleting minikube and reinstalling it several times. Only once did I not run into this issue and was thus able to run the Spark job successfully.

leletan commented 6 years ago

It seems to be a VM-related issue. Upgraded my VirtualBox and the issue is gone. Closing.

leletan commented 6 years ago

This time I'm seeing this on 1.8.5-gke.0 as well. Any ideas?

leletan commented 6 years ago

This was due to a conflict between the Spark distribution bundled in my fat jar and the one in the base image; shading the Spark dependencies in the fat jar works. Closing
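For anyone hitting the same thing, here is a hypothetical sbt-assembly sketch of the fix described above (plugin usage and version numbers are assumptions, not from this thread). Either relocate ("shade") the Spark classes that end up in the fat jar so they cannot collide with the Spark distribution baked into the image, or keep Spark out of the fat jar entirely with "provided" scope:

```scala
// build.sbt — hypothetical sketch; assumes the sbt-assembly plugin is enabled.

// Option A: rename the Spark classes bundled in the fat jar (and all
// references to them) so they cannot clash with the base image's Spark.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.spark.**" -> "shaded.org.apache.spark.@1").inAll
)

// Option B (a common alternative): don't bundle Spark at all, since the
// driver/executor image already ships its own distribution.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
```

With Option B there is nothing left in the jar to conflict, at the cost of being pinned to whatever Spark version the image provides.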