apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

about the "RetrieveSparkAppConfig" msg #567

Closed pierric closed 6 years ago

pierric commented 6 years ago

I often find that the executor pod reports that the connection to the driver is not active and that it will shut down in 120s. Looking into the driver pod's log, I find that the driver is waiting for the "RegisterExecutor" msg from the executor. That sounds very odd: the driver and the executor are waiting for each other!

I traced through the code and found something that does not look right, but I am not really sure. Could someone take a look?

I didn't enable the shuffle service, and I guess that is what leads to my strange issue.

Shouldn't a configuration always be constructed and sent back?

foxish commented 6 years ago

I don't recall off the top of my head what the intended behavior was with RetrieveSparkAppConfig. Can you detail a case where this happens? Maybe post the spark-submit command line you're using as well? We have tested runs that last 10+ minutes and haven't seen that issue with the applications.
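For reference, a submission against this fork's Kubernetes backend usually looks something like the sketch below. The API server address, namespace, image names, and jar path are placeholders, and the exact `spark.kubernetes.*` property names may differ between fork releases, so adjust to your version.

```
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<api-server-host>:<port> \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.driver.docker.image=<driver-image> \
  --conf spark.kubernetes.executor.docker.image=<executor-image> \
  local:///path/to/spark-examples.jar
```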

pierric commented 6 years ago

OK, I will get some information about my k8s setup and the submit command on Friday; I don't have access to them right now.

liyinan926 commented 6 years ago

The reply to RetrieveSparkAppConfig falls through to orElse(super.receiveAndReply(context)) when shuffleManager.isDefined is false, so a reply is always expected.
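A minimal, self-contained sketch of that PartialFunction chaining (illustrative only; the message and handler names below are simplified stand-ins, not the actual scheduler backend code):

```scala
// Sketch of how an orElse fallback guarantees a reply even when the
// shuffle-service-specific case is not defined.
object OrElseSketch {
  sealed trait Msg
  case object RetrieveSparkAppConfig extends Msg

  def main(args: Array[String]): Unit = {
    val shuffleManager: Option[String] = None // shuffle service disabled

    // Kubernetes-specific handler: only defined when the shuffle manager exists.
    val k8sHandler: PartialFunction[Msg, String] = {
      case RetrieveSparkAppConfig if shuffleManager.isDefined =>
        "config with shuffle-service properties"
    }

    // Base handler (analogous to super.receiveAndReply): always defined.
    val baseHandler: PartialFunction[Msg, String] = {
      case RetrieveSparkAppConfig =>
        "default config"
    }

    // orElse falls through to the base handler, so a reply is always produced.
    val receiveAndReply = k8sHandler.orElse(baseHandler)
    println(receiveAndReply(RetrieveSparkAppConfig)) // prints "default config"
  }
}
```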

pierric commented 6 years ago

Hmm, that's right. I missed that orElse part. Now I see the default handling of RetrieveSparkAppConfig. I will need to debug my issue a little bit more.