apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

job failed after shuffle pod restart #606

Open ChenLingPeng opened 6 years ago

ChenLingPeng commented 6 years ago

How to reproduce

  1. Submit a job (e.g. PageRank) that uses the external shuffle service
  2. Once the executors are running, stop an external-shuffle-service pod on one of the executors' hosts
  3. The external-shuffle-service pod restarts with a new pod IP
  4. The driver exits with a failed status
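The restart in steps 2–3 can be sketched with `kubectl`; the label selector and pod name below are assumptions, adjust them to how the shuffle-service DaemonSet is actually deployed in your cluster:

```shell
# Find the shuffle-service pod running on the node that hosts an executor
# (label "app=spark-shuffle-service" is hypothetical).
kubectl get pods -o wide -l app=spark-shuffle-service

# Delete that pod; the DaemonSet controller recreates it with a NEW pod IP.
kubectl delete pod <shuffle-pod-on-executor-node>

# Watch the replacement come up and note that its IP has changed.
kubectl get pods -o wide -l app=spark-shuffle-service -w
```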

The driver/executor logs show that executors keep trying to fetch blocks using the old shuffle pod IP.

liyinan926 commented 6 years ago

Yes, we use the shuffle pod IP to identify the shuffle pod and set spark.shuffle.service.host to that IP. So it seems shuffle pods need a sticky network identity.
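For illustration, one standard Kubernetes way to get a sticky network identity is a headless Service plus a StatefulSet, which gives each pod a stable DNS name that survives restarts; this is only a sketch of that pattern (all names and the image are hypothetical), not how this repository's shuffle service is actually deployed:

```yaml
# Headless Service: DNS resolves to individual pods, so each StatefulSet pod
# gets a stable name like spark-shuffle-0.spark-shuffle-svc.<ns>.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: spark-shuffle-svc
spec:
  clusterIP: None
  selector:
    app: spark-shuffle
  ports:
  - port: 7337               # default spark.shuffle.service.port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spark-shuffle
spec:
  serviceName: spark-shuffle-svc
  replicas: 1
  selector:
    matchLabels:
      app: spark-shuffle
  template:
    metadata:
      labels:
        app: spark-shuffle
    spec:
      containers:
      - name: shuffle
        image: spark-shuffle-service:latest   # hypothetical image
        ports:
        - containerPort: 7337
```

With a stable DNS name, spark.shuffle.service.host could point at a hostname that survives pod restarts instead of a pod IP that does not.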

weixiuli commented 6 years ago

Could we avoid this issue by using hostNetwork?
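With hostNetwork, the pod shares the node's network namespace, so the shuffle service listens on the node IP, which does not change when the pod restarts. A minimal sketch of that idea, assuming the shuffle service runs as a DaemonSet (the DaemonSet name and image are hypothetical; the fields themselves are standard Kubernetes):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spark-shuffle-service
spec:
  selector:
    matchLabels:
      app: spark-shuffle
  template:
    metadata:
      labels:
        app: spark-shuffle
    spec:
      hostNetwork: true                       # share the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet      # keep cluster DNS working with hostNetwork
      containers:
      - name: shuffle
        image: spark-shuffle-service:latest   # hypothetical image
        ports:
        - containerPort: 7337                 # bound directly on the node's IP
```

The trade-off is that the port is claimed on every node, so it must not collide with other host-network services.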