Add Recovery Logic for Failed Pod

Signed-off-by: duyanghao <1294057873@qq.com>

What changes were proposed in this pull request?

Add recovery logic for failed executor pods and fix the MEM_EXCEEDED_EXIT_CODE constant.

How was this patch tested?

Manual tests show that a failed pod is recovered successfully, as follows:

Make one executor pod fail (the executor fails to register itself).
The driver discovers the failed pod.
The driver allocates a new executor pod.

spark.executor.instances=5

# kubectl get pods -n=xxx -a -o wide|grep spark-debug-sar-test8
spark-debug-sar-test8           1/1       Completed   0          3m        192.168.25.92    x.x.x.x
spark-debug-sar-test8-exec-1    1/1       Completed   0          3m        192.168.25.94    x.x.x.x
spark-debug-sar-test8-exec-2    1/1       Completed   0          3m        192.168.25.93    x.x.x.x
spark-debug-sar-test8-exec-3    0/1       Error       0          3m        192.168.11.31    x.x.x.x
spark-debug-sar-test8-exec-4    0/1       Error       0          3m        192.168.11.37    x.x.x.x
spark-debug-sar-test8-exec-5    0/1       Error       0          3m        192.168.11.44    x.x.x.x
spark-debug-sar-test8-exec-6    1/1       Completed   0          48s       192.168.25.99    x.x.x.x
spark-debug-sar-test8-exec-7    1/1       Completed   0          48s       192.168.25.95    x.x.x.x
spark-debug-sar-test8-exec-8    1/1       Completed   0          48s       192.168.25.97    x.x.x.x
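
The listing above (exec-3 to exec-5 end in Error, exec-6 to exec-8 are started later) matches a simple reconciliation loop, roughly sketched below in Python. This is only a toy model for illustration, not the Scala code in this patch: the class name `ExecutorAllocator` and its methods are hypothetical, and it assumes that executor IDs are never reused and that the driver refills up to spark.executor.instances.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorAllocator:
    """Toy model of a driver-side allocator (hypothetical, not the PR's code)."""
    target: int                                   # plays the role of spark.executor.instances
    next_id: int = 1                              # assumption: executor IDs are never reused
    running: set = field(default_factory=set)
    failed: set = field(default_factory=set)

    def mark_failed(self, executor_id: int) -> None:
        # The driver discovers that a pod has failed.
        if executor_id in self.running:
            self.running.remove(executor_id)
            self.failed.add(executor_id)

    def reconcile(self) -> list:
        # The driver allocates new pods until the running count reaches the target.
        created = []
        while len(self.running) < self.target:
            self.running.add(self.next_id)
            created.append(self.next_id)
            self.next_id += 1
        return created


allocator = ExecutorAllocator(target=5)
initial = allocator.reconcile()        # models exec-1 .. exec-5
for executor_id in (3, 4, 5):          # the three pods shown as Error
    allocator.mark_failed(executor_id)
replacements = allocator.reconcile()   # models exec-6 .. exec-8
```

In this model the failed IDs stay recorded in `failed`, which mirrors how the Error pods remain visible in the kubectl output next to their replacements.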
