ash211 opened 7 years ago
Hmm. Maybe @varunkatta? I thought the driver watches executors and tries to figure out loss reasons.
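For context, here is a minimal sketch (not the actual Spark code) of how a driver can watch executor pods via the fabric8 Kubernetes client; the `spark-role=executor` label selector and the `println` reporting are assumptions for illustration:

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

object ExecutorPodWatch {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    // Assumed label selector for executor pods; the real selector may differ.
    client.pods().withLabel("spark-role", "executor").watch(new Watcher[Pod] {
      override def eventReceived(action: Action, pod: Pod): Unit = {
        // On MODIFIED/DELETED events the container status may carry a
        // terminated state with a reason such as "OOMKilled" (exit code 137).
        println(s"$action ${pod.getMetadata.getName} phase=${pod.getStatus.getPhase}")
      }
      override def onClose(cause: KubernetesClientException): Unit = {
        // If the watch connection drops, terminal events can be missed unless
        // the watcher reconnects and reconciles against a fresh pod list.
        println(s"watch closed: $cause")
      }
    })
  }
}
```

If the watch connection is down when the pod is killed, the terminal event never arrives, which would be consistent with the "missed K8s events" theory below.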
This should be reproducible by setting spark.kubernetes.executor.memoryOverhead=0 and running a shuffle.
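A minimal repro sketch, assuming a job whose shuffle pushes off-heap usage past the pod's memory limit (the dataset size and key count here are arbitrary illustrations):

```scala
import org.apache.spark.sql.SparkSession

// With memoryOverhead=0 the pod's memory limit equals the executor heap,
// leaving no headroom for off-heap allocations made during a shuffle.
val spark = SparkSession.builder()
  .appName("oomkilled-repro")
  .config("spark.kubernetes.executor.memoryOverhead", "0")
  .getOrCreate()

// groupByKey forces a full shuffle; sizing is an arbitrary illustration.
spark.sparkContext
  .parallelize(1 to 10000000)
  .map(i => (i % 1000, i))
  .groupByKey()
  .count()
```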
The driver couldn't learn the loss reason in this case (possibly because of missed K8s events?). I will try to repro this and share what we discover in the process.
See this in the driver logs:

and this in the `kubectl describe pod` output (redacted a bit):
The line in the driver logs that says "Executor lost for unknown reasons" should instead say something about the pod being OOMKilled.
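A sketch of what that mapping could look like, reading the terminated container state off the pod status via the fabric8 model classes (the helper name and message wording are hypothetical, not the project's actual code):

```scala
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// Hypothetical helper: derive a human-readable loss reason from the pod's
// container statuses instead of defaulting to "unknown reasons".
def executorLossReason(pod: Pod): String = {
  val terminated = pod.getStatus.getContainerStatuses.asScala
    .flatMap { s =>
      // state.terminated is set while the container stays dead;
      // lastState.terminated survives a container restart.
      Option(s.getState.getTerminated).orElse(Option(s.getLastState.getTerminated))
    }
    .headOption

  terminated match {
    case Some(t) if t.getReason == "OOMKilled" =>
      s"Executor pod was OOMKilled (exit code ${t.getExitCode}); " +
        "consider increasing spark.kubernetes.executor.memoryOverhead"
    case Some(t) =>
      s"Executor container terminated with exit code ${t.getExitCode} (reason: ${t.getReason})"
    case None =>
      "Executor lost for unknown reasons"
  }
}
```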