Closed Yancey1989 closed 7 years ago
Urgent issue!
We can have a workaround: periodically check the job status and mark the job as failed if too many Pods are failing. I'll try to add this feature today.
Thanks for @emailweixu's suggestion. Maybe we can add a function before https://github.com/PaddlePaddle/cloud/blob/develop/docker/paddle_k8s#L28: when the failure count exceeds the threshold, return 0 and write a message into /dev/termination-log, so that the paddlecloud command line can fetch the error message.
Will test and fix~
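A minimal sketch of the proposed check (function and variable names are illustrative, not from `paddle_k8s`). In a real Pod the failure count would come from the Kubernetes API (the Job's `status.failed` field); here a parameter stands in for it, and the termination-log path is overridable so the sketch runs outside a container:

```shell
#!/bin/bash
# Sketch: once too many trainer Pods have failed, record the reason in the
# termination log and exit 0 so Kubernetes stops creating replacement Pods.

MAX_FAILS="${MAX_FAILS:-3}"
# Inside a Pod this would be /dev/termination-log.
TERMINATION_LOG="${TERMINATION_LOG:-/tmp/termination-log}"

check_failed_cnt() {
  local failed=$1   # number of failed Pods observed so far (from the k8s API)
  if [ "$failed" -ge "$MAX_FAILS" ]; then
    # Write the reason where the paddlecloud CLI can fetch it, then signal
    # the caller to exit 0 so the Pod is marked Succeeded, not retried.
    echo "trainer failed $failed times; giving up" > "$TERMINATION_LOG"
    return 0
  fi
  return 1
}

if check_failed_cnt 5; then
  echo "exiting cleanly: $(cat "$TERMINATION_LOG")"
  # a real script would `exit 0` here
fi
```

Exiting 0 works because a Kubernetes Job only replaces Pods that terminate with a non-zero status; a clean exit counts as success and halts the restart loop.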
@Yancey1989 @typhoonzero How is the endless creation of new Pods resolved? By setting a retry limit?
Force `exit 0`. Needs to be verified.
The Paddle trainers are scheduled by a Kubernetes Job: whenever a Pod fails, Kubernetes starts a new one. So if the uploaded `train.py` exits with a non-zero status, more and more Pods pile up in the Error state, and this never stops unless the user kills the job manually. Here is the design doc for the backoff policy and failed Pod limit.
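The backoff-policy design mentioned above later became the Job `backoffLimit` field in the Kubernetes API. A hedged sketch of how a trainer Job could cap retries once that field is available (names and image are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-trainer        # illustrative name
spec:
  backoffLimit: 3             # mark the Job Failed after 3 failed Pods
  template:
    spec:
      restartPolicy: Never    # let the Job controller count failures per Pod
      containers:
      - name: trainer
        image: paddlepaddle/paddle   # illustrative image
        command: ["python", "train.py"]
```

With `backoffLimit` set, the Job controller stops creating replacement Pods after the limit is hit, so the workaround of forcing `exit 0` from inside the container is no longer needed.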