PaddlePaddle / PaddleCloud

PaddlePaddle Docker images and K8s operators for PaddleOCR/Detection developers to use on public/private cloud.
Apache License 2.0

Kubernetes creates too many Pods when a trainer fails #149

Closed Yancey1989 closed 7 years ago

Yancey1989 commented 7 years ago

The Paddle trainers are scheduled as a Kubernetes Job. When any Pod fails, Kubernetes starts a new Pod, so if the uploaded train.py exits with a non-zero code, more and more Pods accumulate in the Error status. This never stops until the user kills the job manually.

Here is the design doc for the backoff policy and the failed Pod limit.

typhoonzero commented 7 years ago

Urgent issue!

typhoonzero commented 7 years ago

We can add a workaround: periodically check the job status and mark the job as failed if too many Pods are failing. I'll try to add this feature today.
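
A minimal sketch of the polling workaround described above, not the code that was actually merged. The threshold value and the two callbacks, `list_pod_phases` and `fail_job`, are assumptions made for illustration; in practice they would wrap calls to the Kubernetes API. The phase names are the standard Pod phases.

```python
import time

# Pod phases in which a trainer may still make progress.
ACTIVE_PHASES = ("Pending", "Running", "Unknown")

# Hypothetical threshold; the thread does not give a value.
MAX_FAILED_PODS = 3


def should_fail_job(pod_phases, max_failed=MAX_FAILED_PODS):
    """Return True when more than max_failed Pods are in the Failed phase."""
    failed = sum(1 for phase in pod_phases if phase == "Failed")
    return failed > max_failed


def watch_job(list_pod_phases, fail_job, interval_seconds=10.0, sleep=time.sleep):
    """Poll the job until it finishes or too many of its Pods have failed.

    list_pod_phases and fail_job are hypothetical callbacks: the first
    returns the current phase of every Pod of the job, the second kills
    the job and marks it as failed.
    """
    while True:
        phases = list_pod_phases()
        if should_fail_job(phases):
            fail_job()
            return "failed"
        # Stop polling once no Pod is pending or running anymore.
        if phases and not any(phase in ACTIVE_PHASES for phase in phases):
            return "finished"
        sleep(interval_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to test without waiting for real time to pass.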

Yancey1989 commented 7 years ago

Thanks to @emailweixu for the suggestion. Maybe we can add a function before https://github.com/PaddlePaddle/cloud/blob/develop/docker/paddle_k8s#L28: when the number of failures exceeds the threshold, return 0 and write a message into /dev/termination-log, so that the paddlecloud command line can fetch the error message.
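
A rough sketch of that guard, under stated assumptions: the function names, the threshold, and the way `failed_count` is obtained are all hypothetical, and the real `paddle_k8s` script may look quite different. Only `/dev/termination-log` comes from the comment above; it is the default Kubernetes termination message path.

```python
import subprocess

TERMINATION_LOG = "/dev/termination-log"

# Hypothetical threshold; the thread does not give a value.
MAX_FAILURES = 3


def guard_trainer(failed_count, max_failures=MAX_FAILURES, log_path=TERMINATION_LOG):
    """Return True if the trainer may start, False if it should give up.

    failed_count is assumed to be supplied by the caller, for example by
    counting the job's Pods that are already in the Failed phase.
    """
    if failed_count <= max_failures:
        return True
    # Leave a message that can be read back from the Pod's termination status.
    with open(log_path, "w") as log:
        log.write(
            "trainer failed %d times, exceeding the limit of %d; giving up\n"
            % (failed_count, max_failures)
        )
    return False


def start_trainer(command, failed_count, max_failures=MAX_FAILURES,
                  log_path=TERMINATION_LOG):
    """Return the exit code the container should exit with."""
    if not guard_trainer(failed_count, max_failures, log_path):
        # Exit 0 so that the Job stops creating replacement Pods.
        return 0
    return subprocess.call(command)
```

The entry point would then end with something like `sys.exit(start_trainer(["python", "train.py"], failed_count))`. Note the trade-off: exiting with 0 makes the Pod look successful to Kubernetes, so the termination message is the only place where the failure remains visible.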

typhoonzero commented 7 years ago

Will test and fix~

pineking commented 7 years ago

@Yancey1989 @typhoonzero How was the problem of new Pods being created endlessly solved? By setting a limit on the number of retries?

typhoonzero commented 7 years ago

By forcing exit 0. This still needs to be verified.