Closed: iszulcdeepsense closed this issue 1 month ago
- Is there a failing example for this issue?
Sort of. In the case of requesting too many CPU cores, like this:
```yaml
resources:
  cpu_min: 10M  # 10M is not milli, but Mega
```
Kubernetes accepts it at the `kubectl apply` stage, but then it's unable to create a pod. So there's room for improvement in getting this information back to the user.
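The pitfall here is that Kubernetes quantity suffixes are case-sensitive: `m` means milli (1/1000 of a core), while `M` means Mega (10^6). A minimal sketch of that parsing logic (a simplified illustration, not the actual Kubernetes quantity parser, which also supports binary suffixes like `Mi`):

```python
# Kubernetes resource quantities use case-sensitive suffixes.
# "m" means milli (1/1000), while "M" means Mega (10^6) -- so a typo
# turns a request for 0.01 CPU cores into 10,000,000 cores.
SUFFIXES = {"m": 1e-3, "k": 1e3, "M": 1e6, "G": 1e9}

def parse_quantity(q: str) -> float:
    """Parse a simplified subset of the Kubernetes quantity format."""
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)

print(parse_quantity("10m"))  # 0.01 cores -- what the user meant
print(parse_quantity("10M"))  # 10,000,000 cores -- what the user asked for
```

No cluster has ten million cores, so the scheduler can never place the pod, yet the manifest itself is syntactically valid and `kubectl apply` succeeds.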
- Will `racetrack deploy job.yaml` succeed while `kubectl apply` runs in the kubernetes plugin? Where is the timeout?
`racetrack deploy job.yaml` will fail, but only after 15 minutes (after the timeout).
The timeout comes from the liveness probe check: Lifecycle tries to check the `/live` endpoint of a job, but since the pod doesn't exist, it keeps waiting for it to appear until the timeout:
https://github.com/TheRacetrack/racetrack/blob/5bb2f98e7589e5dd549c28c21ea91ff7c29dfe8b/lifecycle/lifecycle/monitor/health.py#L36
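The waiting behavior described above can be sketched as a polling loop with a deadline (hypothetical names and intervals; a simplified illustration, not Racetrack's actual implementation):

```python
import time
import urllib.request

# Illustrative values: the 15-minute timeout mentioned above,
# polled every few seconds (actual intervals are assumptions).
LIVENESS_TIMEOUT_S = 15 * 60
POLL_INTERVAL_S = 5

def wait_until_live(url: str, timeout: float = LIVENESS_TIMEOUT_S) -> bool:
    """Poll the /live endpoint until it responds or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True  # the job came up
        except OSError:
            pass  # pod doesn't exist yet -- keep polling
        time.sleep(POLL_INTERVAL_S)
    return False  # gave up only after the full timeout
```

The loop has no way to distinguish "pod is still starting" from "pod can never be scheduled", which is exactly why a doomed deployment burns the full 15 minutes.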
- Do we want this feature to be in lifecycle-supervisor, or is it something for the kubernetes plugin?
Good question. It will probably end up in the kubernetes plugin, but where possible we could move generic parts into Lifecycle (so the same logic isn't repeated in many plugins).
Sometimes the deployment of a job gets stuck, for instance when Kubernetes can't create a new pod due to insufficient resources in the cluster. The deployment command (`kubectl apply`) succeeds, but the real deployment happens in the background later on; then Kubernetes determines that the cluster is out of memory and reports an error event. However, Racetrack doesn't know that and keeps waiting patiently for the pod to be created, until the timeout occurs. This makes the deployment process unnecessarily long in this case. It would be better to read the k8s events, or get notified of them, so that Racetrack can show a meaningful error to the user and abort immediately when the error happens.
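Reading the events could look roughly like this: scan the Warning events emitted for the namespace and treat certain reasons as unrecoverable. This is a sketch over plain dicts mirroring `kubectl get events` output; the set of fatal reasons and the helper name are assumptions, not Racetrack code:

```python
from typing import Optional

# Event reasons treated as unrecoverable (an illustrative, incomplete set;
# "FailedScheduling" is what k8s emits when no node can fit the pod).
FATAL_REASONS = {"FailedScheduling", "FailedCreate"}

def find_fatal_event(events: list) -> Optional[str]:
    """Return an error message if any Warning event looks unrecoverable."""
    for event in events:
        if event.get("type") == "Warning" and event.get("reason") in FATAL_REASONS:
            return f"{event['reason']}: {event.get('message', '')}"
    return None

events = [
    {"type": "Normal", "reason": "Scheduled", "message": "assigned pod to node"},
    {"type": "Warning", "reason": "FailedScheduling",
     "message": "0/3 nodes are available: insufficient cpu"},
]
print(find_fatal_event(events))
# FailedScheduling: 0/3 nodes are available: insufficient cpu
```

With such a check running alongside the liveness poll, the deployment could abort with "insufficient cpu" within seconds instead of after 15 minutes.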