TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

Read deployment failure from Kubernetes events #402

Closed iszulcdeepsense closed 1 month ago

iszulcdeepsense commented 5 months ago

Sometimes deployment of a job gets stuck if, for instance, Kubernetes can't create a new pod due to insufficient resources in the cluster. The deployment command succeeds (kubectl apply), but the real deployment happens in the background later on, and then Kubernetes states that the cluster is out of memory and reports an error event. Racetrack doesn't know that, however, and still waits patiently for the pod to be created, until the timeout occurs. This makes the deployment process unnecessarily long in this case. It would be better to read the k8s events (or get notified when such an error happens), show a meaningful error to the user, and abort immediately.

anders314159 commented 2 months ago
  1. Is there a failing example for this issue?
  2. Will $ racetrack deploy job.yaml succeed while kubectl apply runs in the kubernetes plugin? Where is the timeout?
  3. Do we want this feature to be in lifecycle-supervisor, or is it something in the kubernetes plugin?
iszulcdeepsense commented 2 months ago

@anders314159

  1. Is there a failing example for this issue?

Sort of. In the case of requesting too many CPU cores, like this:

resources:
  cpu_min: 10M # 10M is not millis, but Mega

Kubernetes says it's okay at the kubectl apply stage, but then it's unable to create the pod. So there's room for improvement in getting this information back to the user.
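To illustrate how large the gap is, here is a minimal sketch of parsing CPU quantities with just these two suffixes (the real Kubernetes quantity grammar supports many more):

```python
# "10M" means mega (10 million cores), while "10m" means milli (0.01 cores) —
# a one-character typo that differs by nine orders of magnitude.

SUFFIXES = {'m': 0.001, 'M': 1_000_000}

def parse_cpu_quantity(value: str) -> float:
    """Return the number of CPU cores represented by a quantity string."""
    if value and value[-1] in SUFFIXES:
        return float(value[:-1]) * SUFFIXES[value[-1]]
    return float(value)
```

No cluster can satisfy a request for 10 million cores, which is why the pod is never scheduled.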

  2. Will $ racetrack deploy job.yaml succeed while kubectl apply runs in the kubernetes plugin? Where is the timeout?

racetrack deploy job.yaml will fail, but only after 15 minutes (after the timeout). The timeout comes from the liveness probe check: Lifecycle tries to check the /live endpoint of a job, but since the pod doesn't exist, it waits indefinitely until it appears: https://github.com/TheRacetrack/racetrack/blob/5bb2f98e7589e5dd549c28c21ea91ff7c29dfe8b/lifecycle/lifecycle/monitor/health.py#L36
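The waiting loop could be extended to also poll for fatal events and abort as soon as one appears. A sketch, with hypothetical function names (not Racetrack's actual API) and the liveness/event checks injected as callables:

```python
import time

def wait_for_pod(check_live, find_fatal_error, timeout: float = 900, interval: float = 0.01) -> str:
    """Poll until the job is live, a fatal event appears, or the timeout hits.

    check_live() -> bool: whether the /live endpoint responds.
    find_fatal_error() -> str | None: an error message from k8s events, if any.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        error = find_fatal_error()
        if error is not None:
            # Abort immediately with a meaningful error instead of waiting
            # the full 15 minutes for the liveness timeout.
            raise RuntimeError(f'deployment failed: {error}')
        if check_live():
            return 'live'
        time.sleep(interval)
    raise TimeoutError('job did not become live in time')
```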

  3. Do we want this feature to be in lifecycle-supervisor, or is it something in the kubernetes plugin?

Good question. It will probably end up in the kubernetes plugin, but if possible we could move some generic parts into Lifecycle (so we don't repeat the same thing in many plugins).