exacaster / lighter

REST API for Apache Spark on K8S or YARN
MIT License

Zombie checks #722

Open jmilkiewicz opened 12 months ago

jmilkiewicz commented 12 months ago

Looking through the code of Lighter, I found that there is a check for "zombies". I imagine it is done to cover the case in which a session is somehow killed/terminated without any intervention from Lighter itself, for example when the session pod is killed manually (in the case of k8s). I'm just wondering why there is a 30-minute window: https://github.com/exacaster/lighter/blob/fc76c23499a8ea863b5105708a805df0848f7bba/server/src/main/java/com/exacaster/lighter/application/ApplicationStatusHandler.java#L88 Is 30 minutes some kind of "good enough" value? BTW, the 30 minutes seems to be out of sync with the log message you put into log storage: https://github.com/exacaster/lighter/blob/fc76c23499a8ea863b5105708a805df0848f7bba/server/src/main/java/com/exacaster/lighter/application/ApplicationStatusHandler.java#L93-L94
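
For context, here is a minimal sketch of the time-window pattern I understand that check to implement (this is not the actual Lighter code; names and structure are illustrative):

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Illustrative sketch only, not the actual Lighter implementation:
// an application is treated as a "zombie" once the backend has reported
// nothing about it for longer than a fixed window.
final class ZombieCheckSketch {
    private static final Duration ZOMBIE_WINDOW = Duration.ofMinutes(30);

    /** True when the last successful contact with the app is older than the window. */
    static boolean isZombie(LocalDateTime lastContactedAt) {
        return lastContactedAt != null
                && lastContactedAt.isBefore(LocalDateTime.now().minus(ZOMBIE_WINDOW));
    }
}
```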

My biggest concern is: if the session pod is somehow killed (by accident, by the k8s scheduler, or whatever), Lighter will find out only after at least 30 minutes. Until then the session will be in the IDLE state and statements can still be submitted to it. Effectively, these statements will never be executed...

pdambrauskas commented 12 months ago

Looking through the code of Lighter, I found that there is a check for "zombies". I imagine it is done to cover the case in which a session is somehow killed/terminated without any intervention from Lighter itself, for example when the session pod is killed manually (in the case of k8s).

I think that is not the case; if a session/batch job was killed externally, Lighter would get a not-found response and could mark the session/job as failed without a zombie check.

If I remember correctly (@Minutis maybe you do?), this was originally implemented to work around cases when the Kubernetes driver pod is being re-scheduled on a different node. In our case we were running Spark batch applications on AWS spot instances, and there were some cases when a node was taken away and its pods were re-scheduled on a different node. For some time the k8s API did not return any info about the pod.

This functionality only makes sense for that specific setup; I think we could make it configurable and allow users to switch it off.

It looks like you have an unusual use case for Lighter sessions; maybe it would make sense for you to look into our "Permanent session" feature - there is a way to configure a session that should run forever, and in cases when it dies or is killed by external forces, Lighter will recreate that session under the same ID. We didn't have time to document this feature yet (we'll do it eventually).

jmilkiewicz commented 12 months ago

I think that is not the case; if a session/batch job was killed externally, Lighter would get a not-found response and could mark the session/job as failed without a zombie check.

I can do some extra checks, but I think you are wrong, at least for sessions. Putting a session into the Error state happens in two places:

The first case is pretty simple and happens on an error after the session is launched. The second case is more interesting, as checkZombie is called when Lighter can NOT get ApplicationInfo from the backend - in k8s, when the pod is gone. A pod can be gone because the node it was running on has been restarted/removed, because the k8s scheduler decided to kill the pod, or maybe because of human error. No matter what the reason was, the session state is set to ApplicationState.ERROR only after at least 30 minutes, not earlier... I think this 30-minute window is too long.

I did some tests with Lighter 0.0.50 on k8s. When I killed my session pod (kubectl delete pod ...), the state of the session was updated to 'error' only after 30 minutes.

I am just wondering if anything can be done about it. Of course, one option is to simply make it configurable and let users decide what the "zombie" time should be. On the other hand, I am wondering whether it would make sense to somehow discover these error cases earlier, for example by using something like container lifecycle hooks on k8s: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
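
For example, instead of (or in addition to) a lifecycle hook, the missing driver pod could be noticed as soon as it disappears by watching pod events rather than polling. A rough sketch, assuming the fabric8 Kubernetes client is available (the namespace, label selector and reaction below are made up for illustration):

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Rough sketch: react to driver pod deletions as they happen,
// instead of waiting for the next polling cycle to notice the pod is gone.
public class DriverPodWatchSketch {
    public static void main(String[] args) {
        KubernetesClient client = new KubernetesClientBuilder().build();
        client.pods()
                .inNamespace("spark")                 // illustrative namespace
                .withLabel("spark-role", "driver")    // standard Spark-on-k8s driver label
                .watch(new Watcher<Pod>() {
                    @Override
                    public void eventReceived(Action action, Pod pod) {
                        if (action == Action.DELETED) {
                            // Here the matching session could be marked as errored right away.
                            System.out.println("Driver pod deleted: " + pod.getMetadata().getName());
                        }
                    }

                    @Override
                    public void onClose(WatcherException cause) {
                        // The watch was closed; a real implementation would re-establish it.
                    }
                });
    }
}
```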

My biggest concern is that if something bad happens to a session (like being killed permanently or being rescheduled onto another node), it can take too much time before I can find out about it (via the Lighter REST API). If the session pod is killed, I can still submit statements to it, which will never be executed, and after 30 minutes the session will end up in the error state.

Permanent sessions could be helpful, but at some point I will need to kill them - sessions occupy too many resources, and because I am trying to use custom Python libraries I cannot easily reuse them.

pdambrauskas commented 12 months ago

Yes, that is how it works now. If the pod is killed (but not deleted) it should be marked as "error" sooner, but if it is removed, it takes 30 minutes.

The zombie check was implemented without sessions in mind, and the value was set to 30 minutes when we were trying to work around an issue in our infrastructure - something related to longer spot instance start times and pod recreation, I can't really remember what exactly.

I'm not saying it is configurable now, but we can make it configurable. In many cases it does not make sense to have this check at all, since we could treat a "not found" pod as an error right away.
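
If it helps, here is a hypothetical sketch of what the configurable variant could look like, assuming Micronaut-style configuration properties (the property names below do not exist in Lighter today; they are made up for illustration):

```java
import io.micronaut.context.annotation.ConfigurationProperties;
import java.time.Duration;

// Hypothetical configuration holder: both the window length and an on/off
// switch for the zombie check. Property names are invented for this sketch.
@ConfigurationProperties("lighter.zombie")
public class ZombieCheckProperties {
    private boolean enabled = true;                    // allow disabling the check entirely
    private Duration window = Duration.ofMinutes(30);  // current hard-coded default

    public boolean isEnabled() { return enabled; }
    public void setEnabled(boolean enabled) { this.enabled = enabled; }

    public Duration getWindow() { return window; }
    public void setWindow(Duration window) { this.window = window; }
}
```

With something like this, setting the (hypothetical) `lighter.zombie.enabled` flag to false could switch the check off for setups where a "not found" pod should be treated as an error immediately.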

Minutis commented 11 months ago

Sorry for the delayed response.

I think the zombie case was also happening with the YARN backend. When YARN as a whole service is restarted, all information about running jobs is gone. That means that if any batch applications were running and YARN was restarted, Lighter could no longer get any information about those jobs and marked them as zombies after the delay.

I think it makes sense to make it configurable. It could also be handled differently in different backends, but of course the code would be harder to manage, since the same functionality would need a different implementation depending on the backend serving the jobs.