flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces
Apache License 2.0

Passing of duration / timeout from jobs / pods to fluence #70

Open vsoch opened 3 months ago

vsoch commented 3 months ago

Currently, the default time limit for fluence is one hour, so if the Kubernetes abstraction (pod, job, minicluster, etc.) sets a different limit, the two are not in sync. For example, a Kubernetes job that needs more than an hour might not be cancelled by Kubernetes until hour 2, but fluence will hit the 1 hour mark (its default) and cancel the job too early.

Another issue (not scoped to fluence, but related to timing) is the timeout for a Job that has pods in a group. For example, for a Job abstraction (from here):

> The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.

I think this means that, given an MPI job that spans nodes, the timing will start when the first pod in the job is running. If there is a large delay before the last pod is up (when the job can truly start), we don't actually get the runtime we asked for, but the runtime minus the time spent waiting for all pods to be up. In the context of fluence, we are again not accounting for the waiting time. If the pods in the group are quick to schedule, this likely won't be an issue. But if the delay comes close to the total runtime needed, we might want to mutate the time to allow for it. Given the above, what seems like a good idea is to set the timeout to a value at which the job would be deemed unreasonably long running, rather than setting it close to the actual expected runtime.
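To make the arithmetic concrete, here is a minimal sketch of how waiting time eats into the deadline. The function names are hypothetical and not part of fluence or Kubernetes, and the sketch assumes each pod's start is measured in seconds from whenever the deadline clock begins.

```python
def effective_runtime(deadline_seconds, pod_start_offsets):
    """Seconds the full group can actually run before the deadline hits.

    pod_start_offsets holds, for each pod, the number of seconds after the
    deadline clock started at which that pod was running (an assumption of
    this sketch). The group can only begin once the last pod is up.
    """
    wait = max(pod_start_offsets)
    return max(0, deadline_seconds - wait)


def padded_deadline(runtime_seconds, expected_wait_seconds):
    """Deadline that leaves room for both the wait and the requested runtime."""
    return runtime_seconds + expected_wait_seconds


# A two hour deadline where the last of three pods comes up after 25 minutes.
remaining = effective_runtime(7200, [0, 30, 1500])
padded = padded_deadline(7200, 1500)
```

With these numbers the group only gets 5700 of the 7200 seconds it asked for, and the deadline would need to be 8700 seconds to preserve the full runtime.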

For the first (simpler) issue, we basically need to pass forward any duration / time limits set on a pod or group abstraction to fluence. I discussed this with @milroy today; please add anything that I forgot.
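As a rough illustration of passing the limit forward, the sketch below picks a time limit from plain dictionaries that stand in for a pod spec and a Job spec. The function `resolve_time_limit` is hypothetical and is not fluence code; `activeDeadlineSeconds` is the real Kubernetes field name, and taking the smaller value when both are set is an assumption of this sketch, not a decision made in this issue.

```python
DEFAULT_LIMIT_SECONDS = 3600  # the one hour fluence default described above


def resolve_time_limit(pod_spec, job_spec=None, default=DEFAULT_LIMIT_SECONDS):
    """Return the time limit (seconds) to hand to the scheduler.

    pod_spec and job_spec are plain dicts standing in for the Kubernetes
    objects (hypothetical inputs for illustration). If neither sets
    activeDeadlineSeconds, fall back to the default.
    """
    candidates = []
    for spec in (pod_spec, job_spec):
        if not spec:
            continue
        value = spec.get("activeDeadlineSeconds")
        if value is not None and value > 0:
            candidates.append(int(value))
    # Assumption: when both are set, the shorter deadline is the one that
    # would be enforced first, so use it.
    return min(candidates) if candidates else default


no_limit = resolve_time_limit({})
pod_only = resolve_time_limit({"activeDeadlineSeconds": 7200})
both = resolve_time_limit({"activeDeadlineSeconds": 7200},
                          {"activeDeadlineSeconds": 5400})
```

Here a pod with no deadline keeps the one hour default, while a two hour deadline on the pod is carried through instead of being cut off at one hour.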

vsoch commented 3 months ago

Also pinging @cmisale in case you have comments! (Sorry, I meant to in the initial post and it flew out my left ear like a squirrel with his butt on fire). :fire: