AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Correct Suspend/Resume backoffLimit for Pathways #157

Closed: SujeethJinesh closed this 4 months ago

SujeethJinesh commented 4 months ago

Fixes / Features

Testing / Documentation

Testing details - tested on v5e slices manually.

SujeethJinesh commented 4 months ago

QQ: Do we want to enable the backoff_limit based on a flag for suspend/resume?

Would the backoffLimit hide any other issues and keep attempting to schedule the workload despite errors?

@RoshaniN We've thought about adding a flag for backoff_limit, but it would make it harder for users to work out what value to set. They could easily forget to multiply the backoff limit by vms_per_slice.

Perhaps a flag like "num_retries" would be better, and we automatically multiply it by vms_per_slice? What do you think?
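
To make the proposal concrete, here is a minimal sketch of the automatic calculation. The helper name `compute_backoff_limit` is hypothetical and not part of xpk; `num_retries` and `vms_per_slice` follow the names used in this thread. It assumes that every VM in a slice contributes one pod failure per interruption, which is why the user-facing retry count is scaled by `vms_per_slice`.

```python
def compute_backoff_limit(num_retries: int, vms_per_slice: int) -> int:
    """Hypothetical helper: scale a user-facing retry count to a backoffLimit.

    Assumption: one suspend/resume interruption fails one pod per VM in the
    slice, so tolerating `num_retries` interruptions needs
    num_retries * vms_per_slice pod failures.
    """
    if num_retries < 0:
        raise ValueError("num_retries must be non-negative")
    if vms_per_slice < 1:
        raise ValueError("vms_per_slice must be at least 1")
    return num_retries * vms_per_slice


if __name__ == "__main__":
    # Illustrative values only: 3 retries on a slice with 4 VMs.
    print(compute_backoff_limit(num_retries=3, vms_per_slice=4))
```

With this shape the user only reasons about "how many retries", and the multiplication stays an internal detail.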

RoshaniN commented 4 months ago

I like the automatic calculation of backoffLimit. My concern is whether it will mask actual issues and keep trying to schedule the workload.

I added some logic to exit on user-code errors, but do we know what happens for non-user-code errors? I'm mostly thinking about ImagePullBackOff and similar failures.
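
One way to picture the concern is a check that stops retrying when a pod is stuck for a reason that more retries are unlikely to fix. The sketch below is purely illustrative and is not what xpk or the linked PR does: the function name and the set of reasons are assumptions, and treating these reasons as non-retryable is itself a judgment call, since an image pull failure can also be transient.

```python
# Illustrative, incomplete set of Kubernetes container waiting reasons that
# usually point at a configuration problem rather than a preemption.
# Assumption: these should not consume retries. Not taken from xpk.
NON_RETRYABLE_WAITING_REASONS = frozenset({
    "ImagePullBackOff",
    "ErrImagePull",
    "InvalidImageName",
    "CreateContainerConfigError",
})


def should_stop_retrying(waiting_reasons):
    """Hypothetical check: True if any container reports a non-retryable reason.

    `waiting_reasons` is an iterable of reason strings, for example collected
    from each container's `state.waiting.reason` field in the pod status.
    """
    return any(reason in NON_RETRYABLE_WAITING_REASONS for reason in waiting_reasons)


if __name__ == "__main__":
    print(should_stop_retrying(["ContainerCreating", "ImagePullBackOff"]))
    print(should_stop_retrying(["ContainerCreating"]))
```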

SujeethJinesh commented 4 months ago

The PR at https://github.com/google/xpk/pull/134 will solve the masked errors. I still need to test that out though.