Closed SujeethJinesh closed 4 months ago
QQ: Do we want to enable the backoff_limit based on a flag for suspend/resume?
Would the backoff_limit hide any other issues and keep attempting to schedule the workload despite errors?
@RoshaniN We've thought about adding a flag for backoff_limit, but it would make it harder for users to know what value to set. It would be easy to forget to multiply the backoff limit by vms_per_slice.
Perhaps a flag like "num_retries" could be better and we automatically multiply it by the vms_per_slice? What do you think?
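To make the proposal concrete, here is a minimal sketch of the automatic calculation. The function name `backoff_limit_from_retries` and the validation are hypothetical, not existing xpk code, and it assumes one workload retry can consume up to one backoff count per VM in the slice:

```python
def backoff_limit_from_retries(num_retries: int, vms_per_slice: int) -> int:
    """Derive a Job backoff limit from a user-facing retry count.

    Hypothetical helper: the user supplies only num_retries, and the
    multiplication by vms_per_slice happens here instead of by hand.
    """
    if num_retries < 0:
        raise ValueError("num_retries must be non-negative")
    if vms_per_slice < 1:
        raise ValueError("vms_per_slice must be at least 1")
    # Assumption: each retry may count one failure per VM in the slice.
    return num_retries * vms_per_slice
```

With this shape, a user asking for 3 retries on a slice of 4 VMs would get a backoff limit of 12 without having to know about the multiplication.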
I like the automatic calculation of backoffLimit. I am wondering whether it will mask actual issues and keep trying to schedule the workload?
I added some logic to exit on user-code errors, but do we know what happens for non-user-code errors? I'm mostly thinking about ImagePullBackOff and similar failures.
The PR at https://github.com/google/xpk/pull/134 should address the masked errors. I still need to test that, though.
Fixes / Features
Testing / Documentation
Testing details - tested on v5e slices manually.