Currently, SLA-aware killing is only possible for prod tier tasks. Since SLA-aware killing is intended to be used with only a limited subset of jobs in the cluster, this approach is understandable.
However, for existing clusters that don't use tiering, this makes enabling SLA-aware killing a significant challenge. All jobs in the cluster would have to be recreated with a production tier attached, and a quota would have to be added for every role in the cluster. Furthermore, any task that uses a new role would require a quota to be set for that role.
Given these issues, I propose adding a flag that allows operators to enable SLA-aware killing for non-production tasks. The flag would be disabled by default.
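As a rough sketch of the proposed behavior, the eligibility check could look something like the following. This is illustrative only: `Task`, `is_prod`, `has_sla_policy`, `sla_aware_kill_allowed`, and `allow_non_prod` are hypothetical names, not taken from Aurora's code base or from the POC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a scheduled task; not Aurora's real type."""
    is_prod: bool          # True if the task runs in the production tier
    has_sla_policy: bool   # True if the job declares an SLA policy


def sla_aware_kill_allowed(task: Task, allow_non_prod: bool = False) -> bool:
    """Decide whether a task should go through SLA-aware killing.

    `allow_non_prod` models the proposed operator flag (hypothetical name).
    It defaults to False, which keeps today's prod-only behavior.
    """
    if not task.has_sla_policy:
        # Without an SLA policy there is nothing to enforce.
        return False
    # Prod tasks are always eligible; others only when the flag is on.
    return task.is_prod or allow_non_prod
```

With the flag left at its default, a non-prod task is not eligible; with it enabled, any task that declares an SLA policy is eligible, without needing a tier change or a role quota.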
@shanmugh, it would be great to get your thoughts on this if you have some time.
I have a POC ready to be reviewed if no one is opposed to this idea: https://github.com/rdelval/aurora/commit/31bc9b4622220f360a812c7b8b66cf5c95578bfd