This PR changes our strategy for creating resources in Kubernetes. Currently we create a single replica set for all workers. This is convenient, since the RS controller creates all of the pods we need, but it's inflexible. With this change, we manage the pods directly.
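As a rough illustration of the difference, the sketch below builds one Pod manifest per worker instead of a single replica set. It is not the code in this PR: the function name, the naming scheme, the labels, and the image are all hypothetical.

```python
from typing import Dict, List


def worker_pod_manifests(job_id: str, image: str, workers: int) -> List[Dict]:
    """Build one Pod manifest per worker (hypothetical naming and labels)."""
    return [
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {
                # Each worker gets its own name, so it can be created,
                # inspected, and deleted individually.
                "name": f"{job_id}-worker-{i}",
                "labels": {"job": job_id, "worker-index": str(i)},
            },
            "spec": {
                "containers": [{"name": "worker", "image": image}],
            },
        }
        for i in range(workers)
    ]


if __name__ == "__main__":
    for pod in worker_pod_manifests("pipeline-1", "example/worker:latest", 3):
        print(pod["metadata"]["name"])
```

Because each pod is its own object, the controller can vary per-worker settings or replace a single worker, which a replica set's shared pod template does not allow.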
In addition, this PR makes a few more changes to improve the robustness of our k8s scheduler:
- Prevents container restarts (with `restartPolicy: Never`), as a container restart always requires handling by the controller
- When deployed via helm, the controller now adds itself as the owner of the pods it creates, which ensures they're cleaned up when the controller is deleted
- Adds a timeout to the part of scheduling where we wait for tasks to start up; previously, if a pod failed in this stage, scheduling could be blocked indefinitely
- Adds a check at the start of scheduling for whether the pipeline should be stopped, so that pipelines marked as stopped are not rescheduled after a controller restart
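The changes in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the scheduler's actual code: `harden_pod`, `wait_for_start`, `schedule`, `StartupTimeout`, and the callback parameters are hypothetical names, and treating a `Failed` pod as an immediate error is an assumption about the desired behavior. Only `restartPolicy`, `ownerReferences`, and the pod phase names come from the Kubernetes API.

```python
import time
from typing import Callable, Dict, List, Optional


class StartupTimeout(Exception):
    """Raised when workers do not start within the allowed time."""


def harden_pod(manifest: Dict, owner: Optional[Dict]) -> Dict:
    """Return a copy with restarts disabled and, if given, an owner reference."""
    pod = dict(manifest)
    pod["metadata"] = dict(manifest.get("metadata", {}))
    pod["spec"] = dict(manifest.get("spec", {}))
    pod["spec"]["restartPolicy"] = "Never"
    if owner is not None:
        # With an owner set, Kubernetes garbage-collects the pod once the
        # owner is deleted.
        pod["metadata"]["ownerReferences"] = [{
            "apiVersion": owner["apiVersion"],
            "kind": owner["kind"],
            "name": owner["name"],
            "uid": owner["uid"],
        }]
    return pod


def wait_for_start(get_phases: Callable[[], List[str]], timeout_s: float,
                   poll_s: float = 1.0) -> None:
    """Poll pod phases until all are Running, one fails, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        phases = get_phases()
        if "Failed" in phases:
            raise RuntimeError("a worker pod failed during startup")
        if phases and all(p == "Running" for p in phases):
            return
        if time.monotonic() >= deadline:
            raise StartupTimeout(f"workers not running after {timeout_s}s")
        time.sleep(poll_s)


def schedule(should_stop: Callable[[], bool], create_pods: Callable[[], None],
             get_phases: Callable[[], List[str]], timeout_s: float) -> bool:
    """Return False without creating anything if the pipeline is marked stopped."""
    if should_stop():
        return False
    create_pods()
    wait_for_start(get_phases, timeout_s)
    return True
```

Checking `should_stop` before creating any pods is what keeps a restarted controller from rescheduling a pipeline that was already marked as stopped.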
Also raises the default k8s resource requests to values more reasonable for production apps.