The cc_deployment_updater job has an incorrect readiness probe; when it is in HA, it can get jammed and never become ready (see below). Rather than properly fixing it (by moving it into a separate instance group and marking it as active/passive), just drop the job completely as it does not appear to be necessary.
Motivation and Context
(Partially) fixes #1589 — we'll also want to pull in #1632 for upgrades.
How Has This Been Tested?
Finished 🐱 locally.
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[ ] My code has security implications.
[x] My code follows the code style of this project.
[ ] My change requires a change to the documentation.
[ ] I have updated the documentation accordingly.
Additional Information:
Details on the readiness probe problems:
The probe was unaware that it was supposed to be active/passive, and just checks the logs to ensure all instances are sleeping once every five seconds. This only happened to work on passive nodes because the test incorrectly succeeded when an instance has never been ready (because jq happily accepted empty input and returned success). This means that if a pod was ever active, then turned passive again (e.g. because it was being restarted and another instance became active), it would have a stale line that matched the probe and therefore start getting marked as failed.
Description
The
cc_deployment_updater
job has an incorrect readiness probe; when it is in HA, it can get jammed and never become ready (see below). Rather than properly fixing it (by moving it into a separate instance group and marking it as active/passive), just drop the job completely as it does not appear to be necessary.Motivation and Context
(Partially) fixes #1589 — we'll also want to pull in #1632 for upgrades.
How Has This Been Tested?
Finished 🐱 locally.
Types of changes
Checklist:
Additional Information:
Details on the readiness probe problems:
The probe was unaware that it was supposed to be active/passive, and just checks the logs to ensure all instances are sleeping once every five seconds. This only happened to work on passive nodes because the test incorrectly succeeded when an instance has never been ready (because
jq
happily accepted empty input and returned success). This means that if a pod was ever active, then turned passive again (e.g. because it was being restarted and another instance became active), it would have a stale line that matched the probe and therefore start getting marked as failed.