Closed haasken-hpe closed 3 months ago
Testing on rocket has been completed. The step that un-suspends the hms-discovery cronjob and waits for a job to be scheduled now completes very quickly in my testing thanks to the minor tweaks made here.
Before executing the sat bootsys boot --stage cabinet-power command:
ncn-m001:~ # kubectl get cronjobs -n services hms-discovery
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   True      0        8m5s            8h
ncn-m001:~ # kubectl get jobs -n services -l cronjob-name=hms-discovery
NAME                     COMPLETIONS   DURATION   AGE
...
hms-discovery-28697652   1/1           81s        9m45s
hms-discovery-28697655   0/1           8m15s      8m15s
Executing the command:
ncn-m001:~/haasken # sat bootsys boot --stage cabinet-power
INFO: Resuming cronjob hms-discovery in namespace services.
INFO: Waiting for cronjob hms-discovery in namespace services to be scheduled.
INFO: Waiting for ComputeModules in liquid-cooled cabinets to be powered on.
INFO: All ComputeModules have reached powered on state.
Looking at the cronjob and jobs afterwards:
ncn-m001:~ # kubectl get jobs -n services -l cronjob-name=hms-discovery
NAME                     COMPLETIONS   DURATION   AGE
...
hms-discovery-28697652   1/1           81s        10m
hms-discovery-28697655   0/1           9m         9m
hms-discovery-28697661   0/1           14s        14s
ncn-m001:~ # kubectl get cronjobs -n services hms-discovery
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   False     1        3m39s           8h
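In the listings above, hms-discovery-28697661 is the only Job created after the CronJob was resumed, since the CronJob was still suspended when the first listing was taken. As a rough illustration of that kind of check, the sketch below filters Jobs by creation time against the moment the CronJob was resumed. It is not the actual sat code: JobInfo and jobs_created_since are hypothetical names, and the timestamps are arbitrary values chosen to roughly mirror the ages shown above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class JobInfo:
    """Hypothetical stand-in for the Job metadata a waiter would inspect."""
    name: str
    creation_time: datetime


def jobs_created_since(jobs, start_time):
    """Return the names of jobs created at or after start_time."""
    return [job.name for job in jobs if job.creation_time >= start_time]


# Arbitrary reference time; only the relative offsets matter here.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Assumed moment at which suspend was set to False on the CronJob.
resumed_at = now - timedelta(seconds=20)

jobs = [
    JobInfo("hms-discovery-28697652", now - timedelta(minutes=10)),
    JobInfo("hms-discovery-28697655", now - timedelta(minutes=9)),
    JobInfo("hms-discovery-28697661", now - timedelta(seconds=14)),
]

# Only the job created after the resume time counts as newly scheduled.
new_jobs = jobs_created_since(jobs, resumed_at)
print(new_jobs)
```

Recording the resume time before un-suspending the CronJob, rather than when the check starts, is what keeps a quickly scheduled first Job from being overlooked.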
Summary and Scope
Remove the step that automatically checks for and re-creates stuck Kubernetes CronJobs from the platform-services stage of sat bootsys boot. This should not be necessary anymore starting in Kubernetes 1.21, which made a new CronJobControllerV2 the default.
In addition, improve the logic of the HMSDiscoveryScheduledWaiter, so that it will more reliably detect when an hms-discovery Job has been scheduled for the CronJob. Pass in an explicit start_time, so that we can look for any jobs created for the CronJob after it is re-enabled. This ensures we won't miss the first one, which could be scheduled between when we set suspend=False on the CronJob and when we create the HMSDiscoveryScheduledWaiter.
Issues and Related PRs
Testing
Tested on:
Test description:
Tested on rocket as follows:
hms-discovery CronJob
sat bootsys boot --stage cabinet-power
Risks and Mitigations
Should be pretty low-risk. This removes functionality that has caused more problems than it solved. It can always be executed manually as documented, if needed.
Pull Request Checklist