Cray-HPE / sat

System Admin Toolkit
https://cray-hpe.github.io/docs-sat/
MIT License

CRAYSAT-1878: Remove automatic cronjob recreation from `bootsys` #244

Closed (haasken-hpe closed this 3 months ago)

haasken-hpe commented 3 months ago

Summary and Scope

Remove the step that automatically checks for and re-creates stuck Kubernetes CronJobs from the platform-services stage of `sat bootsys boot`. This step should no longer be necessary as of Kubernetes 1.21, in which the new CronJobControllerV2 became the default.

In addition, improve the logic of the HMSDiscoveryScheduledWaiter, so that it will more reliably detect when an hms-discovery Job has been scheduled for the CronJob. Pass in an explicit start_time, so that we can look for any jobs created for the CronJob after it is re-enabled. This ensures we won't miss the first one, which could be scheduled between when we set suspend=False on the CronJob and when we create the HMSDiscoveryScheduledWaiter.
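The start_time idea can be illustrated with a minimal sketch. This is not the actual HMSDiscoveryScheduledWaiter code: the function names `jobs_created_since` and `resume_and_find_new_jobs`, the dict-based job representation, and the injected `resume_cronjob` / `list_jobs` callables are all hypothetical stand-ins for the Kubernetes API calls. The sketch assumes job creation timestamps have whole-second resolution, so start_time is truncated to the second before comparing.

```python
from datetime import datetime, timedelta, timezone


def jobs_created_since(jobs, start_time):
    """Return the names of jobs created at or after start_time.

    `jobs` is a list of dicts with "name" and "creation_timestamp" keys
    (a hypothetical stand-in for Kubernetes Job objects).
    """
    # Creation timestamps are assumed to have one-second resolution, so
    # drop sub-second precision to avoid missing a job created in the
    # same second as start_time.
    cutoff = start_time.replace(microsecond=0)
    return [job["name"] for job in jobs if job["creation_timestamp"] >= cutoff]


def resume_and_find_new_jobs(resume_cronjob, list_jobs, now=None):
    """Record start_time BEFORE resuming, then look for jobs created since.

    `resume_cronjob` and `list_jobs` are hypothetical callables standing in
    for "set suspend=False" and "list jobs for the CronJob".
    """
    start_time = now if now is not None else datetime.now(timezone.utc)
    resume_cronjob()
    return jobs_created_since(list_jobs(), start_time)


# Example with fixed timestamps: one job predates the resume, one follows it.
t0 = datetime(2024, 1, 1, 12, 0, 0, 250000, tzinfo=timezone.utc)
example_jobs = [
    {"name": "job-before-resume", "creation_timestamp": t0 - timedelta(minutes=9)},
    {"name": "job-after-resume", "creation_timestamp": t0.replace(microsecond=0)},
]
new_jobs = resume_and_find_new_jobs(lambda: None, lambda: example_jobs, now=t0)
```

Because start_time is captured before the CronJob is resumed, a job scheduled immediately after the resume still counts as new, regardless of when the waiter itself is constructed.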

Issues and Related PRs

Testing

Tested on:

Test description:

Tested on rocket as follows:

Risks and Mitigations

This change should be low-risk. It removes functionality that has caused more problems than it solved, and the CronJob re-creation can still be performed manually as documented, if needed.

Pull Request Checklist

haasken-hpe commented 3 months ago

Testing on rocket has been completed. The step that un-suspends the hms-discovery cronjob and waits for a job to be scheduled now completes very quickly in my testing thanks to the minor tweaks made here.

Before executing the sat bootsys boot --stage cabinet-power command:

ncn-m001:~ # kubectl get cronjobs -n services hms-discovery
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   True      0        8m5s            8h
ncn-m001:~ # kubectl get jobs -n services -l cronjob-name=hms-discovery
NAME                     COMPLETIONS   DURATION   AGE
...
hms-discovery-28697652   1/1           81s        9m45s
hms-discovery-28697655   0/1           8m15s      8m15s

Executing the command:

ncn-m001:~/haasken # sat bootsys boot --stage cabinet-power
INFO: Resuming cronjob hms-discovery in namespace services.
INFO: Waiting for cronjob hms-discovery in namespace services to be scheduled.
INFO: Waiting for ComputeModules in liquid-cooled cabinets to be powered on.
INFO: All ComputeModules have reached powered on state.

Looking at the cronjob and jobs afterwards:

ncn-m001:~ # kubectl get jobs -n services -l cronjob-name=hms-discovery
NAME                     COMPLETIONS   DURATION   AGE
...
hms-discovery-28697652   1/1           81s        10m
hms-discovery-28697655   0/1           9m         9m
hms-discovery-28697661   0/1           14s        14s
ncn-m001:~ # kubectl get cronjobs -n services hms-discovery
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   False     1        3m39s           8h