BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

kube-controller-manager container restarting a lot #2730

Closed tmorik closed 1 year ago

tmorik commented 2 years ago

Describe the issue
On the Silver cluster, the kube-controller-manager container is restarting too many times.

Additional context

Definition of done

wmhutchison commented 2 years ago

A vendor case with Red Hat was opened with the usual must-gather data package. From there, Matt Robson noted some leftover CRDs related to Aqua that were still present, which can cause minor performance issues when the cluster attempts to process CRDs whose associated namespace/service no longer exists.

oc delete crd aquascanneraccounts.mamoa.devops.gov.bc.ca aquascanneraccounts.mamoa.devops.gov.bc.ca.devops.gov.bc.ca
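
A quick way to confirm whether any stale Aqua CRDs remain after the cleanup (a minimal sketch; the grep pattern is an assumption based on the CRD names above):

# list any CustomResourceDefinitions still referencing Aqua
oc get crd | grep -i aqua
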
wmhutchison commented 2 years ago

The above non-impactful change has improved the stability of the mentioned containers in the controller pods, but I am still seeing some container churn. Matt has provided a stand-alone pod-controller binary we can run from our UTIL server that will monitor the API for pod churn. The following are the instructions to obtain and run it. It leverages the existing kubeconfig file for access.

git clone https://github.com/aojea/pod-controller.git
cd pod-controller
chmod +x pod-controller

Start it:
./pod-controller
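
If the cloned repository only contains Go source rather than a prebuilt binary, it can be built locally first (a sketch, assuming a Go toolchain is available on the UTIL server):

# from inside the pod-controller directory, build the binary from source
go build -o pod-controller .
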
wmhutchison commented 2 years ago

Directory and the specific test that was run for an extended period:

[root@mcs-silver-util pod-controller]# pwd
/root/pod-controller
[root@mcs-silver-util pod-controller]# ./pod-controller -v=8 |& tee /data/2022june03-pod-controller-github-log8.txt

No "trace" events involving the API were found, aside from the initial sync done at the start.

# grep -i trace 2022june03-pod-controller-github-log8.txt
I0603 12:34:10.148536   24328 trace.go:205] Trace[1298498081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167 (03-Jun-2022 12:33:55.770) (total time: 14377ms):
Trace[1298498081]: ---"Objects listed" 14295ms (12:34:00.066)
Trace[1298498081]: ---"Resource version extracted" 0ms (12:34:00.066)
Trace[1298498081]: ---"Objects extracted" 32ms (12:34:00.098)
Trace[1298498081]: ---"SyncWith done" 49ms (12:34:00.148)
Trace[1298498081]: ---"Resource version updated" 0ms (12:34:00.148)
Trace[1298498081]: [14.377508901s] [14.377508901s] END
wmhutchison commented 2 years ago

Per request from Matt, re-ran this test both on SILVER-UTIL and inside MCS-SILVER-MASTER-01 with a kubeconfig adjusted to point to localhost instead of api-int. Uploaded the results of both runs to the vendor case. The ball is in Red Hat's court as to what they want us to do next on this matter; it remains Blocked.
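
For reference, the kubeconfig adjustment amounts to pointing the cluster's server URL at the local endpoint instead of api-int (a minimal sketch against a copy of the kubeconfig; the cluster name, file path, and port are assumptions based on a typical OpenShift 4 layout):

# edit a copy of the kubeconfig so tests hit the local API server rather than the api-int VIP
cp /root/.kube/config /root/.kube/config-localhost
oc --kubeconfig=/root/.kube/config-localhost config set-cluster silver --server=https://localhost:6443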

wmhutchison commented 2 years ago

The last from Matt before he went on vacation was a request that we make a change to the controller-managers in either GOLD or GOLDDR so they do not consume API services via the F5 VIP but instead use localhost. Will pursue this in GOLD since it has just been upgraded; GOLDDR will be upgraded next.
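
Before and after that change, the API endpoint the controller-manager actually targets can be confirmed from its kubeconfig (a sketch; the configmap name is an assumption about how the operator stores it, so list the configmaps in the namespace if it differs):

# show which API server URL the controller-manager kubeconfig points at (F5 VIP vs localhost)
oc -n openshift-kube-controller-manager get cm controller-manager-kubeconfig -o yaml | grep server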

The head-stumper here is that GOLDDR is showing similar symptoms, yet from a change perspective it has not yet had its F5 hardware refreshed in the same fashion as SILVER and GOLD. Still, this is a logical next step for troubleshooting this matter.

wmhutchison commented 2 years ago

Current state of GOLD post-upgrade to 4.9:

$ oc -n openshift-kube-controller-manager get pods
NAME                                             READY   STATUS      RESTARTS      AGE
kube-controller-manager-mcs-gold-master-01.dmz   4/4     Running     6 (26h ago)   30h
kube-controller-manager-mcs-gold-master-02.dmz   4/4     Running     5 (28h ago)   30h
kube-controller-manager-mcs-gold-master-03.dmz   4/4     Running     5             30h
revision-pruner-31-mcs-gold-master-01.dmz        0/1     Completed   0             29h
revision-pruner-31-mcs-gold-master-02.dmz        0/1     Completed   0             28h
revision-pruner-31-mcs-gold-master-03.dmz        0/1     Completed   0             29h
wmhutchison commented 2 years ago

Removing Tatsuya and adding Tim to this ticket. Tats is going on vacation, he's welcome to rejoin if this ticket is still in play when he's back. :)

wmhutchison commented 2 years ago

Have scheduled a knowledge-sharing session for this work with the team this Friday, so we can all go over how to move an OpenShift cluster off the F5 API VIP and instead consume API services via localhost, so that the controller pods talk to their respective API pod directly. This is useful for troubleshooting potential network problems or other F5-related issues.
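
A quick sanity check from a master node that the local API endpoint is reachable before pointing the controller pods at it (a sketch; the port and path are standard Kubernetes defaults rather than values confirmed from these clusters):

# confirm the local kube-apiserver answers on the localhost endpoint
curl -k https://localhost:6443/readyz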

wmhutchison commented 2 years ago

RFC CHG0039431 has now been created to promote the changes requested by Matt Robson on the GOLD cluster, tentatively penciled in for July 20th, 2022. The RFC is awaiting business owner approval and final comms to the alerts channel in Rocket Chat.

wmhutchison commented 2 years ago

The above-mentioned RFC has been applied in GOLD. Since this cluster had not been showing much pod churn before the change, we will want to leave GOLD as-is for at least one week, if not longer.

In the meantime, while we monitor GOLD for new pod churn post-change, the next event of interest will be early morning on Wednesday, July 27th 2022, when F5 stakeholders will be performing a hardware firmware update that will affect the GOLD and SILVER vCMPs. SILVER pods are still churning as well; if that changes after next Wednesday, it will be evidence that the cause was F5-related and was resolved by the mentioned change.
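
Monitoring for new churn amounts to watching restart counts on the controller pods (a small sketch; the label selector matches the one used later in this thread):

# watch restart counts on the kube-controller-manager pods; a rising RESTARTS count indicates continued churn
oc -n openshift-kube-controller-manager get pods -l app=kube-controller-manager -w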

wmhutchison commented 2 years ago

Still on hold: the Wednesday firmware upgrade RFC was executed but had to be rolled back due to issues during the upgrade.

In the meantime, the GOLD cluster seems to have stabilized after we made the API consumption change, though it was also noted that SILVER, while still flapping, isn't doing so as often as before.

Will wait for the next attempt to upgrade the F5s involved to first see how that affects SILVER, and then, after rolling the change back, GOLD as well. No announcements yet as to when a new firmware upgrade attempt will be made.

wmhutchison commented 2 years ago

This past week saw F5 stakeholders attempt a "take-2" upgrade of the Kamloops F5 hardware that also involved GOLD and SILVER. It was not successful; they had to roll back to the original OS version.

Will schedule an RFC for next week to revert the GOLD cluster's API configuration for the controller-managers back to using the F5 API VIP, since we have proven that the work-around on GOLD resulted in improved stability. This, coupled with GOLDDR now showing similar behavior after migrating to the new F5 hardware, solidifies the fact that the new F5 hardware is a definite factor in all of this.

wmhutchison commented 2 years ago

Added a formal "Blocked by" section explaining why this ticket remains blocked.

StevenBarre commented 2 years ago

Kamloops F5 firmware was updated on Nov 2nd and the vCMP moved to the new hardware. So far things seem stable with the kube-controller-manager. Keeping this in Blocked until the vCMPs are also updated to new versions.

wmhutchison commented 2 years ago

Will refresh the Definition of Done, as this has turned out not to be a Red Hat problem/solution, but rather one tied to the F5 hardware refresh and updates which started back in May 2022.

wmhutchison commented 1 year ago

SILVER/openshift-config ~ $ oc -n openshift-kube-controller-manager get pods -l app=kube-controller-manager
NAME                                               READY   STATUS    RESTARTS       AGE
kube-controller-manager-mcs-silver-master-01.dmz   4/4     Running   6 (34d ago)    43d
kube-controller-manager-mcs-silver-master-02.dmz   4/4     Running   11 (19d ago)   43d
kube-controller-manager-mcs-silver-master-03.dmz   4/4     Running   12 (17d ago)   43d

The above shows these pods have now been stable for 2-3 weeks, which puts us just past the F5 maintenance RFCs. Closing this and the corresponding Red Hat ticket as resolved.