tmorik closed this issue 1 year ago
A vendor case with Red Hat was opened with the usual must-gather data package. From there, Matt Robson noted some Aqua-related CRDs that were still present, which could cause minor performance issues if the cluster is attempting to process CRDs whose associated namespace/service no longer exists.
oc delete crd aquascanneraccounts.mamoa.devops.gov.bc.ca aquascanneraccounts.mamoa.devops.gov.bc.ca.devops.gov.bc.ca
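For reference, a quick way to confirm whether any Aqua-related CRDs or orphaned custom resources remain after this cleanup (standard oc commands, not part of the vendor-provided steps; the resource name is taken from the CRD above):
# list any remaining Aqua-related CRDs
oc get crd | grep -i aqua
# if a CRD is still present, check whether any custom resources of that type still exist before deleting it
oc get aquascanneraccounts --all-namespaces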
The above non-impactful change has improved the stability of the mentioned containers in the controller pods, but we are still seeing some container churn. Matt has provided a stand-alone pod-controller binary we can run from our UTIL server that will monitor the API for pod churn. The instructions to obtain and run it follow; it uses the existing kubeconfig file for access (see the note after them on pointing it at a specific kubeconfig).
git clone https://github.com/aojea/pod-controller.git
cd pod-controller
chmod +x pod-controller
Start it:
./pod-controller
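If the default kubeconfig on the UTIL server isn't the one you want to test against, the tool is client-go based, so pointing it at a specific kubeconfig via the standard KUBECONFIG environment variable should work (an assumption about the tool, not something Matt specified; the path is illustrative):
# select an explicit kubeconfig before starting the controller
export KUBECONFIG=/root/.kube/config
./pod-controller -v=8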
Directory and the specific test that was run for an extended period:
[root@mcs-silver-util pod-controller]# pwd
/root/pod-controller
[root@mcs-silver-util pod-controller]# ./pod-controller -v=8 |& tee /data/2022june03-pod-controller-github-log8.txt
No "trace" events involving the API found minus the initial sync done at the start.
# grep -i trace 2022june03-pod-controller-github-log8.txt
I0603 12:34:10.148536 24328 trace.go:205] Trace[1298498081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167 (03-Jun-2022 12:33:55.770) (total time: 14377ms):
Trace[1298498081]: ---"Objects listed" 14295ms (12:34:00.066)
Trace[1298498081]: ---"Resource version extracted" 0ms (12:34:00.066)
Trace[1298498081]: ---"Objects extracted" 32ms (12:34:00.098)
Trace[1298498081]: ---"SyncWith done" 49ms (12:34:00.148)
Trace[1298498081]: ---"Resource version updated" 0ms (12:34:00.148)
Trace[1298498081]: [14.377508901s] [14.377508901s] END
Per Matt's request, re-ran this test both on SILVER-UTIL and inside MCS-SILVER-MASTER-01 with a kubeconfig adjusted to point to localhost instead of api-int. Uploaded the results of both runs to the vendor case. The ball is in Red Hat's court as to what they want us to do next on this matter; it remains Blocked.
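For the record, the kubeconfig adjustment on the master amounts to copying an existing kubeconfig and rewriting its server URL so it targets the local API instance instead of the api-int VIP. A minimal sketch, assuming the usual default paths and API port (treat both as assumptions):
# copy the existing kubeconfig and point it at the local API server instead of the api-int VIP
cp /root/.kube/config /root/localhost-kubeconfig
sed -i 's|server: https://api-int.*|server: https://localhost:6443|' /root/localhost-kubeconfig
export KUBECONFIG=/root/localhost-kubeconfig
./pod-controller -v=8 |& tee /data/pod-controller-localhost-run.txt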
The last request from Matt before he went on vacation was that we make a change to the controller-managers in either GOLD or GOLDDR so that they do not consume API services via the F5 VIP but instead use localhost. We will pursue this in GOLD since it has just been upgraded; GOLDDR will be upgraded next.
The head-scratcher here is that GOLDDR is showing similar symptoms, yet from a change perspective it has not yet had its F5 hardware refreshed in the same fashion as SILVER and GOLD. Still, this is a logical next step for troubleshooting this matter.
Current state of GOLD post-upgrade to 4.9 (a quick way to dig into the restart counts follows the listing):
$ oc -n openshift-kube-controller-manager get pods
NAME READY STATUS RESTARTS AGE
kube-controller-manager-mcs-gold-master-01.dmz 4/4 Running 6 (26h ago) 30h
kube-controller-manager-mcs-gold-master-02.dmz 4/4 Running 5 (28h ago) 30h
kube-controller-manager-mcs-gold-master-03.dmz 4/4 Running 5 30h
revision-pruner-31-mcs-gold-master-01.dmz 0/1 Completed 0 29h
revision-pruner-31-mcs-gold-master-02.dmz 0/1 Completed 0 28h
revision-pruner-31-mcs-gold-master-03.dmz 0/1 Completed 0 29h
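To dig into why a given controller-manager container restarted, rather than just counting restarts, the usual oc commands are enough; the pod name below is just one from the list above:
# show per-container restart counts for one of the pods listed above
oc -n openshift-kube-controller-manager get pod kube-controller-manager-mcs-gold-master-01.dmz -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
# show the last termination state (exit code / reason) for each container
oc -n openshift-kube-controller-manager describe pod kube-controller-manager-mcs-gold-master-01.dmz | grep -A 5 'Last State'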
Removing Tatsuya and adding Tim to this ticket. Tats is going on vacation; he's welcome to rejoin if this ticket is still in play when he's back. :)
Have scheduled a knowledge-sharing session for this work this Friday with the team, so we can all go over how to move an OpenShift cluster off of the F5 API VIP and instead consume API services via localhost, so the controller pods talk to their respective API pod directly. This is useful for troubleshooting issues involving potential network problems or other F5-related issues.
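One way to confirm which endpoint the controller pods are configured to use, before and after such a change, is to pull the kubeconfig rendered for them and look at its server line. The configmap name below is what we see on our clusters but may vary by OpenShift version, so treat it as an assumption:
# find the rendered kubeconfig configmap(s) for the controller-manager
oc -n openshift-kube-controller-manager get configmaps | grep -i kubeconfig
# inspect the API endpoint the controller-manager is configured to talk to (configmap name may differ)
oc -n openshift-kube-controller-manager get configmap controller-manager-kubeconfig -o yaml | grep 'server:'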
RFC CHG0039431 has now been created to promote the changes requested by Matt Robson on the GOLD cluster, tentatively penciled in for July 20th 2022. The RFC is awaiting business owner approval and final comms to the alerts channel in Rocket Chat.
The above-mentioned RFC has been applied in GOLD. Since this cluster had not been showing much pod churn before the change, we will want to leave GOLD as-is for at least one week, if not longer.
In the meantime, while we monitor GOLD for new pod churn post-change, the next event of interest will be early morning on Wednesday July 27th 2022, when F5 stakeholders will be performing a firmware update that will affect the GOLD and SILVER vCMPs. SILVER pods are still churning as well, and if that changes after next Wednesday, it will be evidence that the cause was F5-related and was resolved by the mentioned update.
Still on hold: the Wednesday firmware upgrade RFC was executed, but had to be rolled back due to issues during the upgrade.
In the meantime, the GOLD cluster seems to have stabilized after we made the API consumption change; it was also noted that SILVER, while still flapping, isn't doing so as often as before.
Will wait for the next attempt to upgrade the F5s involved, to first see how that affects SILVER and then, after rolling our change back, GOLD as well. No announcements yet as to when a new firmware upgrade attempt will be made.
This past week, F5 stakeholders attempted a "take-2" upgrade of the Kamloops F5 hardware that also involved GOLD and SILVER. It was not successful; they had to roll back to the original OS version.
Will schedule an RFC for next week to revert the GOLD cluster's API configuration for the master controllers back to using the F5 API VIP, since we have proven that the work-around on GOLD resulted in improved stability. That, coupled with GOLDDR now showing similar behavior after migrating to the new F5 hardware, solidifies that the new F5 hardware is a definite factor in all of this.
Added a formal "Blocked by" section to this ticket explaining why this ticket remains blocked.
Kamloops F5 firmware was updated on Nov 2nd and the vCMP moved to the new hardware. So far things seem stable with the kube-controller-manager. Keeping this blocked until the vCMPs are also updated to the new versions.
Will refresh the Definition of Done, as this has turned out not to be a Red Hat problem/solution, but rather one tied to the F5 hardware refresh and updates which started back in May 2022.
SILVER/openshift-config ~ $ oc -n openshift-kube-controller-manager get pods -l app=kube-controller-manager
NAME READY STATUS RESTARTS AGE
kube-controller-manager-mcs-silver-master-01.dmz 4/4 Running 6 (34d ago) 43d
kube-controller-manager-mcs-silver-master-02.dmz 4/4 Running 11 (19d ago) 43d
kube-controller-manager-mcs-silver-master-03.dmz 4/4 Running 12 (17d ago) 43d
The above shows these pods have now been stable for 2-3 weeks, which puts us just past the F5 maintenance RFCs. Closing this and the corresponding Red Hat ticket as resolved.
Describe the issue
On the SILVER cluster, the kube-controller-manager container is restarting too many times.

Additional context

Definition of done