BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)

Add Toleration for Openshift Compliance scan pods to run on AI-dedicated nodes #4990

Open wmhutchison opened 1 month ago

wmhutchison commented 1 month ago

Describe the issue AlertManager on SILVER reported three pods in the openshift-compliance namespace stuck in Pending because they do not tolerate the taints on the nodes they were scheduled to. The nodes in question are the ones currently set aside for AI testing and therefore carry a custom taint that keeps regular workloads from being scheduled on them.

This ticket will track the changes to the CCM manifest needed to allow the openshift-compliance pods to start up on the affected nodes.
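For reference, the scheduling mismatch looks roughly like the sketch below. The taint key/value (`dedicated: ai`) is a placeholder, not the actual taint applied to the AI-dedicated nodes:

```yaml
# Conceptual sketch only; "dedicated: ai" is a stand-in for the real taint.
# Taint on an AI-dedicated node, which keeps untolerating pods off it:
spec:
  taints:
    - key: "dedicated"
      value: "ai"
      effect: "NoSchedule"
---
# Matching toleration the compliance scan pods would need in order to schedule there:
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "ai"
    effect: "NoSchedule"
```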

Additional context The fact that this issue was not noticed until now indicates how rarely the pods in the openshift-compliance namespace are restarted; otherwise it would have been brought to our attention much sooner.

How does this benefit the users of our platform?

Definition of done

wmhutchison commented 1 month ago

https://github.com/bcgov-c/platform-gitops-gen/blob/master/roles/compliance-op/templates/scansetting.default.yaml.j2#L17 already exists in CCM and already carries a toleration that allows scan pods to run on master nodes. We just need to extend it with a toleration for the AI nodes.
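A sketch of what the extended toleration list in that template could look like. The master toleration is shown as described above; the AI taint key/value is again a placeholder and the exact field names should be checked against the template and the Compliance Operator's ScanSetting API:

```yaml
# Sketch of the ScanSetting toleration list; verify against the CCM template.
scanTolerations:
  # Existing toleration letting scan pods run on master nodes (per the template).
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  # Proposed addition: tolerate the AI-dedicated nodes' taint.
  # Placeholder key/value; substitute the taint actually set on those nodes.
  - key: "dedicated"
    operator: "Equal"
    value: "ai"
    effect: "NoSchedule"
```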

wmhutchison commented 1 month ago

Attempted a proof-of-concept fix on KLAB by pausing CCM and adjusting the ScanSetting resource named default, which the vendor docs state is the top-level control for adding new tolerations. Was not able to force new pods to be created that inherit the new toleration.
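For context on what was edited during the PoC, a sketch of the default ScanSetting with the extra toleration added. The apiVersion and field placement reflect my reading of the Compliance Operator API (scanTolerations sits at the top level of the resource, not under spec) and should be verified against the cluster; the AI taint key is still a placeholder:

```yaml
# Sketch of the live resource edited on KLAB; verify apiVersion and fields in-cluster.
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSetting
metadata:
  name: default
  namespace: openshift-compliance
scanTolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  # Placeholder key/value; substitute the AI nodes' actual taint.
  - key: "dedicated"
    operator: "Equal"
    value: "ai"
    effect: "NoSchedule"
```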

Revisited the situation in SILVER. The issue occurred there not because every node runs a pod from this namespace, but because a new scan workload happened to be scheduled onto one of the AI-dedicated nodes.

Given this, and since the current AI nodes are expected to be returned to general use in the near future, this ticket will be shelved in the Backlog for now and possibly revisited later when more time is available, assuming the team agrees it is still worth investing in.