BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team. (This includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3.)
Apache License 2.0

Research monitoring options for improper KNP policies #1321

Closed mitovskaol closed 2 years ago

mitovskaol commented 3 years ago

Describe the issue When a KNP uses a pod selector to allow connections to the pods, and when there is a large number of individual pods in a single namespace that match the specified label, it can lead to a large number of OVS flow rules being generated, which may slow down the pods. This is a known issue and is described in this Red Hat article.

This issue occurred in the Silver cluster in early June and manifested itself as high CPU/RAM consumption in SDN pods and intermittent connectivity issues between pods in some namespaces. The temporary workaround was to restart SDN pods twice a day. The team responsible for creating the issue was contacted and adjusted their cron jobs to properly clean up temporarily created pods after an ETL job run.

However, we need to have monitoring set up in Silver to proactively watch for these issues and catch them early.

BLOCKED UNTIL:

Additional context Red Hat Enhancement request ticket: https://issues.redhat.com/browse/SDN-1960 Red Hat Original case: https://access.redhat.com/support/cases/#/case/02954595

Definition of done

matthieu-foucault commented 3 years ago

Our assumption when writing network policies (and previously Aporeto's network security policies) was that the more precise the policy, the better, so we decided to create policies that rely on a label, e.g.

spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cas-ggircs
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: cas-ggircs

Those are the potentially problematic policies, as they will generate a rapidly growing number of OVS rules (one per combination of matching pods) according to the Red Hat article. The risk created by this combinatorial complexity materialized when we accumulated a large number (~300) of pods, most of them completed job pods.

Finding the policies that use matchLabels, and finding how many pods match the label, would be one way to detect potential issues.
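
As a rough starting point, something along these lines could surface the policies in question (a sketch only, assuming oc and jq are available; policies using matchExpressions or multi-label selectors would need extra handling):

# List NetworkPolicies whose ingress rules reference a podSelector
oc get networkpolicy --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.ingress[]?.from[]?.podSelector != null)
      | "\(.metadata.namespace)/\(.metadata.name)"' \
  | sort -u

# Then, for a given policy, count how many pods currently match its label, e.g.:
oc get pods -n <namespace> -l app.kubernetes.io/name=<value> --no-headers | wc -l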

Finding the namespaces with the most completed pods (that was probably ours :sweat_smile:) and explaining to teams that completed pods do have an impact on the whole platform's performance (we had assumed that the impact would be negligible given that the pods were not running) would be another suggestion.
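
For that second suggestion, a quick way to rank namespaces by completed pods might be something like this (a sketch; status.phase=Succeeded only catches successfully completed pods, not failed ones):

# Count successfully completed pods per namespace, highest first
oc get pods --all-namespaces --field-selector=status.phase=Succeeded --no-headers \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head -20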

The steps our team will take to remediate this are: 1) in the short term (within the next 3 weeks), ensure that completed pods are deleted so that we keep a low number of pods (running or completed); 2) in the longer term, revise our network policies. The Red Hat documentation suggests relying on namespaces to solve this issue, which in our case is almost equivalent to the current setup (one of our namespaces hosts two applications, but otherwise we have one application per namespace).
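
For step 1, assuming the completed pods come from Jobs/CronJobs, a couple of hedged options (placeholders in angle brackets; the TTL field requires a cluster version where the TTL-after-finished feature is available):

# One-off cleanup of successfully completed pods in a namespace
oc delete pods -n <namespace> --field-selector=status.phase=Succeeded

# Or let the Job TTL controller delete finished Jobs (and their pods) automatically,
# e.g. one hour after completion
oc patch cronjob <cronjob-name> -n <namespace> --type merge \
  -p '{"spec":{"jobTemplate":{"spec":{"ttlSecondsAfterFinished":3600}}}}'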

mrobson commented 3 years ago

There is nothing "simple" to monitor for this...

A few thoughts I have, which I need to do some more investigation on:

a) Find the current total number of OVS flows and determine the number of flows per project

b) Try to find a relatively simple way to monitor the number of OVS flows periodically

c) Find and evaluate current NPs using pod selectors and determine whether they may lead to issues (a rough flow-count sketch for items a and b follows below)
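
For a) and b), a minimal counting sketch could look like this (assuming the app=ovs label used later in this thread; on newer OCP releases without a dedicated OVS pod, the same command would target the SDN pods instead):

# Rough count of OVS flows on br0 for each OVS pod (one per node)
for P in $(oc get pods -n openshift-sdn -l app=ovs -o custom-columns=POD:.metadata.name --no-headers); do
  echo -n "$P: "
  oc exec -n openshift-sdn "$P" -- ovs-ofctl -O OpenFlow13 dump-flows br0 | wc -l
done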

mitovskaol commented 3 years ago

@mrobson @wmhutchison It is great to know that the current issue with the SDN pods has been fixed, but let's continue working on setting up monitoring to keep an eye on the number of OVS flows so that we can be proactive in the future and fix the problem before it gets out of hand and starts impacting the Platform performance.

mrobson commented 3 years ago

We have opened https://issues.redhat.com/browse/SDN-1960 on our side to make some logging and debugging improvements so it's much easier to detect this type of problem in the future.

mitovskaol commented 3 years ago

This is excellent! Thank you @mrobson. I will keep this ticket open until we get the recommendations back from Red Hat and will capture them here.

wmhutchison commented 2 years ago

Per notes from case https://access.redhat.com/support/cases/#/case/02954595

Commands:

for P in `oc get pods -n openshift-sdn -l app=ovs -o custom-columns=POD:.metadata.name --no-headers`; do echo "$P"; oc exec -n openshift-sdn "$P" -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -o 'reg0=[^,]*' | cut -d = -f 2 | sort | uniq -c | sort -d > "$P"-reg0-flow_count.out; done

for P in `oc get pods -n openshift-sdn -l app=ovs -o custom-columns=POD:.metadata.name --no-headers`; do echo "$P"; oc exec -n openshift-sdn "$P" -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -o 'reg1=[^,]*' | cut -d = -f 2 | sort | uniq -c | sort -d > "$P"-reg1-flow_count.out; done
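
For reading the resulting counts, assuming the reg0/reg1 values correspond to namespace VNIDs (which is how the counts are being interpreted here), a hex value of interest can be mapped back to a namespace via the NetNamespace objects; the 0xabc123 value below is just a placeholder:

# NetNamespace lists each namespace's NETID (decimal); reg values in the flow dump are hex
oc get netnamespaces -o custom-columns=NAME:.metadata.name,NETID:.netid --no-headers > netids.txt
grep -w "$(printf '%d' 0xabc123)" netids.txt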

Matt

mitovskaol commented 2 years ago

Thank you for the commands @mrobson @wmhutchison. Can we turn these into a cron job that will alert the Ops team when one of these problematic KNPs is encountered?

wmhutchison commented 2 years ago

Going to revisit this from the original perspective of the events in question as they happened. The queries to obtain reg0/reg1 namespace counts are not, by themselves, going to work as a viable monitoring method for this scenario, since we have no idea whether a high count relates to a single NetworkPolicy object generating a lot of OVS flows on its own, or to the summation of a larger number of NetworkPolicy objects adding up to a similar count.

This is why, from this side of things, Red Hat's proposed enhancement to add logging for the original use case remains the most viable option. For the time being, we'll go back to monitoring the OVS pods' CPU and memory consumption, since those creep up (particularly memory) in the scenario we're trying to capture.

wmhutchison commented 2 years ago

Some quick and dirty PromQL that'll give us the basics for SDN pod memory.

sum(container_memory_working_set_bytes{namespace='openshift-sdn',pod=~'sdn-.....'}) BY (pod, namespace)

wmhutchison commented 2 years ago

Copy/pasting an existing AlertManager PromQL query that's in use and may prove helpful for setting up what I want: alert if any target pod exceeds a defined threshold of RAM (we do not want to be alerted on all affected pods, just when any of them go over the limit).

sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5

wmhutchison commented 2 years ago

Got it. This will work as expected; the example below lists all SDN pods using 5 GB or more of memory.

sum(container_memory_working_set_bytes{namespace='openshift-sdn',pod=~'sdn-.....'}) BY (pod, namespace) > (5*1024*1024*1024)

wmhutchison commented 2 years ago

Link to the full two-week Prometheus graph for SDN pods going over 4 GB RAM. We get regular hits at 3 GB. Will need to find out if 4 GB is high enough or if we should make it 5 GB.

https://prometheus-k8s-openshift-monitoring.apps.silver.devops.gov.bc.ca/graph?g0.range_input=2w&g0.expr=sum(container_memory_working_set_bytes%7Bnamespace%3D%27openshift-sdn%27%2Cpod%3D~%27sdn-.....%27%7D)%20BY%20(pod%2C%20namespace)%20%3E%20(4*1024*1024*1024)&g0.tab=0

wmhutchison commented 2 years ago

Starting to wade through the existing Nagios monitoring setup done via platform-tools to see what's involved in creating the monitor for this particular event as well. The premise is that if the query returns any results, we alert (WARN/CRITICAL depending on the RAM usage threshold) and act on it; no results returned means all is good.
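
The general shape of such a check might look roughly like the sketch below. This is not the platform-tools implementation: the endpoint, token handling, and single CRITICAL threshold are placeholders (a WARN tier at a lower threshold could be added the same way).

#!/usr/bin/env bash
# Nagios-style check: query Thanos/Prometheus and alert if any SDN pod exceeds the RAM threshold.
# Exit codes follow the Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
PROM_URL="https://thanos.example.invalid"   # placeholder endpoint
QUERY="sum(container_memory_working_set_bytes{namespace='openshift-sdn',pod=~'sdn-.....'}) BY (pod, namespace) > (4*1024*1024*1024)"

RESULT=$(curl -sfG -H "Authorization: Bearer ${PROM_TOKEN}" \
  --data-urlencode "query=${QUERY}" "${PROM_URL}/api/v1/query") || { echo "UNKNOWN: query failed"; exit 3; }

COUNT=$(echo "${RESULT}" | jq '.data.result | length')
if [ "${COUNT}" -gt 0 ]; then
  echo "CRITICAL: ${COUNT} SDN pod(s) over the memory threshold"
  exit 2
fi
echo "OK: no SDN pods over the memory threshold"
exit 0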

wmhutchison commented 2 years ago

Should be able to resume work on this in Sprint 35.

wmhutchison commented 2 years ago

Got some cycles now to put into this task, but it will likely carry into Sprint 36 based on remaining time available to work on it and other initiatives needing attention.

Revisiting the current Nagios monitoring setup so a new branch/PR can be created to test this out. This may also be an opportunity to learn/test Steven's recent work allowing podman testing in our IDIR_A account spaces on the UTIL servers.

wmhutchison commented 2 years ago

Have confirmed the previous PromQL query (list SDN pods consuming 4 GB or more of memory) still suffices as a threshold for Nagios alerting. There were no hits for the full span of Prometheus data retention. Will proceed with building out the Nagios monitor.

wmhutchison commented 2 years ago

https://github.com/bcgov-c/platform-tools/blob/ocp4-base/ocp4/nagios/runner/project/roles/nagios/tasks/monitoring.yaml contains all of the special stuff for adding a new monitor as well as leveraging Thanos for grabbing the data we need.

No problems with tacking on the additional query, but we will likely need to do some testing/experiments around how data is returned for the new query: run some tests first to grab the data and re-parse it into something that Nagios can use.

wmhutchison commented 2 years ago

Have finished testing with a proof-of-concept Thanos playbook so the query output can be inspected and we can work out how to parse it for presentation to Nagios. Will now carve out a new branch/PR and start work on updating the Nagios image/pod to add the new monitor. Will test in KLAB or CLAB first as per usual.

wmhutchison commented 2 years ago

https://github.com/bcgov-c/platform-tools/pull/76 created now with initial updates. Next step is to test it out in a LAB environment, will likely need to mess around with threshold settings first to trigger it in LAB.

wmhutchison commented 2 years ago

CLAB was the guinea pig chosen for testing the new Nagios monitor. Had to adjust thresholds since LAB SDN pod resource consumption is a lot lower than in SILVER.

Will revisit PR and make final threshold adjustments before promoting the PR for review/approval.

wmhutchison commented 2 years ago

While it will be close, am fairly confident this ticket will be finished off by the conclusion of Sprint 35. If for some reason it carries over to Sprint 36, it likely won't be for long.

wmhutchison commented 2 years ago

New issue to resolve: while reviewing the documentation on troubleshooting NetworkPolicy issues, it was discovered that some changes under the hood in OCP 4.7 have made the previous troubleshooting commands no longer valid.

The data still exists; there is just no longer a dedicated OVS pod, only the SDN pod, which is where the OVS data now resides. Will review the previous commands and adjust their syntax as needed.
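
For reference, the earlier flow-count commands would presumably just need the pod selector swapped, along these lines (an untested sketch, assuming the SDN pods carry the app=sdn label, have a container named sdn, and still have ovs-ofctl available):

# Same reg0 count as before, but targeting the SDN pods instead of the removed OVS pods
for P in $(oc get pods -n openshift-sdn -l app=sdn -o custom-columns=POD:.metadata.name --no-headers); do
  echo "$P"
  oc exec -n openshift-sdn "$P" -c sdn -- ovs-ofctl -O OpenFlow13 dump-flows br0 \
    | grep -o 'reg0=[^,]*' | cut -d = -f 2 | sort | uniq -c | sort -d > "$P"-reg0-flow_count.out
done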

wmhutchison commented 2 years ago

The noted issue will definitely cause this ticket to be moved into Sprint 36 before fully resolved, but shouldn't take up a ton of time.

wmhutchison commented 2 years ago

https://github.com/bcgov-c/advsol-docs/pull/172 created for documenting the new monitor and how to service it while on-call

wmhutchison commented 2 years ago

Blocked for now until the PRs for the new Nagios monitor and the matching docs for the on-call team to service the monitor are both reviewed and approved. Once that's done and the code/docs are merged, we can push out the new Nagios monitor across all OCP clusters.

wmhutchison commented 2 years ago

Blockers addressed, moving back into In Progress. This ticket will be completed today at the latest.

wmhutchison commented 2 years ago

Confirmed in Nagios that Steven Barre has pushed out the new monitor to all clusters. Closing off this ticket as complete.