BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team. (This includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3.)
Apache License 2.0

Research monitoring options for improper KNP policies #1321

Closed mitovskaol closed 2 years ago

mitovskaol commented 3 years ago

Describe the issue When a KNP uses a pod selector to allow connections to the pods, and when there is a large number of individual pods in a single namespace that match the specified label, it can lead to a large number of OVS flow rules being generated, which may slow down the pods. This is a known issue and is described in this Red Hat article.

This issue occurred in the Silver cluster in early June and manifested itself as high CPU/RAM consumption in SDN pods and intermittent connectivity issues between pods in some namespaces. The temporary workaround was to restart SDN pods twice a day. The team responsible for creating the issue was contacted and adjusted their cron jobs to properly clean up temporarily created pods after an ETL job run.

However, we need to have monitoring set up in Silver to proactively watch for these issues and catch them early.

BLOCKED UNTIL:

Additional context Red Hat Enhancement request ticket: https://issues.redhat.com/browse/SDN-1960 Red Hat Original case: https://access.redhat.com/support/cases/#/case/02954595

Definition of done

matthieu-foucault commented 3 years ago

Our assumption when writing network policies (and previously Aporeto's network security policies) was that the more precise the policy, the better, so we decided to create policies that rely on a label, e.g.

spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cas-ggircs
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: cas-ggircs

Those are the potentially problematic policies, as they will generate a rapidly growing number of OVS rules (one per combination of matching pods) according to the Red Hat article. The risk created by this combinatorial complexity materialized when we accumulated a large number (~300) of pods, most of them completed job pods.

Finding the policies that use matchLabels, and finding how many pods match the label, would be one way to detect potential issues.
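
As a rough starting point, something along these lines could surface the policies in question (a sketch only, assuming oc and jq are available; policies using matchExpressions or multi-label selectors would need extra handling):

# List NetworkPolicies whose ingress rules reference a podSelector
oc get networkpolicy --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.ingress[]?.from[]?.podSelector != null)
      | "\(.metadata.namespace)/\(.metadata.name)"' \
  | sort -u

# Then, for a given policy, count how many pods currently match its label, e.g.:
oc get pods -n <namespace> -l app.kubernetes.io/name=<value> --no-headers | wc -l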

Finding the namespaces with the most completed pods (that was probably ours :sweat_smile:) and explaining to teams that completed pods do have an impact on the whole platform's performance (we had assumed that the impact would be negligible given that the pods were not running) would be another suggestion.
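
For that second suggestion, a quick way to rank namespaces by completed pods might be something like this (a sketch; status.phase=Succeeded only catches successfully completed pods, not failed ones):

# Count successfully completed pods per namespace, highest first
oc get pods --all-namespaces --field-selector=status.phase=Succeeded --no-headers \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head -20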

The steps our team will take to remediate this are: 1) in the short term (within the next 3 weeks), ensure that completed pods are deleted so that we keep a low number of pods (running or completed); 2) in the longer term, revise our network policies. The Red Hat documentation suggests relying on namespaces to solve this issue, which in our case is almost equivalent to the current setup (one of our namespaces hosts two applications, but otherwise we have one application per namespace).
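
For step 1, assuming the completed pods come from Jobs/CronJobs, a couple of hedged options (placeholders in angle brackets; the TTL field requires a cluster version where the TTL-after-finished feature is available):

# One-off cleanup of successfully completed pods in a namespace
oc delete pods -n <namespace> --field-selector=status.phase=Succeeded

# Or let the Job TTL controller delete finished Jobs (and their pods) automatically,
# e.g. one hour after completion
oc patch cronjob <cronjob-name> -n <namespace> --type merge \
  -p '{"spec":{"jobTemplate":{"spec":{"ttlSecondsAfterFinished":3600}}}}'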

mrobson commented 3 years ago

There is nothing "simple" to monitor for this...

A few thoughts I have, which I need to do some more investigation on:

a) Find the current total number of OVS flows and determine the number of flows per project

b) Try to find a relatively simple way to monitor the number of OVS flows periodically

c) Find and evaluate current NPs using pod selectors and determine whether they may lead to issues (a rough flow-count sketch for items a and b follows below)
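
For a) and b), a minimal counting sketch could look like this (assuming the app=ovs label used later in this thread; on newer OCP releases without a dedicated OVS pod, the same command would target the SDN pods instead):

# Rough count of OVS flows on br0 for each OVS pod (one per node)
for P in $(oc get pods -n openshift-sdn -l app=ovs -o custom-columns=POD:.metadata.name --no-headers); do
  echo -n "$P: "
  oc exec -n openshift-sdn "$P" -- ovs-ofctl -O OpenFlow13 dump-flows br0 | wc -l
done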

mitovskaol commented 3 years ago

@mrobson @wmhutchison It is great to know that the current issue with the SDN pods has been fixed, but let's continue working on setting up monitoring to keep an eye on the number of OVS flows so that we can be proactive in the future and fix the problem before it gets out of hand and starts impacting the Platform performance.

mrobson commented 3 years ago

We have opened https://issues.redhat.com/browse/SDN-1960 on our side to make some logging and debugging improvements so it's much easier to detect this type of problem in the future.

mitovskaol commented 3 years ago

This is excellent! Thank you @mrobson. I will keep this ticket open until we get the recommendations back from Red Hat and will capture them here.

wmhutchison commented 2 years ago

Per notes from case https://access.redhat.com/support/cases/#/case/02954595

Commands:

for P in `oc get pods -n openshift-sdn -l app=ovs -o custom-columns=POD:.metadata.name --no-headers`; do echo "$P"; oc exec -n openshift-sdn "$P" -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -o 'reg0=[^,]*' | cut -d = -f 2 | sort | uniq -c | sort -d > "$P"-reg0-flow_count.out; done

for P in `oc get pods -n openshift-sdn -l app=ovs -o custom-columns=POD:.metadata.name --no-headers`; do echo "$P"; oc exec -n openshift-sdn "$P" -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -o 'reg1=[^,]*' | cut -d = -f 2 | sort | uniq -c | sort -d > "$P"-reg1-flow_count.out; done
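
For reading the resulting counts, assuming the reg0/reg1 values correspond to namespace VNIDs (which is how the counts are being interpreted here), a hex value of interest can be mapped back to a namespace via the NetNamespace objects; the 0xabc123 value below is just a placeholder:

# NetNamespace lists each namespace's NETID (decimal); reg values in the flow dump are hex
oc get netnamespaces -o custom-columns=NAME:.metadata.name,NETID:.netid --no-headers > netids.txt
grep -w "$(printf '%d' 0xabc123)" netids.txt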

Matt

mitovskaol commented 2 years ago

Thank you for the commands @mrobson @wmhutchison. Can we turn these into a cron job that will alert the Ops team when one of these problematic KNPs is encountered?

wmhutchison commented 2 years ago

Going to revisit this from the original perspective of the events in question as they happened. The queries to obtain reg0/reg1 namespace counts are not, by themselves, going to work as a viable monitoring method for this scenario, since we have no idea whether a high count relates to a single NetworkPolicy object generating a lot of OVS flows on its own, or to the summation of a larger number of NetworkPolicy objects adding up to a similar count.

This is why, from this side of things, Red Hat's proposed enhancement to add logging for the original use case remains the most viable option. For the time being, we'll go back to monitoring the OVS pods' CPU and memory consumption, since those creep up (particularly memory) in the scenario we're trying to capture.

wmhutchison commented 2 years ago

Some quick and dirty PromQL that'll give us the basics for SDN pod memory.

sum(container_memory_working_set_bytes{namespace='openshift-sdn',pod=~'sdn-.....'}) BY (pod, namespace)

wmhutchison commented 2 years ago

Copy/pasting an existing AlertManager PromQL query that's in use and may prove helpful for setting up what I want: alert if any target pod exceeds a defined threshold of RAM (we do not want to be alerted on all affected pods, just when any of them go over the limit).

sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5

wmhutchison commented 2 years ago

Got it. This will work as expected; the example below lists all SDN pods using 5 GB or more of memory.

sum(container_memory_working_set_bytes{namespace='openshift-sdn',pod=~'sdn-.....'}) BY (pod, namespace) > (5*1024*1024*1024)

wmhutchison commented 2 years ago

Link to the full two-week Prometheus graph for SDN pods going over 4 GB RAM. We get regular hits at 3 GB. Will need to find out if 4 GB is high enough or if we should make it 5 GB.

https://prometheus-k8s-openshift-monitoring.apps.silver.devops.gov.bc.ca/graph?g0.range_input=2w&g0.expr=sum(container_memory_working_set_bytes%7Bnamespace%3D%27openshift-sdn%27%2Cpod%3D~%27sdn-.....%27%7D)%20BY%20(pod%2C%20namespace)%20%3E%20(4*1024*1024*1024)&g0.tab=0

wmhutchison commented 2 years ago

Starting to wade through the existing Nagios monitoring setup done via platform-tools to see what's involved in creating the monitor for this particular event as well. The premise is that if the query returns any results, we alert (WARN/CRITICAL depending on the RAM usage threshold) and act on it; no results returned means all is good.
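
The general shape of such a check might look roughly like the sketch below. This is not the platform-tools implementation: the endpoint, token handling, and single CRITICAL threshold are placeholders (a WARN tier at a lower threshold could be added the same way).

#!/usr/bin/env bash
# Nagios-style check: query Thanos/Prometheus and alert if any SDN pod exceeds the RAM threshold.
# Exit codes follow the Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
PROM_URL="https://thanos.example.invalid"   # placeholder endpoint
QUERY="sum(container_memory_working_set_bytes{namespace='openshift-sdn',pod=~'sdn-.....'}) BY (pod, namespace) > (4*1024*1024*1024)"

RESULT=$(curl -sfG -H "Authorization: Bearer ${PROM_TOKEN}" \
  --data-urlencode "query=${QUERY}" "${PROM_URL}/api/v1/query") || { echo "UNKNOWN: query failed"; exit 3; }

COUNT=$(echo "${RESULT}" | jq '.data.result | length')
if [ "${COUNT}" -gt 0 ]; then
  echo "CRITICAL: ${COUNT} SDN pod(s) over the memory threshold"
  exit 2
fi
echo "OK: no SDN pods over the memory threshold"
exit 0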

wmhutchison commented 2 years ago

Should be able to resume work on this in Sprint 35.

wmhutchison commented 2 years ago

Got some cycles now to put into this task, but it will likely carry into Sprint 36 based on remaining time available to work on it and other initiatives needing attention.

Revisiting the current Nagios monitoring setup so a new branch/PR can be created to test this out. This may also be an opportunity to learn/test Steven's recent work allowing podman testing in our IDIR_A account spaces on the UTIL servers.

wmhutchison commented 2 years ago

Have confirmed the previous PromQL query (list SDN pods consuming 4 GB or more of memory) still suffices as a threshold for Nagios alerting. There were no hits for the full span of Prometheus data retention. Will proceed with building out the Nagios monitor.

wmhutchison commented 2 years ago

https://github.com/bcgov-c/platform-tools/blob/ocp4-base/ocp4/nagios/runner/project/roles/nagios/tasks/monitoring.yaml contains all of the special stuff for adding a new monitor as well as leveraging Thanos for grabbing the data we need.

No problems with tacking on the additional query, but we will likely need to do some testing/experiments around how data is returned for the new query: run some tests first to grab the data and re-parse it into something that Nagios can use.

wmhutchison commented 2 years ago

Have finished testing with a proof-of-concept Thanos playbook so the query output can be inspected and we can work out how to parse it for presentation to Nagios. Will now carve out a new branch/PR and start work on updating the Nagios image/pod to add the new monitor. Will test in KLAB or CLAB first as per usual.

wmhutchison commented 2 years ago

https://github.com/bcgov-c/platform-tools/pull/76 created now with initial updates. Next step is to test it out in a LAB environment, will likely need to mess around with threshold settings first to trigger it in LAB.

wmhutchison commented 2 years ago

CLAB was the guinea pig chosen for testing the new Nagios monitor. Had to adjust thresholds since LAB SDN pod resource consumption is a lot lower than in SILVER.

Will revisit PR and make final threshold adjustments before promoting the PR for review/approval.

wmhutchison commented 2 years ago

While it will be close, am fairly confident this ticket will be finished off by the conclusion of Sprint 35. If for some reason it carries over to Sprint 36, it likely won't be for long.

wmhutchison commented 2 years ago

New issue to resolve: while reviewing the documentation on troubleshooting NetworkPolicy issues, it was discovered that some changes under the hood in OCP 4.7 have made the previous troubleshooting commands no longer valid.

The data still exists; there is just no longer a dedicated OVS pod, only the SDN pod, which is where the OVS data now resides. Will review the previous commands and adjust their syntax as needed.
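
For reference, the earlier flow-count commands would presumably just need the pod selector swapped, along these lines (an untested sketch, assuming the SDN pods carry the app=sdn label, have a container named sdn, and still have ovs-ofctl available):

# Same reg0 count as before, but targeting the SDN pods instead of the removed OVS pods
for P in $(oc get pods -n openshift-sdn -l app=sdn -o custom-columns=POD:.metadata.name --no-headers); do
  echo "$P"
  oc exec -n openshift-sdn "$P" -c sdn -- ovs-ofctl -O OpenFlow13 dump-flows br0 \
    | grep -o 'reg0=[^,]*' | cut -d = -f 2 | sort | uniq -c | sort -d > "$P"-reg0-flow_count.out
done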

wmhutchison commented 2 years ago

The noted issue will definitely cause this ticket to be moved into Sprint 36 before fully resolved, but shouldn't take up a ton of time.

wmhutchison commented 2 years ago

https://github.com/bcgov-c/advsol-docs/pull/172 created for documenting the new monitor and how to service it while on-call

wmhutchison commented 2 years ago

Blocked for now until the PRs for the new Nagios monitor and the matching docs for the on-call team to service the monitor are both reviewed and approved. Once that's done and the code/docs are merged, we can push out the new Nagios monitor across all OCP clusters.

wmhutchison commented 2 years ago

Blockers addressed, moving back into In Progress. This ticket will be completed today at the latest.

wmhutchison commented 2 years ago

Confirmed in Nagios that Steven Barre has pushed out the new monitor to all clusters. Closing off this ticket as complete.