BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

SDN - Refactor how Nagios Ingress Checks are handled for KLAB2 and EMERALD #3796

Closed wmhutchison closed 1 year ago

wmhutchison commented 1 year ago

Describe the issue

At present, all of the Ingress checks from Nagios Active are done by hitting https://apps..devops.gov.bc.ca/ and looking for a "503 Service Unavailable" as well as "Application is not available".

This check only works on regular Openshift clusters, by virtue of how the Openshift ingress router pods work in comparison to the NSX AVI front-end. We thus want to pick a new URL that can be used on all Openshift clusters, whether NSX-backed or not. A possible suggestion is to make use of the existing Nagios nginx test pod and set up a suitable test that will generate a consistent response across all clusters.
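As a rough illustration of the existing check's logic, the sketch below treats a response as "the router is alive" when it carries a 503 status and the "Application is not available" text. The function name and the way the two strings are combined are assumptions made for illustration; the real check is a Nagios command, not Python.

```python
# Hypothetical sketch of the current Ingress check's pass condition.
# Assumption: "503 Service Unavailable" is matched via the HTTP status and
# "Application is not available" via the response body.
EXPECTED_STATUS = 503
EXPECTED_BODY_TEXT = "Application is not available"


def looks_like_router_default_page(status_code: int, body: str) -> bool:
    """Return True when the response matches the ingress router's default page."""
    return status_code == EXPECTED_STATUS and EXPECTED_BODY_TEXT in body
```

A front-end that answers an unrouted hostname differently fails this test even when ingress itself is healthy, which is why a response that is identical on every cluster is preferable.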

Additional context

Fulfilling this ticket will take care of the following Nagios checks.

How does this benefit the users of our platform?

Ensuring Ingress services are working as expected on all clusters, including NSX-backed clusters.

Definition of done

wmhutchison commented 1 year ago

Not started yet. Need to first reach a consensus on whether these checks switch across the board to using Nagios' test nginx container or use something else.

wmhutchison commented 1 year ago

Increased priority for this ticket will be provided for the Sprint starting May 25th.

wmhutchison commented 1 year ago

With the demands of OCP 4.12 and associated NSX patching now out of the way, can put more cycles into this.

wmhutchison commented 1 year ago

Based on what I can see, existing firewall rules will allow Nagios to hit any destination URL we want to check, so long as it's on port 443.

wmhutchison commented 1 year ago

Proxy_Ingress checks will probably need to be dropped entirely, since under the current architecture of STMS, CORP and SDN/NSX, the kamproxy/calproxy systems are unable to resolve hostnames managed by AVI DNS. This is why we have to set up a proxy exception in our SAG browsers if we want to hit any URL managed by AVI DNS.

wmhutchison commented 1 year ago

Proxy_Ingress will need a re-design for KLAB2/EMERALD since by design the CORP web proxies involved cannot reach the destination private IP ranges.

Speaking with Steven Barre during a recent meeting on this, we can work around this by refactoring this check on KLAB2/EMERALD to be done via the issuance of a public IP address. Will look into implementing this and seeing whether a new generic ticket to the DXCAS SAM team will also be needed for accessing this. In the meantime, will start looking at the Github repo for managing these Nagios checks and see how easy it will be to either add a check implemented differently depending on cluster, or create a unique/different check for just the NSX clusters.

wmhutchison commented 1 year ago

Ingress check will need some adjustment due to how it was configured. Currently not working for NSX because of the difference in content size between the Nagios nginx pod and the content presented by the Openshift ingress pods.

Thus we need to change

--pagesize 2048:8192

to

--pagesize 200:8192

Will make this change manually for the KLAB2 check to first confirm.
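To make the effect of the new range concrete, here is a minimal sketch of how a `min:max` page size bound behaves. The helper names and the 612-byte example page are hypothetical; only the two range values come from the change above.

```python
from typing import Optional, Tuple


def parse_pagesize(spec: str) -> Tuple[int, Optional[int]]:
    """Split a 'min:max' (or bare 'min') page size spec into integers."""
    minimum, _, maximum = spec.partition(":")
    return int(minimum), (int(maximum) if maximum else None)


def pagesize_ok(content_length: int, spec: str) -> bool:
    """Return True when the page size in bytes falls inside the allowed range."""
    minimum, maximum = parse_pagesize(spec)
    if content_length < minimum:
        return False
    return maximum is None or content_length <= maximum


# Hypothetical example: a small nginx test page of 612 bytes.
small_page = 612
old_range_passes = pagesize_ok(small_page, "2048:8192")  # below the old minimum
new_range_passes = pagesize_ok(small_page, "200:8192")   # inside the new range
```

Lowering only the minimum keeps the upper bound as a guard against unexpectedly large responses while letting the smaller nginx page pass.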

wmhutchison commented 1 year ago

I have also changed the route for nginx-insecure so that it will issue a public IP, needed in order for the CORP web proxy to be able to reach the host. Since this route is already in use with other Nagios Active checks, waiting first to make sure this doesn't cause breakage elsewhere.

wmhutchison commented 1 year ago

Gave the insecure route a public IP, re-tested the Proxy_Ingress check for KLAB2 after updating that check to use the proper/new route. Still no dice. Assuming it's being blocked in NSX by a rule that's not logging itself, since not seeing anything in regular firewall logs for this. Will follow up on Monday.

StevenBarre commented 1 year ago

We never put in the SIS rule for port 80 since "no prod URL should be using non-https". But then, since we can't use the default hostname either ... not sure how to deal with a TLS cert ...

wmhutchison commented 1 year ago

I can hit the Cerberus KLAB2 URL (converted to HTTP so that Shelly didn't have to buy a cert for just a True/False response URL) from my home connection (http://cerberus.klab2.developer.gov.bc.ca/) so SiS seems to be opened up for port 80, or at least via the VIP range(s) on KLAB2. Tried the EMERALD Cerberus HTTP address, also works.

So SiS is currently allowing port 80 through, perhaps only for specific VIP ranges.

StevenBarre commented 1 year ago

Huh, guess my info is out of date then. Nevermind me! Good luck on your debugging efforts

wmhutchison commented 1 year ago

After a session with Dan Deane to go over all of this, the Ingress check now works. DXCAS Nagios is covered by NSX, but by a separate Corporate NSX instance, meaning the logs it was generating were not visible to me or other team members in the BC Gov VRLI web portal. Dan thus added a DFW entry to the CORP NSX instance to fix this.

wmhutchison commented 1 year ago

Proper unit testing certainly helps as well for the last check here (Proxy_Ingress). Used my personal VirtualBox VM, which has the involved CORP web proxies pre-baked. Found out, or more accurately was reminded, that URLs managed by AVI whose DNS is also handled by AVI cannot be resolved by the CORP proxy servers, since they're not able to reach AVI DNS, and it's doubtful such access would be allowed.

Thus we will need to treat this in the same fashion as we ask our users, in terms of setting up a "vanity domain". The downside here is that setting up Nagios on a new NSX-backed cluster isn't 100% automatic (yet) since an NNR entry will need to be created to handle this specific scenario.

wmhutchison commented 1 year ago

Alright, automation time. Copy/pasted out the Nagios command line for both Ingress and Proxy_Ingress for both KLAB and KLAB2.

Would be fine with going with variable definition/substitutions for this if it was just the testing URL that changes, but regular tests are being done against the Openshift internal router pods, which give distinct/unique responses compared to what's available to us otherwise.

  1. Ingress/Proxy_Ingress will continue to be created for regular Openshift clusters. They will be skipped for NSX-aware clusters.
  2. NSX_Ingress/NSX_Proxy_Ingress will be created net-new only for NSX-aware clusters. Regular Openshift clusters will not create these.
  3. Since the proxy Ingress check needs a public-IP using a vanity domain URL, we can re-use the same URL for the Ingress check, thus resulting in automation only creating one new route YAML for KLAB2/EMERALD.

wmhutchison commented 1 year ago

Based on technical limitations around some Nagios checks as well as how AVI renders some of the routes, the following routes will be needed.

  1. nginx-openshift-bcgov-nagios.apps.<cluster>.devops.gov.bc.ca. This is the stock route which is available in all Openshift clusters for Nagios checks. For NSX-backed clusters, this check is appropriate for HTTPS checks where a private VIP still works.
  2. nginx-insecure-openshift-bcgov-nagios.apps.<cluster>.devops.gov.bc.ca. This route specifically only supports HTTP, but is still a private VIP. This is required because of how AVI renders its various virtual services. Regular Openshift could use the previous route for HTTP tests, but that's not permissible in AVI since the results are different IPs and a DNS entry can only point to one of them reliably.
  3. nginx-public-nagios.<cluster>.devops.gov.bc.ca. This is a public VIP using an internal "vanity domain" format, meaning we need to manually create a DNS entry to point to the assigned IP once the route is created. Needed in order to allow the Proxy_Ingress check to function.

It's possible to be more efficient with the number of routes by putting more checks on the last route, since it supports both HTTP and HTTPS queries. However, we want to avoid actual usage of public VIPs here unless absolutely necessary.
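The three route hostnames listed above can be derived from the cluster name alone. The sketch below is illustrative only: the function name, the dictionary keys, the lowercase `klab2` cluster label, and the assumption that regular clusters need only the stock route are not taken from the actual playbooks.

```python
DOMAIN = "devops.gov.bc.ca"


def nagios_route_hosts(cluster: str, nsx_backed: bool) -> dict:
    """Build the Nagios test route hostnames for one cluster.

    Assumption: regular clusters only need the stock route, while
    NSX-backed clusters also need the HTTP-only and public routes.
    """
    hosts = {
        # Stock route, private VIP, used for HTTPS checks.
        "https_private": f"nginx-openshift-bcgov-nagios.apps.{cluster}.{DOMAIN}",
    }
    if nsx_backed:
        # HTTP-only route, still a private VIP.
        hosts["http_private"] = (
            f"nginx-insecure-openshift-bcgov-nagios.apps.{cluster}.{DOMAIN}"
        )
        # Public VIP on a vanity domain; its DNS entry is created manually.
        hosts["public_vanity"] = f"nginx-public-nagios.{cluster}.{DOMAIN}"
    return hosts


klab2_hosts = nagios_route_hosts("klab2", nsx_backed=True)
```

Keeping the public route as a separate entry makes it easy to see which checks depend on a public VIP.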

wmhutchison commented 1 year ago

For the sake of this ticket, will take https://github.com/bcgov-c/platform-tools/blob/main/nagios-speed/nginx-speed.yaml and add a separate YAML for just NSX clusters.

wmhutchison commented 1 year ago

Need to check inside the Nagios playbooks to see what's already there for cluster detection. It seems like nagios_cluster gets set based on the ConfigMap value for NAGIOS_CLUSTER, which will suffice for just the KLAB2/EMERALD decision making here. If we end up with more NSX clusters, then we might end up baking an NSX variable into the ConfigMap.
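The name-based decision could look something like the following. This is a Python sketch of the idea only; the real logic lives in the Nagios playbooks, and the function name, the case-insensitive match, and reading NAGIOS_CLUSTER from the environment are assumptions.

```python
import os
from typing import Mapping, Optional

# Clusters currently known to be NSX-backed, per this ticket.
NSX_CLUSTERS = frozenset({"KLAB2", "EMERALD"})


def is_nsx_cluster(env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide NSX vs regular from the NAGIOS_CLUSTER value.

    Assumption: the ConfigMap value is exposed as an environment variable
    and is compared case-insensitively.
    """
    source = os.environ if env is None else env
    return source.get("NAGIOS_CLUSTER", "").strip().upper() in NSX_CLUSTERS
```

A hard-coded name list is the simplest option while there are only two NSX clusters; a dedicated flag in the ConfigMap would remove the need to edit this list when a new one is added.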

wmhutchison commented 1 year ago

Noted how the Nagios container is building out the various checks: it compiles a list of checks into a single fact. Will take this and split things up into three categories.

Will then compile the final facts var based on which type of cluster we're running from. This also means we don't have to customize the monitoring names in Nagios.

It also encourages people adding new Nagios checks to think about how they work (do they work on all clusters, or just NSX and/or regular Openshift?). The answer to that question tells the author where to place their new check.

This works well for ensuring we don't need to remove un-needed checks, since at present the image, when re-run in KLAB2/EMERALD, will try to add the Postgres DB checks meant to QA the F5 CIS stuff.
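A minimal sketch of that split, assuming the three categories are "all clusters", "regular Openshift only" and "NSX only" (inferred from the question posed above, since the category list itself is not shown here). The common check name is a placeholder, and the other names reuse the ones proposed earlier in this ticket; the actual implementation builds an Ansible fact rather than a Python list.

```python
# Hypothetical category lists; only the Ingress-related names come from this ticket.
COMMON_CHECKS = ["Example_Common_Check"]  # placeholder for checks valid everywhere
REGULAR_ONLY_CHECKS = ["Ingress", "Proxy_Ingress"]
NSX_ONLY_CHECKS = ["NSX_Ingress", "NSX_Proxy_Ingress"]


def compile_checks(nsx_backed: bool) -> list:
    """Return the final list of checks for the cluster type we are running on."""
    cluster_specific = NSX_ONLY_CHECKS if nsx_backed else REGULAR_ONLY_CHECKS
    # Build a new list so the category lists are never mutated.
    return COMMON_CHECKS + cluster_specific
```

Because a check is only ever added from the category that applies, nothing has to be removed afterwards on clusters where it does not belong.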

wmhutchison commented 1 year ago

https://github.com/bcgov-c/platform-tools/pull/159 now exists in draft, continues to be a WIP. Once done, will test on one of the regular LAB clusters first, then jump over to KLAB2 before taking the resulting work out of Draft.

wmhutchison commented 1 year ago

Barring bug fixes, applied the last of the intended changes/updates to the previously mentioned PR. Now off to start doing actual tests for these changes in both regular LAB and KLAB2.

wmhutchison commented 1 year ago

PR successfully applied to CLAB/KLAB2 as well as EMERALD. Closing ticket - remaining regular OCP clusters will get updated during the next OCP upgrade.