Closed wmhutchison closed 1 year ago
Not started yet. Need to first reach consensus on whether we switch these checks across the board to Nagios' test nginx container or use something else.
Increased priority for this ticket will be provided for the Sprint starting May 25th.
With the demands of OCP 4.12 and associated NSX patching now out of the way, can put more cycles into this.
Based on what I can see, existing firewall rules will allow Nagios to hit any destination URL we want to check, so long as it's on port 443.
Proxy_Ingress checks will probably need to be dropped entirely since, under the current architecture of STMS, CORP and SDN/NSX, the kamproxy/calproxy systems are unable to resolve hostnames managed by AVI DNS. This is why we have to set up a proxy exception in our SAG browsers if we want to hit any URL managed by AVI DNS.
Proxy_Ingress will need a re-design for KLAB2/EMERALD since by design the CORP web proxies involved cannot reach the destination private IP ranges.
Speaking with Steven Barre during a recent meeting on this, we can work around this by refactoring this check on KLAB2/EMERALD to use a public IP address. Will look into implementing this and whether a new generic ticket to the DXCAS SAM team will also be needed for access. In the meantime, will start looking at the GitHub repo that manages these Nagios checks to see how easy it will be to add either a check implemented differently depending on cluster, or a unique check for just the NSX clusters.
Ingress check will need some adjustment due to how it was configured. Currently not working for NSX because of the difference in content size between the Nagios nginx pod and the content presented by the Openshift ingress pods.
Thus we need to change
--pagesize 2048:8192
to
--pagesize 200:8192
Will make this change manually for the KLAB2 check to first confirm.
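To illustrate why lowering the minimum matters, here is a minimal Python sketch of how a MIN:MAX page size range gates a response. It is a simplified model of the range test, not the plugin's actual code, and the 600-byte body length is a made-up example value (the real page sizes are not recorded in this ticket).

```python
from typing import Optional, Tuple


def parse_pagesize(spec: str) -> Tuple[int, Optional[int]]:
    """Split a 'MIN:MAX' or 'MIN' spec into integers (MAX is optional)."""
    minimum, _, maximum = spec.partition(":")
    return int(minimum), (int(maximum) if maximum else None)


def page_size_ok(body_len: int, spec: str) -> bool:
    """Return True when body_len falls inside the (inclusive) range."""
    low, high = parse_pagesize(spec)
    if body_len < low:
        return False
    return high is None or body_len <= high


# Hypothetical 600-byte page: too small for the old range, fine for the new one.
print(page_size_ok(600, "2048:8192"))  # False
print(page_size_ok(600, "200:8192"))   # True
```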
I have also changed the route for nginx-insecure so that it will issue a public IP, which is needed for the CORP web proxy to be able to reach the host. Since this route is already in use by other Nagios Active checks, waiting to make sure this doesn't cause breakage elsewhere first.
Gave the insecure route a public IP, re-tested the Proxy_Ingress check for KLAB2 after updating that check to use the proper/new route. Still no dice. Assuming it's being blocked in NSX by a rule that's not logging itself, since not seeing anything in regular firewall logs for this. Will follow up on Monday.
We never put in the SIS rule for port 80 since "no prod URL should be using non-https". But then, since we can't use the default hostname either ... not sure how to deal with a TLS cert ...
I can hit the Cerberus KLAB2 URL (converted to HTTP so that Shelly didn't have to buy a cert for just a True/False response URL) from my home connection (http://cerberus.klab2.developer.gov.bc.ca/) so SiS seems to be opened up for port 80, or at least via the VIP range(s) on KLAB2. Tried the EMERALD Cerberus HTTP address, also works.
So SiS is currently allowing port 80 through, perhaps only for specific VIP ranges.
Huh, guess my info is out of date then. Never mind me! Good luck on your debugging efforts.
After a session with Dan Deane to go over all of this, the Ingress check now works. DXCAS Nagios is covered by NSX, but by a separate Corporate NSX instance, meaning the logs it was generating were not visible to me or other team members in the BC Gov VRLI web portal. Dan thus added a DFW entry to the CORP NSX instance to fix.
Proper unit testing certainly helps as well for the last check here (Proxy_Ingress). Used my personal VirtualBox VM, which is pre-configured to use the involved CORP web proxies. Found out, or more accurately was reminded, that URLs managed by AVI whose DNS is also handled by AVI cannot be resolved by the CORP proxy servers, since they cannot reach AVI DNS, and it's doubtful such access would be allowed.
Thus we will need to treat this in the same fashion as we ask our users, in terms of setting up a "vanity domain". The downside here is that setting up Nagios on a new NSX-backed cluster isn't 100% automatic (yet) since an NNR entry will need to be created to handle this specific scenario.
Alright, automation time. Copy/pasted out the Nagios command line for both Ingress and Proxy_Ingress for both KLAB and KLAB2.
Would be fine with going with variable definition/substitutions for this if it was just the testing URL that changes, but regular tests are being done against the Openshift internal router pods, which give distinct/unique responses compared to what's available to us otherwise.
Based on technical limitations around some Nagios checks as well as how AVI renders some of the routes, the following routes will be needed.
It's possible to be more efficient with the number of routes by putting more checks on the last route, since it supports both HTTP and HTTPS queries. However, we want to avoid actually using public VIPs here unless absolutely necessary.
For the sake of this ticket, will take https://github.com/bcgov-c/platform-tools/blob/main/nagios-speed/nginx-speed.yaml and add a separate YAML for just NSX clusters.
Need to check inside the Nagios playbooks to see what's already there for cluster detection, seems like nagios_cluster gets set based on the ConfigMap value for NAGIOS_CLUSTER which will suffice for just KLAB2/EMERALD decision making here. If we end up with more NSX clusters, then we might end up baking an NSX variable into the configmap.
Noted how the Nagios container builds out the various checks: it compiles a list of checks into a single fact. Will take this and split things up into three categories.
Will then compile the final facts var based on which type of cluster we're running from. This also means we don't have to customize the monitoring names in Nagios.
It also works towards making people who add new Nagios checks think about how they work (do they work on all clusters, or just NSX and/or regular Openshift?). The answer to that question tells the author where to place their new check.
This works well for ensuring we don't need to remove unneeded checks, since at present the image, when re-run in KLAB2/EMERALD, will try to add the Postgres DB checks meant to QA the F5 CIS stuff.
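The selection logic described above can be sketched roughly as follows. This is an illustration in Python only; the real work lives in the Nagios playbooks. The category names and the check names in each list are hypothetical placeholders, and treating KLAB2/EMERALD as the full set of NSX clusters is an assumption based on the notes above.

```python
from typing import List

# Assumption: the only NSX-backed clusters today, keyed off NAGIOS_CLUSTER.
NSX_CLUSTERS = {"KLAB2", "EMERALD"}

# Hypothetical placeholder check names, one list per category.
COMMON_CHECKS = ["example_common_check"]
NSX_ONLY_CHECKS = ["example_nsx_check"]
REGULAR_ONLY_CHECKS = ["example_postgres_f5_cis_check"]


def is_nsx_cluster(nagios_cluster: str) -> bool:
    """Decide the cluster type from the NAGIOS_CLUSTER value."""
    return nagios_cluster.strip().upper() in NSX_CLUSTERS


def compile_checks(nagios_cluster: str) -> List[str]:
    """Build the final list: common checks plus the cluster-type-specific ones."""
    if is_nsx_cluster(nagios_cluster):
        specific = NSX_ONLY_CHECKS
    else:
        specific = REGULAR_ONLY_CHECKS
    return COMMON_CHECKS + specific


print(compile_checks("KLAB2"))
print(compile_checks("KLAB"))
```

Because the final list is assembled per cluster type, a check that only makes sense on regular Openshift is never added on an NSX cluster in the first place, so nothing has to be removed afterwards.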
https://github.com/bcgov-c/platform-tools/pull/159 now exists in draft, continues to be a WIP. Once done, will test on one of the regular LAB clusters first, then jump over to KLAB2 before taking the resulting work out of Draft.
Barring bug fixes, applied the last of the intended changes/updates to the previously mentioned PR. Now off to start doing actual tests for these changes both in regular LAB as well as KLAB2.
PR successfully applied to CLAB/KLAB2 as well as EMERALD. Closing ticket - remaining regular OCP clusters will get updated during the next OCP upgrade.
Describe the issue
At present all of the Ingress checks from Nagios Active are being done by hitting https://apps..devops.gov.bc.ca/ and looking for a "503 Service Unavailable" as well as "Application is not available".
This check only works on regular Openshift clusters by virtue of how the Openshift ingress router pods work in comparison to the NSX AVI front-end. Ideally we want to pick a new URL that can be used on all Openshift clusters, regardless of whether they are NSX-backed. A possible suggestion is to make use of the existing Nagios nginx test pod and set up a suitable test that will generate a consistent response across all clusters.
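For reference, what the existing check effectively asserts can be sketched as below. This is a hedged Python illustration only, not the actual Nagios command; the function name and the sample responses are hypothetical, and only the two matched strings come from the description above.

```python
def default_router_response_ok(status_line: str, body: str) -> bool:
    """True when the response looks like the Openshift router's default page."""
    return (
        "503 Service Unavailable" in status_line
        and "Application is not available" in body
    )


# Hypothetical response shaped like the router's default page: check passes.
print(default_router_response_ok(
    "HTTP/1.0 503 Service Unavailable",
    "<h1>Application is not available</h1>",
))

# Any front-end that answers differently fails the same check.
print(default_router_response_ok("HTTP/1.1 404 Not Found", "not found"))
```

Since the test is tied to one specific default page, it only holds where the Openshift ingress router pods answer the request, which is why a response served consistently by the nginx test pod is the more portable target.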
Additional context
Fulfilling this ticket will take care of the following Nagios checks.
How does this benefit the users of our platform?
Ensuring Ingress services are working as expected on all clusters, including NSX-backed clusters.
Definition of done
Ensure firewall requests go in for allowing access to the involved VIPs.