BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3).
Apache License 2.0

Create Uptime Robot dashboard for Monitoring Openshift API in Silver #828

Closed mitovskaol closed 3 years ago

mitovskaol commented 3 years ago

Create a status page in the Uptime Robot service for monitoring the OpenShift API in the Silver cluster.

DOD (feel free to update as needed):

In the monitor, check that the query actually returns data in addition to a 200 OK status.
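
A minimal sketch of the check described above, in Python, assuming the monitored endpoint returns a non-empty body when healthy. The `expected_keyword` parameter and the function names are hypothetical, and Uptime Robot itself would be configured through its own monitor settings rather than with this code; this only illustrates the logic.

```python
import urllib.error
import urllib.request


def response_has_data(status_code, body, expected_keyword=None):
    """Return True only if the status is 200 AND the body carries real data."""
    if status_code != 200:
        return False
    text = body.strip()
    if not text:
        # A 200 with an empty body is treated as a failure.
        return False
    if expected_keyword is not None:
        # Optionally require a known string to appear in the payload.
        return expected_keyword in text
    return True


def probe(url, expected_keyword=None, timeout=10):
    """Fetch `url` and apply the check above. Network errors count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_has_data(resp.status, body, expected_keyword)
    except (urllib.error.URLError, OSError):
        return False
```

The pure function `response_has_data` is kept separate from `probe` so the decision logic can be tested without network access.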

The mockup dashboard can be found here: https://stats.uptimerobot.com/w28pPSLlZE

stewartshea commented 3 years ago

Created a sample page here: https://stats.uptimerobot.com/w28pPSLlZE using the ARO cluster.

The tasks above are still required.

mitovskaol commented 3 years ago

@stewartshea to check out Cerberus, a tool used to monitor Kubernetes/OpenShift clusters, and whether it can be used as a single access point for Uptime Robot to query all Platform metrics (5 OCP essential services plus Shared Services such as KeyCloak SSO, Artifactory, Vault, etc.).

https://github.com/cloud-bulldozer/cerberus

An example config file:

https://github.com/cloud-bulldozer/cerberus/blob/master/config/config.yaml

ShellyXueHan commented 3 years ago

Update:

Cerberus is set up in the klab cluster, and Uptime Robot integration is enabled here: https://stats.uptimerobot.com/w28pPSLlZE

The Silver installation will happen with the CCM setup, so it is currently blocked.

mitovskaol commented 3 years ago

@ShellyXueHan What is the difference between how the data is collected for the KLAB Cerberus metric vs KLAB Cluster API?

ShellyXueHan commented 3 years ago

@mitovskaol KLAB Cerberus is connected with Cerberus (monitoring details); KLAB Cluster API is just the healthz endpoint. I set up the two for comparison.

As a next step, I'll double-check with Steven how accurate it is, and also check whether Cerberus is too aggressive (as you can see, it shows several downtime indicators on April 1st and 2nd). If so, we'll use Cerberus custom checks instead of the default ones.
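
To illustrate the difference between the two monitors compared in this comment, here is a rough Python sketch. It assumes the Cerberus status endpoint returns a body of `True` or `False` as its go/no-go signal, and that the cluster healthz endpoint returns `ok` with a 200 status; both assumptions should be verified against the actual klab setup. All function names are hypothetical.

```python
def cerberus_is_go(body):
    """Interpret the Cerberus go/no-go signal (assumed body: 'True' or 'False')."""
    return body.strip() == "True"


def healthz_is_ok(status_code, body):
    """Interpret a healthz response (assumed: 200 with body 'ok')."""
    return status_code == 200 and body.strip() == "ok"


def compare(cerberus_body, healthz_status, healthz_body):
    """Return a label describing whether the two monitors agree."""
    go = cerberus_is_go(cerberus_body)
    ok = healthz_is_ok(healthz_status, healthz_body)
    if go == ok:
        return "agree: up" if go else "agree: down"
    # Cerberus covers more than the API server alone, so it may report
    # no-go while the API server itself still answers healthz.
    return "cerberus stricter" if ok else "healthz stricter"
```

A run of `cerberus stricter` results over the same time window would be one indication that the default Cerberus checks are more aggressive than the plain healthz probe.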

ShellyXueHan commented 3 years ago

Updates:

With Cerberus integrated now, this becomes a bigger task:

Closing this one now.

ShellyXueHan commented 3 years ago

Update:

Compare the following endpoints on klab:

ShellyXueHan commented 3 years ago

Update:

Going to use 2 endpoints for comparison:

Reason not to use:

Needs to be done:

ShellyXueHan commented 3 years ago

Update:

We are currently comparing three monitoring options:

See the different results here: https://stats.uptimerobot.com/4VDxgiJnng