Closed mitovskaol closed 3 years ago
Created a sample page here: https://stats.uptimerobot.com/w28pPSLlZE using the ARO cluster.
The tasks above are still required.
@stewartshea to check out Cerberus, a tool used to monitor Kubernetes/OpenShift clusters, and whether it can serve as a single access point for Uptime Robot to query all Platform metrics (5 OCP essential services + Shared Services like KeyCloak SSO, Artifactory, Vault, etc.).
https://github.com/cloud-bulldozer/cerberus
An example config file:
https://github.com/cloud-bulldozer/cerberus/blob/master/config/config.yaml
Cerberus is set up in the klab cluster and Uptime Robot integration is enabled here: https://stats.uptimerobot.com/w28pPSLlZE
The Silver installation will happen with the CCM setup, so it is currently blocked.
@ShellyXueHan What is the difference between how the data is collected for the KLAB Cerberus metric vs KLAB Cluster API?
@mitovskaol KLAB Cerberus is connected with Cerberus (monitoring details); KLAB Cluster API is just the healthz endpoint. I set up the two for comparison.
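To make the difference between the two monitors concrete, here is a minimal sketch of how each response could be interpreted. It assumes Cerberus publishes its go/no-go signal as the plain text `True` or `False`, and that the cluster API healthz endpoint answers `200` with the body `ok`. The function names are hypothetical and not part of either tool.

```python
def cerberus_is_healthy(body: str) -> bool:
    """Interpret the Cerberus go/no-go signal (assumed to be plain-text 'True'/'False')."""
    return body.strip() == "True"


def healthz_is_healthy(status_code: int, body: str) -> bool:
    """Interpret a cluster API healthz response (assumed to be 200 with body 'ok')."""
    return status_code == 200 and body.strip() == "ok"


def compare(cerberus_body: str, healthz_status: int, healthz_body: str) -> str:
    """Summarize whether the two signals agree for one sampling interval."""
    cerberus_ok = cerberus_is_healthy(cerberus_body)
    healthz_ok = healthz_is_healthy(healthz_status, healthz_body)
    if cerberus_ok == healthz_ok:
        return "agree"
    # Cerberus aggregates several checks, so it can report a failure
    # while the API server itself still answers healthz.
    return "cerberus-only-failure" if healthz_ok else "healthz-only-failure"
```

A run of `cerberus-only-failure` results would point at the default Cerberus checks being stricter than plain API availability.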
As a next step, I'll double check with Steven to see how accurate it is, and also check whether Cerberus is too aggressive (as you can see, it shows several downtime indicators on April 1st and 2nd). If so, we'll use Cerberus custom checks instead of the default ones.
With Cerberus integrated now, this becomes a bigger task:
Closing this one now.
Compare the following endpoints on klab (using the openshift-bcgov-cerberus namespace in place of openshift, for future integration): https://api.klab.devops.gov.bc.ca:6443/apis/project.openshift.io/v1/projects/openshift-bcgov-cerberus
Going to use 2 endpoints for comparison:
Reason not to use:
Need to be done:
We are currently comparing three monitoring options:
See the different results here: https://stats.uptimerobot.com/4VDxgiJnng
Create a status page in the Uptime Robot service for monitoring the OpenShift API in the Silver cluster.
DOD (feel free to update as needed):
[ ] Create a service account in Silver cluster with access to the openshift namespace (a custom role may be needed). Configure the service account rotation and add the service account to CCM to ensure it gets created in all OCP 4 clusters.
[ ] Review and document ways to manage the health checks and announcements in Uptime Robot (i.e. codify its configuration)
[ ] Identify a way to seamlessly send notifications from the ops team to the dashboard
[ ] Create a status page in Uptime Robot with a monitor for the OpenShift API endpoint that returns the status of the openshift namespace. Use the service account for accessing the endpoint https://api.silver.devops.gov.bc.ca:6443/apis/project.openshift.io/v1/projects/openshift
In the monitor, check that the query actually returns data in addition to 200 OK.
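The "returns data in addition to 200" requirement can be sketched as a small validation function. This is an illustration only: it assumes the projects endpoint returns a JSON object with `kind` set to `Project` and the namespace name under `metadata.name`. The function name `project_check_passes` is hypothetical.

```python
import json


def project_check_passes(status_code: int, body: str, expected_name: str = "openshift") -> bool:
    """Return True only if the response is 200 AND carries the expected project data."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A 200 with an empty or non-JSON body should not count as healthy.
        return False
    if not isinstance(payload, dict):
        return False
    metadata = payload.get("metadata")
    if not isinstance(metadata, dict):
        return False
    # Assumed response shape: {"kind": "Project", "metadata": {"name": ...}, ...}
    return payload.get("kind") == "Project" and metadata.get("name") == expected_name
```

In a hosted monitor, the same idea could be approximated with a keyword check on the response body rather than a plain HTTP status check.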
The mockup dashboard can be found here https://stats.uptimerobot.com/w28pPSLlZE
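For the DOD item about codifying the Uptime Robot configuration, one option is to keep monitor definitions in code and push them through the Uptime Robot API. The sketch below only builds the request payload and sends nothing. The endpoint and field names (`newMonitor`, `type`, `keyword_type`, `keyword_value`) reflect my understanding of the v2 API and should be verified against the current Uptime Robot documentation. The helper name and the example values are hypothetical.

```python
from urllib.parse import urlencode

# Assumed v2 endpoint for creating a monitor; verify before use.
NEW_MONITOR_URL = "https://api.uptimerobot.com/v2/newMonitor"


def build_keyword_monitor(api_key: str, friendly_name: str, url: str, keyword: str) -> dict:
    """Build the form fields for a keyword monitor that expects `keyword` in the body."""
    return {
        "api_key": api_key,
        "format": "json",
        "type": 2,  # assumed: 2 = keyword monitor
        "friendly_name": friendly_name,
        "url": url,
        "keyword_type": 2,  # assumed: 2 = alert when the keyword is missing
        "keyword_value": keyword,
    }


def encode_form(fields: dict) -> str:
    """Encode the fields as an application/x-www-form-urlencoded request body."""
    return urlencode(fields)


if __name__ == "__main__":
    fields = build_keyword_monitor(
        api_key="REPLACE_ME",  # placeholder, never commit a real key
        friendly_name="Silver OpenShift API",
        url="https://api.silver.devops.gov.bc.ca:6443/apis/project.openshift.io/v1/projects/openshift",
        keyword='"kind"',
    )
    print(NEW_MONITOR_URL)
    print(encode_form(fields))
```

Keeping definitions like this in a repository would let monitor changes go through review. Passing the service account token to the monitor is not covered here and would need a separate look at what the Uptime Robot plan supports.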