bcgov / digital-journeys

PSA Forms System
https://bcgov.github.io/digital-journeys/
Apache License 2.0

Explore: Improving Platform Health Checks #1386

Open iman-jamali-fw opened 8 months ago

iman-jamali-fw commented 8 months ago
  1. Evaluate Existing Health Checks: Start by assessing the current health check mechanisms of different components within the DGJ platform. These checks typically utilize HTTP endpoints and should also encompass the health of the platform's database. This review will ensure a comprehensive understanding of the platform's overall health.

  2. Explore User-Friendly Health Monitoring: Explore how to empower DGJ administrators with user-friendly tools to monitor platform health. Consider options like developing a custom user interface (UI) integrated into the website. This UI can display real-time information about the health of various platform components. Alternatively, leverage existing monitoring tools to track and visualize the health of these components, making it easier for administrators to identify and address potential issues.

  3. Explore Proactive Issue Notifications: Explore the steps to proactively notify DGJ users about potential issues to prevent disruptions. For instance, during platform upgrades (e.g., OpenShift updates), implement a gentle warning system that informs users of the ongoing maintenance, reducing the risk of unexpected downtime. Additionally, explore the possibility of disabling form submissions if any critical components (such as the API, formio, or Camunda) are experiencing outages. This measure ensures that submissions are only accepted when all necessary services are in good health, maintaining the integrity of the platform's functionality.
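
As a rough illustration of points 1 and 3 together, the sketch below polls one HTTP health endpoint per component and only allows form submissions when every component responds. It is a minimal sketch under assumptions, not the platform's actual implementation: the URLs, `COMPONENT_URLS`, `http_probe`, `check_platform`, and `PlatformHealth` are all hypothetical names, and a database check could be added as one more probe of the same shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical endpoints: the real DGJ health routes are not given in this issue.
COMPONENT_URLS: Dict[str, str] = {
    "api": "https://example.invalid/api/health",
    "formio": "https://example.invalid/formio/health",
    "camunda": "https://example.invalid/camunda/health",
}


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors, and malformed URLs all count as unhealthy.
        return False


@dataclass(frozen=True)
class PlatformHealth:
    """Snapshot of per-component health (True means healthy)."""

    components: Dict[str, bool]

    @property
    def unhealthy(self) -> List[str]:
        """Names of the components that failed their probe, sorted alphabetically."""
        return sorted(name for name, ok in self.components.items() if not ok)

    @property
    def submissions_enabled(self) -> bool:
        """Accept form submissions only when every component is healthy."""
        return not self.unhealthy


def check_platform(
    urls: Dict[str, str], probe: Callable[[str], bool] = http_probe
) -> PlatformHealth:
    """Probe every component once and collect the results."""
    return PlatformHealth({name: probe(url) for name, url in urls.items()})
```

A UI or a submission endpoint could call `check_platform(COMPONENT_URLS)` and show a maintenance banner listing `unhealthy` whenever `submissions_enabled` is false. Passing `probe` as a parameter keeps the gating logic testable without network access.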

iman-jamali-fw commented 8 months ago

@warrenchristian1telus @bhumin-fw @Andrew-Vargas-bcgov Please consider adding your ideas to this ticket, along with any existing tools you know of that could help improve health checks. Thank you.

warrenchristian1telus commented 8 months ago

I think we should look into using Sysdig alerts to broadcast relevant events to appropriate team members. For example, Camunda pod disruptions could alert DevOps to investigate outages, developers to check and re-run failed form submissions, and team members to be aware of any current outages or issues that may affect users. This could also warn of issues before they become critical, such as low disk space or high CPU load.

Alerts could be sent via Rocket.Chat webhook(s) to notify relevant team members, potentially with separate alert channels for DevOps, developers, and team members.
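
To make the separate-channels idea concrete, here is a minimal sketch of routing an event to one incoming webhook per audience. Everything named here is an assumption for illustration: the webhook URLs, the event names in `ROUTES`, and the helper functions are hypothetical, and the payload uses only the plain `text` field of a Rocket.Chat incoming webhook. It does not show how Sysdig itself would be configured.

```python
import json
from typing import Callable, Dict, List
from urllib.request import Request, urlopen

# Hypothetical incoming-webhook URLs, one per audience.
WEBHOOKS: Dict[str, str] = {
    "devops": "https://chat.example.invalid/hooks/devops-token",
    "developers": "https://chat.example.invalid/hooks/developers-token",
    "team": "https://chat.example.invalid/hooks/team-token",
}

# Hypothetical routing table: which audiences hear about which event.
ROUTES: Dict[str, List[str]] = {
    "pod_disruption": ["devops", "developers", "team"],
    "low_disk_space": ["devops"],
    "high_cpu_load": ["devops"],
}


def audiences_for(event: str) -> List[str]:
    """Unknown events fall back to DevOps only."""
    return ROUTES.get(event, ["devops"])


def build_payload(event: str, detail: str) -> dict:
    """Build a simple text payload, with a placeholder when the detail is blank."""
    text = detail.strip() or "(no details supplied by the alert source)"
    return {"text": f"[{event}] {text}"}


def post_json(url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    data = json.dumps(payload).encode("utf-8")
    request = Request(url, data=data, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return response.status


def send_alert(
    event: str, detail: str, post: Callable[[str, dict], object] = post_json
) -> List[str]:
    """Send the same payload to every audience routed for this event."""
    payload = build_payload(event, detail)
    notified = []
    for audience in audiences_for(event):
        post(WEBHOOKS[audience], payload)
        notified.append(audience)
    return notified
```

For example, `send_alert("pod_disruption", "camunda pod restarted")` would post to all three hooks, while `send_alert("low_disk_space", "volume at 90%")` would reach DevOps only. The `post` parameter exists so the routing can be exercised without a live chat server.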

fazil-ey commented 2 months ago

We have some health check notifications in place. @bhumin-fw to check with @warrenchristian1telus and confirm whether what we currently have for health checks is good enough.

Can we also ensure notifications go to a common email or Teams channel instead of personal emails?

warrenchristian1telus commented 2 months ago

There are a few "layers" of notifications and alerts available: OpenShift, Sysdig, and UptimeRobot. We can certainly discuss what needs to go where. I believe some of these alerts (OpenShift, UptimeRobot) can only go to email. UptimeRobot can alert in other ways, but I believe those are only available with a paid subscription. Sysdig can technically send to Rocket.Chat, but we've had issues getting any useful notifications so far. They do appear, but they are empty and therefore not particularly helpful, although they do let us know there is something to look for.

fazil-ey commented 1 month ago

There are no immediate concerns that pose major risks to our operations. These are nice-to-have improvements. Iceboxing.

Action item - send notifications to a common team channel via email (Digital Journeys email or Teams channel email)
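
One possible shape for this action item, sketched with Python's standard library only. The addresses, the SMTP host and port, and the function names are all hypothetical placeholders, since the thread does not say which mailbox or mail relay would be used.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical addresses: replace with the shared mailbox or the channel's email address.
SHARED_ADDRESS = "team-channel@example.invalid"
SENDER = "alerts@example.invalid"


def build_alert_email(
    component: str,
    status: str,
    sender: str = SENDER,
    recipient: str = SHARED_ADDRESS,
) -> EmailMessage:
    """Build a plain-text alert addressed to the shared channel, not to individuals."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"[health-check] {component}: {status}"
    message.set_content(f"Component {component} reported status {status}.")
    return message


def send_alert_email(message: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(message)
```

Keeping the recipient in a single constant means every alert source goes to the same shared destination, so changing the channel later is a one-line edit.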