DerploidEntertainment / Website

Infrastructure as Code and GitHub Pages sources for the Derploid website
https://www.derploid.com/
MIT License

Health check alarms #56

Closed Rabadash8820 closed 2 years ago

Rabadash8820 commented 2 years ago

Closes #43

Rabadash8820 commented 2 years ago

I couldn’t think of a good way to test the latency alarms, but to test the status alarms I did the following:

To test site down for a “long” time:

  1. Broke the GitHub Pages site by setting the wrong Jekyll build folder
  2. Observed the main domain health checks turn unhealthy after three 30-second periods (though it took ~10 minutes for the health checkers to start failing), causing the main domain status alarm to enter the ALARM state after another 1-minute metric period and send an email
  3. Observed the main redirect domain status alarm enter the ALARM state after its 5-minute metric period (though it took almost a full half hour after the site initially went down for the overall health check to become unhealthy, during which roughly 50% of health checkers still reported the site as healthy), causing AlarmUnhealthyWebsiteBreakingRedirects and AlarmRedirectDomainUnhealthy to both enter the ALARM state and send separate emails
  4. Observed other redirect domain health checks remain healthy with 301 response codes
  5. Fixed the GitHub Pages site
  6. Observed all health checks turn back to healthy
  7. Observed the main domain status alarm revert to the OK state after another 1-minute metric period, causing AlarmUnhealthyWebsiteBreakingRedirects to enter the OK state while AlarmRedirectDomainUnhealthy remained in ALARM
  8. Observed the main redirect domain alarm also revert to the OK state after its 5-minute metric period (again, there was a long stretch during which ~75% of health checkers still reported the site as unhealthy after the 18% cutoff had been reached for the overall health check to be healthy), and AlarmRedirectDomainUnhealthy revert to OK
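
For reference, the aggregation and timing behavior seen above can be sketched in a few lines. This is only an illustration, not the alarm configuration from this PR: the function names and the 16-checker count are made up, the 18% cutoff is the documented Route 53 rule (a health check counts as healthy when more than 18% of checkers report healthy), and the timing helper only gives a best-case lower bound (failure threshold × request interval + one alarm period), which the observed ~10 minute and ~30 minute delays clearly exceeded.

```python
# Rough sketch of the behavior observed in the test steps above.
# All names here are hypothetical; nothing is taken from the repo's templates.

# Documented Route 53 rule: a health check is healthy when MORE than 18% of
# its health checkers report the endpoint as healthy.
HEALTHY_CHECKER_CUTOFF = 0.18


def health_check_is_healthy(healthy_checkers: int, total_checkers: int) -> bool:
    """Return True if the aggregated health check would count as healthy."""
    if total_checkers <= 0:
        raise ValueError("total_checkers must be positive")
    return healthy_checkers / total_checkers > HEALTHY_CHECKER_CUTOFF


def min_seconds_to_alarm(
    failure_threshold: int, request_interval_s: int, alarm_period_s: int
) -> int:
    """Best-case delay before an alarm fires (assumes a single evaluation period).

    Real delays can be much longer, e.g. while checkers still get healthy
    responses after the site has actually broken.
    """
    return failure_threshold * request_interval_s + alarm_period_s


if __name__ == "__main__":
    # 50% of checkers healthy: still well above the cutoff, so "healthy".
    print(health_check_is_healthy(8, 16))
    # 75% of checkers unhealthy leaves 25% healthy: already "healthy" again.
    print(health_check_is_healthy(4, 16))
    # Main domain alarm: 3 x 30 s failures + 1-minute metric period.
    print(min_seconds_to_alarm(3, 30, 60))
    # Main redirect domain alarm: 3 x 30 s failures + 5-minute metric period.
    print(min_seconds_to_alarm(3, 30, 300))
```

This would explain why the overall health check stayed healthy while about half the checkers were failing, and why it flipped back to healthy while most checkers still reported failures.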

To test site down “transiently” (not done yet):

  1. Broke the GitHub Pages site by setting the wrong Jekyll build folder
  2. Observed the main and main redirect health checks turn unhealthy after three 30-second periods
  3. Observed main domain status alarm enter the ALARM state after another 1-minute metric period and send an email
  4. Immediately fixed the GitHub Pages site
  5. Observed all health checks turn back to healthy
  6. Observed the main domain status alarm revert to the OK state after another 1-minute metric period
  7. Verified that AlarmUnhealthyWebsiteBreakingRedirects and AlarmRedirectDomainUnhealthy never entered ALARM state or sent emails
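
The final verification step could be partly scripted instead of watching the console. Below is a minimal sketch, assuming the alarm names match the ones mentioned in this thread and that the client is a boto3 CloudWatch client (`boto3.client("cloudwatch")`); the helper names are hypothetical. `AlarmTypes` is passed explicitly because `describe_alarms` returns only metric alarms when it is omitted.

```python
# Hypothetical helpers for checking current alarm states; the client is
# passed in so the functions work with a real boto3 CloudWatch client or a stub.
from typing import Dict, Iterable, List


def alarm_states(cloudwatch, alarm_names: Iterable[str]) -> Dict[str, str]:
    """Map each alarm name to its current state (OK, ALARM, INSUFFICIENT_DATA)."""
    response = cloudwatch.describe_alarms(
        AlarmNames=list(alarm_names),
        # Ask for both kinds; composite alarms are not returned by default.
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    alarms = response.get("MetricAlarms", []) + response.get("CompositeAlarms", [])
    return {alarm["AlarmName"]: alarm["StateValue"] for alarm in alarms}


def alarms_in_alarm_state(cloudwatch, alarm_names: Iterable[str]) -> List[str]:
    """Return the sorted names of the given alarms that are currently in ALARM."""
    states = alarm_states(cloudwatch, alarm_names)
    return sorted(name for name, state in states.items() if state == "ALARM")


# Example usage (requires boto3 and AWS credentials, so left commented out):
#   import boto3
#   client = boto3.client("cloudwatch")
#   print(alarms_in_alarm_state(client, [
#       "AlarmUnhealthyWebsiteBreakingRedirects",
#       "AlarmRedirectDomainUnhealthy",
#   ]))
```

Note that this only reports the state at the moment of the call. Confirming that an alarm never entered ALARM during the outage would still need the alarm history (for example via `describe_alarm_history`) or the absence of notification emails.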