I couldn’t think of a good way to test the latency alarms, but to test the status alarms I did the following:
To test site down for a “long” time:
Broke the GitHub Pages site by setting the wrong Jekyll build folder
Observed the main domain health checks turn unhealthy after three 30-second periods (although it took ~10 minutes for the health checkers to start failing), causing the main domain status alarm to enter the ALARM state after another 1-minute metric period and send an email
Observed the main redirect domain status alarm enter the ALARM state after its 5-minute metric period (although it took almost a full half hour after the site initially went down for the overall health check to become unhealthy, during which roughly 50% of health checkers still reported the site as healthy), causing AlarmUnhealthyWebsiteBreakingRedirects and AlarmRedirectDomainUnhealthy to both enter the ALARM state and send separate emails
Observed other redirect domain health checks remain healthy with 301 response codes
Fixed the GitHub Pages site
Observed all health checks turn back to healthy
Observed the main domain status alarm revert to the OK state after another 1-minute metric period, causing AlarmUnhealthyWebsiteBreakingRedirects to enter the OK state while AlarmRedirectDomainUnhealthy remained in ALARM
Observed the main redirect domain status alarm also revert to the OK state after its 5-minute metric period (again, there was a long stretch during which ~75% of health checkers still reported the site as unhealthy, even though the 18% cutoff for the overall health check to be healthy had already been reached), and AlarmRedirectDomainUnhealthy revert to OK
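For reviewers wondering why the overall health check stayed healthy while half of the checkers were failing, here is a minimal sketch of the 18% cutoff mentioned above. It assumes the rule as I understand it from the Route 53 documentation (healthy when more than 18% of checkers report healthy); the function name and the checker counts are made up for illustration and are not part of this stack.

```python
# Assumed rule: the overall health check is healthy when MORE than 18% of
# health checkers report the endpoint as healthy.
HEALTHY_PERCENT_CUTOFF = 18


def overall_check_is_healthy(healthy_checkers: int, total_checkers: int) -> bool:
    """Hypothetical helper: aggregate individual checker results."""
    if total_checkers <= 0:
        raise ValueError("total_checkers must be positive")
    if not 0 <= healthy_checkers <= total_checkers:
        raise ValueError("healthy_checkers must be between 0 and total_checkers")
    # Integer arithmetic avoids floating-point edge cases at exactly 18%.
    return healthy_checkers * 100 > HEALTHY_PERCENT_CUTOFF * total_checkers


# While breaking the site: ~50% of checkers still healthy, so still healthy overall.
going_down = overall_check_is_healthy(50, 100)
# While recovering: ~75% of checkers still unhealthy, but 25% healthy is enough.
recovering = overall_check_is_healthy(25, 100)
# At exactly 18% healthy, the check counts as unhealthy.
at_cutoff = overall_check_is_healthy(18, 100)
```

This asymmetry would explain why going unhealthy took much longer than recovering: nearly all checkers have to fail before the overall check flips, but only a few need to succeed for it to flip back.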
To test site down “transiently” (not done yet):
Broke the GitHub Pages site by setting the wrong Jekyll build folder
Observed the main and main redirect health checks turn unhealthy after three 30-second periods
Observed main domain status alarm enter the ALARM state after another 1-minute metric period and send an email
Immediately fixed the GitHub Pages site
Observed all health checks turn back to healthy
Observed the main domain status alarm revert to the OK state after another 1-minute metric period
Verified that AlarmUnhealthyWebsiteBreakingRedirects and AlarmRedirectDomainUnhealthy never entered ALARM state or sent emails
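The alarm transitions in both test runs are consistent with the simple model sketched below. This is only my reading of the observed behavior, not the actual alarm definitions: it assumes AlarmUnhealthyWebsiteBreakingRedirects fires when both the main domain and main redirect status alarms are in ALARM, and that AlarmRedirectDomainUnhealthy follows the main redirect status alarm. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusAlarms:
    """Hypothetical snapshot of the two underlying status alarms."""
    main_domain_in_alarm: bool
    main_redirect_in_alarm: bool


def downstream_alarms(status: StatusAlarms) -> dict[str, bool]:
    """Assumed model of the two downstream alarms (True means ALARM)."""
    return {
        # Assumption: needs both the site and its main redirect to be down.
        "AlarmUnhealthyWebsiteBreakingRedirects": (
            status.main_domain_in_alarm and status.main_redirect_in_alarm
        ),
        # Assumption: tracks the main redirect status alarm only; the real
        # alarm may also cover the other redirect domains.
        "AlarmRedirectDomainUnhealthy": status.main_redirect_in_alarm,
    }


# "Long" outage: both status alarms fire, so both downstream alarms fire.
long_outage = downstream_alarms(StatusAlarms(True, True))
# Recovery: the 1-minute main alarm clears before the 5-minute redirect alarm.
mid_recovery = downstream_alarms(StatusAlarms(False, True))
# "Transient" outage: only the 1-minute main domain status alarm fires.
transient = downstream_alarms(StatusAlarms(True, False))
```

Under this model, the transient case never trips either downstream alarm because the main redirect status alarm's 5-minute metric period does not elapse before the fix.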
Closes #43