Response time
Errors
CPU runtime / storage / memory
Apdex (an amalgamated metric approximating user satisfaction)
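To make the Apdex item concrete, here is a minimal sketch of the standard Apdex calculation: samples at or under a target time T count as satisfied, samples between T and 4T count as tolerating (half weight), and the rest count as frustrated. The function name and the sample values are illustrative assumptions, not taken from any monitoring tool's API.

```python
def apdex(response_times, threshold):
    """Return the Apdex score (0.0 to 1.0) for a list of response times.

    response_times: durations in seconds (hypothetical sample data).
    threshold: the target time T, in seconds, chosen by the team.
    """
    if not response_times:
        raise ValueError("need at least one sample")
    # Satisfied: at or under T. Tolerating: over T but at or under 4T.
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Frustrated samples (over 4T) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(response_times)


if __name__ == "__main__":
    # Two satisfied, one tolerating, one frustrated sample with T = 0.5s.
    print(apdex([0.1, 0.2, 0.6, 3.0], threshold=0.5))
```

A score near 1.0 means most requests met the target; which value of T to use is a decision for the team.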
Alerts should:
Be actionable
Include a permalink to the metric, or to a dashboard displaying all relevant metrics, along with a runbook and/or a troubleshooting guide
Be checked mostly after codebase changes, but also alert us to issues we can't troubleshoot ourselves so that they can be ticketed to Acquia
Alerts can:
Monitor modules/hooks
Monitor SQL DB
Monitor other things that aren't directly relevant (?) but may be useful for troubleshooting
A reasonable rule of thumb: we should have an alert when, over a 5-15 minute period, something takes 10x the time it normally would, drops to 10% of its normal quality, or exceeds some error threshold (not yet defined).
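The rule of thumb above could be sketched as a single check over one evaluation window. Everything named here (`WindowStats`, `should_alert`, the field names) is hypothetical, and the error threshold is left as an optional parameter because the notes do not define one. The baselines are assumed to come from whatever "normal" measurements the team already has.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WindowStats:
    """Measurements aggregated over one 5-15 minute window (hypothetical shape)."""
    duration: float           # observed time, e.g. mean response time in seconds
    baseline_duration: float  # what that time normally is
    quality: float            # observed quality score, e.g. Apdex
    baseline_quality: float   # what that score normally is
    error_count: int          # errors seen in the window


def should_alert(
    stats: WindowStats,
    error_threshold: Optional[int] = None,  # undefined in the notes, so optional
    slow_factor: float = 10.0,              # "10x the time they'd normally take"
    quality_fraction: float = 0.10,         # "10% the quality they'd normally have"
) -> List[str]:
    """Return the reasons to alert; an empty list means no alert."""
    reasons = []
    if stats.duration >= slow_factor * stats.baseline_duration:
        reasons.append("slow")
    if stats.quality <= quality_fraction * stats.baseline_quality:
        reasons.append("quality")
    # The error check only runs once a threshold has been agreed on.
    if error_threshold is not None and stats.error_count > error_threshold:
        reasons.append("errors")
    return reasons


if __name__ == "__main__":
    window = WindowStats(duration=2.5, baseline_duration=0.2,
                         quality=0.9, baseline_quality=0.95, error_count=3)
    print(should_alert(window))
```

Returning the list of reasons, instead of a bare boolean, keeps the alert actionable: the message can state which condition fired and link to the matching runbook.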