Monitoring & alerting

cben commented 9 years ago

Currently I won't be notified if the app on RHcloud/Heroku is overloaded or returning errors (unless it's bad enough for a pingdom/uptimerobot check to fail). What's worse, I don't have any good way to observe current/recent status!

cben commented 9 years ago

Also consider client-side performance and JS error logging. It can't catch the server not responding, though.

E.g. http://www.lognormal.com/boomerang/doc/ & https://github.com/nature/boomcatch, https://github.com/getsentry/sentry (I'm tempted to write the data to Firebase instead of running a separate server, but won't realistically ever get around to coding that. Also, at some point I'll want an open backend alternative to Firebase.)

cben commented 9 years ago

https://blog.openshift.com/openshift-logs-metrics-management-logstash-graphite/ lists open logging/monitoring tools [for OpenShift Enterprise, not sure what applies to OpenShift Online]

cben commented 9 years ago

Collecting client-side errors would be nice, e.g. I would have learnt of #85 much earlier. More tools at https://github.com/cjbarber/ToolsOfTheTrade#errorexception-handling Reminder to self: if I implement that, I should scrub the console log of fragments of document content (notably CM-MJ spews all formulas).
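
A minimal sketch of the scrubbing idea, in Python for brevity (the same regex approach would port to the client side). Everything here is an assumption for illustration: that leaked formulas appear between `$...$` or `$$...$$` delimiters, the `scrub` name, the `[math]` placeholder and the 200-character cap.

```python
import re

# Assumption: document fragments that leak into log messages are formulas
# delimited by $...$ (inline) or $$...$$ (display, possibly multi-line).
_MATH = re.compile(r"\$\$.+?\$\$|\$[^$\n]+?\$", re.DOTALL)

# Hypothetical cap so that long dumps of document text are not shipped whole.
MAX_LEN = 200


def scrub(message: str) -> str:
    """Replace formula-like fragments with a placeholder and cap the length."""
    cleaned = _MATH.sub("[math]", message)
    if len(cleaned) > MAX_LEN:
        cleaned = cleaned[:MAX_LEN] + "...[truncated]"
    return cleaned
```

This deliberately over-redacts: a message with two unrelated dollar signs on one line loses the text between them, which seems the safer failure mode when the goal is to keep document content out of error reports.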

cben commented 9 years ago

The main lessons from #100 (40 min downtime on 2015-06-12, few leads as to why) are:

[ ] improve my understanding of haproxy (and in particular learn to read or change its log time(?) format).
[ ] on-server resource metrics (e.g. was it running out of cpu or ram to the point it became unresponsive?)
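
For the resource metrics item, a rough sketch of a periodic snapshot, assuming a Linux host with a readable `/proc/meminfo`. The function names and the record fields are made up for illustration, and on a shared host these numbers may describe the whole machine rather than the gear's own quota.

```python
import os
import time


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])  # kB for most fields
    return values


def snapshot(meminfo_path: str = "/proc/meminfo") -> dict:
    """Collect one timestamped sample of load average and memory headroom."""
    load1, load5, load15 = os.getloadavg()
    with open(meminfo_path) as f:
        mem = parse_meminfo(f.read())
    return {
        "time": time.time(),
        "load1": load1,
        "load5": load5,
        # MemAvailable is missing on older kernels, hence .get() giving None.
        "mem_available_kb": mem.get("MemAvailable"),
        "mem_free_kb": mem.get("MemFree"),
        "swap_free_kb": mem.get("SwapFree"),
    }
```

Appending one `snapshot()` record per minute to a file would be enough to answer, after the fact, whether cpu or ram ran out around an outage.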

Unrelated: looking at pingdom and uptimerobot, I see occasional latency spikes to 1.5–2sec. Why? Are these something negligible like a single random lost packet, or are they times of slowness when loading the whole page would take tens of seconds? Also, pingdom believes the baseline latency is >300ms while uptimerobot has <100ms. Probably because they're pinging from different geographic locations. Would be nice to have server-side latency metrics. Better yet (in case the server is too loaded to even accept connections quickly), haproxy-side metrics.

cben commented 9 years ago

[ ] log responses: at least non-200 responses, and perhaps latency (ideally first and last byte?) for all requests.
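
To make the response logging item concrete, here is a sketch as Python WSGI middleware, whatever the real server is written in. The `ResponseLogger` class, its `emit` hook and the record fields are hypothetical. "First byte" and "last byte" here mean the moment a chunk is handed to the WSGI server, not the moment it reaches the client.

```python
import logging
import time

log = logging.getLogger("access")


class ResponseLogger:
    """Wrap a WSGI app; record status plus first/last-byte latency."""

    def __init__(self, app, log_all=False, emit=None):
        self.app = app
        self.log_all = log_all  # False: only non-200 responses are recorded
        self.emit = emit or (lambda record: log.warning("%s", record))

    def __call__(self, environ, start_response):
        started = time.monotonic()
        seen = {}

        def recording_start_response(status, headers, exc_info=None):
            seen["status"] = status  # e.g. "404 Not Found"
            return start_response(status, headers, exc_info)

        body = self.app(environ, recording_start_response)
        return self._timed(body, environ, seen, started)

    def _timed(self, body, environ, seen, started):
        first_byte = None
        try:
            for chunk in body:
                if first_byte is None and chunk:
                    first_byte = time.monotonic() - started
                yield chunk
        finally:
            close = getattr(body, "close", None)
            if close is not None:
                close()
            status = seen.get("status", "???")
            if self.log_all or not status.startswith("200"):
                self.emit({
                    "method": environ.get("REQUEST_METHOD"),
                    "path": environ.get("PATH_INFO"),
                    "status": status,
                    "first_byte_s": first_byte,
                    "last_byte_s": time.monotonic() - started,
                })
```

Passing `log_all=True` would cover the "latency for all requests" variant, at the cost of a much noisier log.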

cben commented 9 years ago

As a lesson from #104 (huge TTL on mathdown.{net,com} but OK on www.mathdown.{net,com}), I've added all domain variations (and the underlying mathdown-cben.rhcloud.com) to Uptimerobot (less used ones with 30min freq).

Another lesson is that I learned to use Pingdom and it's indeed more informative on failures (shows IP tried, runs traceroute, full HTTP exchange). Probably should upgrade to paid plan and add all domain variations there.

cben commented 9 years ago

FOSS alternatives to check out: http://cabotapp.com/ https://github.com/fzaninotto/uptime

cben commented 9 years ago

Found the haproxy config: haproxy/conf/haproxy.cfg under the home dir on the first gear (apparently generated from https://github.com/openshift/origin-server/blob/master/cartridges/openshift-origin-cartridge-haproxy/versions/1.4/configuration/haproxy.cfg.erb). Still don't understand the haproxy.log format; it's not the "httplog" configured there.

cben commented 9 years ago

Lesson from #117: it's hard to understand how realistic DNS-caching users experience DNS flips. In this instance Pingdom saw the flip immediately, while Uptimerobot apparently used outdated DNS for an hour or two (inferred, there is no info).

cben commented 8 years ago

Pingdom is reducing features on the free plan: https://www.pingdom.com/planfree Notably, I'll lose: Public static page, 1min->5min freq, Root cause analysis (extra probing when down).

I'm getting a free Starter trial till January 28, and can upgrade until Dec 29 for $7/mo for the first year. That's somewhat tempting, but I'm more interested in reducing expenditure now.

cben commented 1 year ago

I've been on https://updown.io/ for a while, pretty happy with it.

[ ] update deployment/README.md
[ ] write down where I've set up TLS cert monitoring

cben / mathdown

Monitoring & alerting #78