Open cben opened 9 years ago
Also consider client-side performance and JS error logging. It can't catch the server not responding, though.
E.g. http://www.lognormal.com/boomerang/doc/ & https://github.com/nature/boomcatch, https://github.com/getsentry/sentry (I'm tempted to write the data to Firebase instead of running a separate server, but realistically won't ever get around to coding that. Also, at some point I'll want an open backend alternative to Firebase.)
https://blog.openshift.com/openshift-logs-metrics-management-logstash-graphite/ lists open logging/monitoring tools [for OpenShift Enterprise, not sure what applies to OpenShift Online]
Collecting client-side errors would be nice, e.g. I would have learnt of #85 much earlier. More tools at https://github.com/cjbarber/ToolsOfTheTrade#errorexception-handling Reminder to self: if I implement that, I should scrub the console log of fragments of document content (notably CM-MJ spews all formulas).
The main lessons from #100 (40 min downtime on 2015-06-12, few leads as to why) are:
Unrelated: looking at pingdom and uptimerobot, I see occasional latency spikes to 1.5–2 sec. Why? Are these something negligible like a single random lost packet, or are they times of slowness when loading the whole page would take tens of seconds? Also, pingdom believes the baseline latency is >300ms while uptimerobot has <100ms. Probably because they're pinging from different geographic locations. Would be nice to have server-side latency metrics. Better yet (in case the server is too loaded to even accept connections quickly), haproxy-side metrics.
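One rough way to tell a one-off spike from sustained slowness is to take a burst of samples from a single location and look at the distribution. This is only a sketch under assumptions: the `https://www.mathdown.net/` URL scheme, the 1.5 s spike threshold (taken from the spikes mentioned above), and the function names `time_request`, `summarize` and `probe` are all mine, and it measures from wherever the script runs, not server-side or haproxy-side.

```python
import statistics
import time
import urllib.request


def time_request(url: str, timeout: float = 10.0) -> float:
    """Return the seconds taken to fetch `url` and read the full response body."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


def summarize(samples: list[float], spike_threshold: float = 1.5) -> dict:
    """Summarize latency samples (seconds); count those at or above the threshold."""
    ordered = sorted(samples)
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "spikes": sum(1 for s in ordered if s >= spike_threshold),
    }


def probe(url: str, rounds: int = 20, pause: float = 1.0) -> dict:
    """Fetch `url` repeatedly with a pause between requests and summarize timings."""
    samples = []
    for _ in range(rounds):
        samples.append(time_request(url))
        time.sleep(pause)
    return summarize(samples)


if __name__ == "__main__":
    # Assumed URL; any of the domain variations could be substituted.
    print(probe("https://www.mathdown.net/"))
```

If `spikes` is 1 out of 20 and the median stays low, it's probably the negligible kind; several spikes in a row would point to real slowness. Note that `urlopen` raises on timeouts and HTTP errors, so a down server aborts the run rather than being counted.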
As a lesson from #104 (huge TTL on mathdown.{net,com} but OK on www.mathdown.{net,com}), I've added all domain variations (and the underlying mathdown-cben.rhcloud.com) to Uptimerobot (less used ones with 30min freq).
Another lesson is that I learned to use Pingdom and it's indeed more informative on failures (shows IP tried, runs traceroute, full HTTP exchange). Probably should upgrade to paid plan and add all domain variations there.
FOSS alternatives to check out: http://cabotapp.com/ https://github.com/fzaninotto/uptime
Found the haproxy config: haproxy/conf/haproxy.cfg under the home dir on the first gear (apparently generated from https://github.com/openshift/origin-server/blob/master/cartridges/openshift-origin-cartridge-haproxy/versions/1.4/configuration/haproxy.cfg.erb)
Still don't understand the haproxy.log format — it's not the "httplog" configured there.
Lesson from #117: it's hard to understand how realistic DNS-caching users experience DNS flips. In this instance Pingdom saw the flip immediately, while Uptimerobot apparently used outdated DNS for an hour or two (inferred, there is no info).
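To get at least one more data point on what a caching resolver sees during a flip, a small script could resolve all the domain variations through the local resolver and group them by the addresses returned. This is a sketch, not a substitute for real user data: it only shows the view of whatever resolver (and cache) the machine running it uses, and the function names `resolve` and `group_by_addresses` are hypothetical.

```python
import socket


def resolve(hostname: str) -> set[str]:
    """Return the set of IP addresses the local resolver gives for `hostname`."""
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def group_by_addresses(hostnames, resolver=resolve) -> dict:
    """Map each distinct address set to the hostnames that resolve to it."""
    groups = {}
    for name in hostnames:
        try:
            addresses = frozenset(resolver(name))
        except OSError:
            # Resolution failed (socket.gaierror is a subclass of OSError).
            addresses = frozenset()
        groups.setdefault(addresses, []).append(name)
    return groups


if __name__ == "__main__":
    names = [
        "mathdown.net",
        "mathdown.com",
        "www.mathdown.net",
        "www.mathdown.com",
        "mathdown-cben.rhcloud.com",
    ]
    for addresses, hosts in group_by_addresses(names).items():
        print(sorted(addresses) or "unresolved", "<-", hosts)
```

Running it from a couple of machines right after a flip would show whether any variation still points at the old address; more than one group in the output means the names disagree.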
Pingdom is reducing features on free plan: https://www.pingdom.com/planfree Notably, I'll lose: Public static page, 1min->5min freq, Root cause analysis (extra probing when down).
I'm getting a free Starter trial till January 28, and can upgrade until Dec 29 for $7/mo for the first year. That's somewhat tempting, but I'm more interested in reducing expenditure now.
I've been on https://updown.io/ for a while, pretty happy with it.
Currently I won't be notified if the app on RHcloud/Heroku is overloaded or returning errors (unless it's bad enough for a pingdom/uptimerobot check to fail). What's worse, I don't have any good way to observe current/recent status!