NYPL / engineering-general

Standards, values, and other information relevant to the NYPL Engineering Team.

Draft post mortem es deletion #86

nonword closed 5 years ago

nonword commented 5 years ago

Adds a post-mortem for the recent SCC outage.

nonword commented 5 years ago
  1. Does this mean it should be standard practice to have alarms looking for degraded health?

Perhaps! It could be a new stated SHOULD under "Alerting", for example. I'm not sure it's a MUST; it might depend on the app. I would think an app with decent error logging would trigger alarms at the same time for most issues. Having lots of alarms firing off false positives is a possibility too, so I'm not ready to say all services should do this yet.
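To make the idea concrete, here is a rough sketch of what a degraded-health alarm could look like. It assumes the cluster is an Amazon Elasticsearch Service domain publishing the `ClusterStatus.red` / `ClusterStatus.yellow` metrics to CloudWatch under the `AWS/ES` namespace; the function names, domain name, account ID, and SNS topic are all hypothetical placeholders, and the thresholds are guesses rather than recommendations. Requiring several consecutive bad periods is one way to keep bouncy health from paging anyone on a single blip.

```python
from typing import Any, Dict


def build_cluster_health_alarm(
    domain_name: str,
    account_id: str,
    sns_topic_arn: str,
    status: str = "red",
    consecutive_minutes: int = 5,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    Hypothetical helper: the names and default thresholds are
    illustrative assumptions, not an agreed standard.
    """
    if status not in ("red", "yellow"):
        raise ValueError("status must be 'red' or 'yellow'")
    return {
        "AlarmName": f"{domain_name}-cluster-status-{status}",
        "Namespace": "AWS/ES",
        "MetricName": f"ClusterStatus.{status}",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        # The metric is 1 while the cluster is in this state, else 0.
        "Statistic": "Maximum",
        "Period": 60,
        # Only alarm after N consecutive one-minute periods in this
        # state, to reduce false positives from brief flapping.
        "EvaluationPeriods": consecutive_minutes,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Usage would be something like `create_alarm(build_cluster_health_alarm("my-domain", "123456789012", topic_arn))`. Whether red-only or red-plus-yellow is the right trigger probably depends on the app, which is part of why this reads more like a SHOULD than a MUST.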

  1. Does the fact that the health jumped around for a while before the new index was created/ ...

Yeah, I can't really sort out whether the bouncy health was a cause or a symptom. The timing probably tells someone something, but it doesn't tell me much. The granularity isn't great, and we didn't have error logging during the incident.

nonword commented 5 years ago

Following up on the IRL discussion, I looked into whether it's possible to log some or all queries. Ideally one could opt to log just index operations, for example. The product appears to offer three kinds of logging, but none of them suits us. ("Application" logging is enabled but doesn't log index operations; the other two are "slow" logs.) There is also a CloudTrail connection for logging calls to the configuration API, but that would only capture domain-level operations (e.g. domain creation, configuration changes), not index-level operations. As far as I can tell, there's no way to log all queries (or all queries of a certain type).