hannes-ucsc closed 8 months ago
@hannes-ucsc: "To be specific, we need alarms on the Errors and Throttles metrics."
Spike to determine good thresholds based on several past prod reindexes.
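One possible way to turn the spike's findings into a number is sketched below. It is only an illustration, not the method used in this issue: the function name `suggest_threshold`, the `headroom` factor, and the sample counts are all hypothetical. It assumes the per-period sums of a metric such as Errors have already been collected for each past reindex, takes the highest peak seen across them, and adds a margin.

```python
import math


def suggest_threshold(samples_per_reindex, headroom=1.5):
    """Suggest an alarm threshold from per-period metric sums.

    samples_per_reindex: one list of per-period counts for each past reindex
    (hypothetical input shape). headroom: assumed safety margin above the
    highest observed peak.
    """
    # Highest per-period count observed in each reindex; skip empty series.
    peaks = [max(samples) for samples in samples_per_reindex if samples]
    if not peaks:
        # No history: any failure at all would exceed the threshold.
        return 0
    return math.ceil(max(peaks) * headroom)


# Made-up counts from two past reindexes, for illustration only.
history = [[0, 2, 5], [1, 8, 3]]
print(suggest_threshold(history))
```

A threshold of zero falls out naturally for lambdas that have never failed, which matches the expectation that some of them should never fail.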
Security review: This change affects the monitoring of the system. It adds alarms that go off when the lambda fails more frequently than is expected. Some lambdas (like the log forwarder) should never fail while some lambdas are expected to fail under pressure, like during a reindex. Specific configurable thresholds address that variability. Overall, this change improves our security posture by increasing the team's awareness of potential problems in a timely manner.
To be demoed as part of 4997.
For the indexer lambdas, use the dashboard as a guide. There are various graphs that show Lambda metrics. Create metric alarms for them. Base the thresholds on the most recent complete reindex in prod with two dcpNN catalogs.
For the service lambdas, use the same metric, but base the thresholds on several weeks of values from prod.
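As a rough sketch of what one such alarm could look like, the snippet below builds the keyword arguments for CloudWatch's `put_metric_alarm` call for the Errors and Throttles metrics in the `AWS/Lambda` namespace. The helper `lambda_alarm_params`, the function names, the alarm naming scheme, and every threshold value are hypothetical placeholders. The project may well define its alarms differently (for example in Terraform), so treat this as an illustration of the parameters involved rather than the actual change.

```python
def lambda_alarm_params(function_name, metric_name, threshold, period=300):
    """Build keyword arguments for CloudWatch `put_metric_alarm`.

    The alarm fires when the sum of the metric over one period (in seconds)
    exceeds the threshold. The naming scheme is an assumption.
    """
    if metric_name not in ('Errors', 'Throttles'):
        raise ValueError(f'Unsupported metric: {metric_name}')
    return {
        'AlarmName': f'{function_name}-{metric_name.lower()}',
        'Namespace': 'AWS/Lambda',
        'MetricName': metric_name,
        'Dimensions': [{'Name': 'FunctionName', 'Value': function_name}],
        'Statistic': 'Sum',
        'Period': period,
        'EvaluationPeriods': 1,
        'Threshold': float(threshold),
        'ComparisonOperator': 'GreaterThanThreshold',
        # Periods without invocations report no data; do not alarm on them.
        'TreatMissingData': 'notBreaching',
    }


# Placeholder function names and thresholds, NOT values from the spike.
thresholds = {
    'log-forwarder': {'Errors': 0, 'Throttles': 0},  # should never fail
    'indexer': {'Errors': 10, 'Throttles': 50},      # may fail under load
}

alarms = [
    lambda_alarm_params(function_name, metric_name, threshold)
    for function_name, metrics in thresholds.items()
    for metric_name, threshold in metrics.items()
]
print(len(alarms), alarms[0]['AlarmName'])
```

With boto3 available, each dictionary could be passed along as `boto3.client('cloudwatch').put_metric_alarm(**params)`. Keeping the thresholds in one mapping keeps the per-lambda variability configurable in a single place.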