DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Alarms for Lambda metrics #5060

Closed hannes-ucsc closed 8 months ago

hannes-ucsc commented 1 year ago

For the indexer lambdas, use the dashboard as a guide. Several of its graphs show Lambda metrics. Create metric alarms for those metrics. Base the thresholds on the most recent complete reindex in prod with two dcpNN catalogs.

For the service lambdas, use the same metrics, but base the thresholds on several weeks of values from prod.


dsotirho-ucsc commented 1 year ago

@hannes-ucsc: "To be specific, we need alarms on the Errors and Throttles metrics."
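
For illustration, here is a minimal sketch of what an alarm on each of those two metrics could look like. It only builds the keyword arguments that CloudWatch's `put_metric_alarm` call accepts. The helper `lambda_alarm_params`, the function name `example-function`, the alarm naming scheme, the five-minute period and the zero threshold are all assumptions made for this example, not what the project actually deployed.

```python
from typing import Any

# CloudWatch namespace under which AWS publishes Lambda metrics
LAMBDA_NAMESPACE = 'AWS/Lambda'


def lambda_alarm_params(function_name: str,
                        metric_name: str,
                        threshold: float,
                        period_seconds: int = 300) -> dict[str, Any]:
    """
    Build keyword arguments for CloudWatch's put_metric_alarm.
    Hypothetical helper; the alarm name and defaults are assumptions.
    """
    if metric_name not in ('Errors', 'Throttles'):
        raise ValueError(f'Unsupported metric: {metric_name}')
    return {
        'AlarmName': f'{function_name}-{metric_name.lower()}',
        'Namespace': LAMBDA_NAMESPACE,
        'MetricName': metric_name,
        'Dimensions': [{'Name': 'FunctionName', 'Value': function_name}],
        # Count the events that occurred within each period
        'Statistic': 'Sum',
        'Period': period_seconds,
        'EvaluationPeriods': 1,
        'Threshold': threshold,
        'ComparisonOperator': 'GreaterThanThreshold',
        # A period without invocations reports no data; don't alarm on that
        'TreatMissingData': 'notBreaching',
    }


# One alarm per metric for a single (hypothetical) function
alarms = [
    lambda_alarm_params('example-function', metric, threshold=0)
    for metric in ('Errors', 'Throttles')
]
```

Each dictionary could then be passed along the lines of `boto3.client('cloudwatch').put_metric_alarm(**params)`, or translated into the equivalent infrastructure-as-code resource.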

dsotirho-ucsc commented 1 year ago

Spike to determine good thresholds based on several past prod reindexes.
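
One possible way to turn past observations into a threshold is sketched below: take the worst per-period count seen during previous reindexes and add some headroom. The function `suggest_threshold`, the headroom factor of 1.5 and the sample numbers are invented for illustration. They are not measurements from prod, and the spike may well settle on a different rule.

```python
import math


def suggest_threshold(period_sums: list[float], headroom: float = 1.5) -> int:
    """
    Suggest an alarm threshold from historical per-period sums of a metric
    (e.g. Errors per five minutes). Hypothetical heuristic: the highest
    observed value, scaled by a headroom factor and rounded up.
    """
    if not period_sums:
        raise ValueError('Need at least one historical data point')
    return math.ceil(max(period_sums) * headroom)


# Made-up per-period error counts from a past reindex
history = [0, 3, 12, 7, 0]
print(suggest_threshold(history))  # prints 18
```

A lambda that never failed in the observed history ends up with a threshold of 0 under this rule, so any failure at all would trigger its alarm.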

hannes-ucsc commented 1 year ago

Security review: This change affects the monitoring of the system. It adds alarms that go off when a lambda fails more frequently than expected. Some lambdas (like the log forwarder) should never fail, while others are expected to fail under pressure, such as during a reindex. Specific, configurable thresholds address that variability. Overall, this change improves our security posture by increasing the team's awareness of potential problems in a timely manner.
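
The per-lambda variability described above could be captured in a small lookup table, sketched here under loud assumptions: the lambda names, the numbers and the helper `threshold_for` are hypothetical and do not reflect the project's actual configuration.

```python
# Strictest setting: alarm on any failure. Used for lambdas not listed below.
DEFAULT_THRESHOLD = 0

# Hypothetical thresholds, keyed by (lambda name, metric name)
THRESHOLDS: dict[tuple[str, str], int] = {
    # Expected to fail occasionally under load, e.g. during a reindex
    ('indexer', 'Errors'): 10,
    ('indexer', 'Throttles'): 20,
    # Should never fail
    ('log-forwarder', 'Errors'): 0,
    ('log-forwarder', 'Throttles'): 0,
}


def threshold_for(lambda_name: str, metric_name: str) -> int:
    """
    Look up the alarm threshold for a lambda and metric, falling back to
    the strict default when no specific value is configured.
    """
    return THRESHOLDS.get((lambda_name, metric_name), DEFAULT_THRESHOLD)
```

Defaulting to zero means that a newly added lambda gets the strictest alarm until someone deliberately relaxes it.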

hannes-ucsc commented 1 year ago

To be demoed as part of 4997.