Alarms for Lambda metrics #5060

Status: Closed (hannes-ucsc closed 8 months ago)

For the indexer lambdas, use the dashboard as a guide: it contains various graphs that show Lambda metrics. Create a metric alarm for each of them. Base the thresholds on the most recent complete reindex in prod with two dcpNN catalogs.

For the service lambdas, use the same metrics, but base the thresholds on several weeks of values from prod.

- [x] Security design review completed; the Resolution of this issue does not …
  - [x] … affect authentication; for example:
    - OAuth 2.0 with the application (API or Swagger UI)
    - Authentication of developers with Google Cloud APIs
    - Authentication of developers with AWS APIs
    - Authentication with a GitLab instance in the system
    - Password and 2FA authentication with GitHub
    - API access token authentication with GitHub
    - Authentication with
  - [x] … affect the permissions of internal users like access to
    - Cloud resources on AWS and GCP
    - GitLab repositories, projects and groups, administration
    - an EC2 instance via SSH
    - GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
  - [x] … affect the permissions of external users like access to
    - TDR snapshots
  - [x] … affect permissions of service or bot accounts
    - Cloud resources on AWS and GCP
  - [x] … affect audit logging in the system, like
    - adding, removing or changing a log message that represents an auditable event
    - changing the routing of log messages through the system
  - [ ] ~… affect monitoring of the system~ https://github.com/DataBiosphere/azul/issues/5060#issuecomment-1635031718
  - [x] … introduce a new software dependency like
    - Python packages on PYPI
    - Command-line utilities
    - Docker images
    - Terraform providers
  - [x] … add an interface that exposes sensitive or confidential data at the security boundary
  - [x] … affect the encryption of data at rest
  - [x] … require persistence of sensitive or confidential data that might require encryption at rest
  - [x] … require unencrypted transmission of data within the security boundary
  - [x] … affect the network security layer; for example by
    - modifying, adding or removing firewall rules
    - modifying, adding or removing security groups
    - changing or adding a port a service, proxy or load balancer listens on
- [x] Documentation on any unchecked boxes is provided in comments below

@hannes-ucsc: "To be specific, we need alarms on the Errors and Throttles metrics."

Spike to determine good thresholds based on several past prod reindexes.
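
One possible way to turn the spike's findings into a number is sketched below. It assumes the per-period sums of a metric (for example `Errors` per five-minute period) from past prod reindexes have already been exported into a plain list; the function name `suggest_threshold`, the `headroom` factor and the sample values are all hypothetical and not taken from the Azul code base.

```python
import math


def suggest_threshold(samples, headroom=1.5):
    """
    Suggest an alarm threshold from historical per-period metric sums.

    `samples` holds the values observed during past reindexes (assumed input).
    `headroom` is an arbitrary safety factor applied to the observed peak so
    that a reindex behaving like previous ones does not trip the alarm.
    """
    if not samples:
        raise ValueError('Need at least one historical sample')
    peak = max(samples)
    # Round up so that the threshold is a whole number of errors or throttles
    return math.ceil(peak * headroom)


if __name__ == '__main__':
    # Made-up error counts per period, for illustration only
    print(suggest_threshold([0, 3, 12, 7]))
```

Using the observed peak plus headroom is only one choice; a high percentile over several reindexes may be more robust against a single outlier.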

Security review: This change affects the monitoring of the system. It adds alarms that go off when a lambda fails more frequently than expected. Some lambdas (like the log forwarder) should never fail, while others are expected to fail under pressure, such as during a reindex. Thresholds that are configurable per lambda address that variability. Overall, this change improves our security posture by making the team aware of potential problems in a timely manner.
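
As a rough illustration of per-lambda thresholds, the sketch below builds Terraform JSON for `aws_cloudwatch_metric_alarm` resources on the `Errors` and `Throttles` metrics in the `AWS/Lambda` namespace. The function names, threshold values, period and alarm naming scheme are assumptions made for this example, not the values or layout used in Azul.

```python
import json

# Hypothetical thresholds; real values would come from prod metrics
THRESHOLDS = {
    'indexer-contribute': {'Errors': 10, 'Throttles': 50},
    # A lambda that should never fail alarms on the first error
    'log-forwarder': {'Errors': 0, 'Throttles': 0},
}


def lambda_alarms(thresholds, period=300):
    """
    Return a Terraform JSON fragment with one metric alarm per function and
    metric. `period` is in seconds and is an assumed default.
    """
    resources = {}
    for function_name, metrics in thresholds.items():
        for metric_name, threshold in metrics.items():
            # Terraform resource names may not contain dashes by convention
            key = f'{function_name}_{metric_name}'.lower().replace('-', '_')
            resources[key] = {
                'alarm_name': f'{function_name}-{metric_name.lower()}',
                'namespace': 'AWS/Lambda',
                'metric_name': metric_name,
                'dimensions': {'FunctionName': function_name},
                'statistic': 'Sum',
                'period': period,
                'evaluation_periods': 1,
                'comparison_operator': 'GreaterThanThreshold',
                'threshold': threshold,
                # No datapoints means no invocations, which is not a failure
                'treat_missing_data': 'notBreaching',
            }
    return {'resource': {'aws_cloudwatch_metric_alarm': resources}}


if __name__ == '__main__':
    print(json.dumps(lambda_alarms(THRESHOLDS), indent=4))
```

With a threshold of 0 and `GreaterThanThreshold`, a single error within a period is enough to put the alarm into the `ALARM` state, which matches the "should never fail" case.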

To be demoed as part of 4997.
