Create a list of noisy errors/alarms in #notification-staging-ops

jimleroyer commented 1 year ago

Description

As a GCNotify developer, I want to know if my work will cause problems in the staging environment, so that I can resolve them ahead of time. But because there is so much alarm noise in staging, it's difficult to tell.

WHY are we building?

We need to discern real issues from noise when building our features.

WHAT are we building?

Create a list of noisy alarms and errors in the notification-staging-ops channel so that we can create future cards to address these issues.

VALUE created by our solution

Solve issues before they hit production, increase team velocity.

Acceptance Criteria

[ ] List of noisy alarms and errors is created
[ ] Cards are generated to investigate/fix these issues

QA Steps

[ ] Review the list to ensure that it encompasses all of the messages we see in Slack.

sastels commented 1 year ago

A few over the past week:

A lot of: New Relic Error anomaly detection - Lambda API (High), logs-1-error-1-minute-warning-lambda-api, Error percentage - admin (High)

Warnings going from OK to INSUFFICIENT_DATA: ses-complaint-rate-warning, ses-complaint-rate-critical, ses-bounce-rate-warning, ses-bounce-rate-critical

logs-10-celery-error-1-minute-critical (especially Friday June 2 - were we testing that big template or something?)

logs-1-500-error-1-minute-warning

sastels commented 1 year ago

Past 2 weeks of AWS alarms: https://docs.google.com/spreadsheets/d/1bQH8p_hSh89vqGfC-_gJsxZlmbzjGEc4Em-ooeWc0r0/edit#gid=0
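One possible way to turn an export like this into a ranked list of noisy alarms is to count how often each alarm name appears. The sketch below is only an illustration: it assumes the sheet is exported as CSV with a column named `alarm_name`, which is a hypothetical column name and may not match the actual spreadsheet, and the function names are made up for this example.

```python
import csv
from collections import Counter
from typing import Iterable


def count_alarms(rows: Iterable[dict], name_column: str = "alarm_name") -> list[tuple[str, int]]:
    """Return (alarm name, occurrences) pairs, noisiest first.

    `name_column` is an assumed column name; adjust it to the real export.
    Rows with a missing or empty alarm name are skipped.
    """
    counts = Counter(
        row[name_column].strip() for row in rows if row.get(name_column)
    )
    return counts.most_common()


def count_alarms_in_csv(lines: Iterable[str], name_column: str = "alarm_name") -> list[tuple[str, int]]:
    """Same as count_alarms, but reads CSV lines (e.g. an open file object)."""
    return count_alarms(csv.DictReader(lines), name_column)
```

Calling `count_alarms_in_csv(open("alarms.csv", newline=""))` on a local export (the file name is again hypothetical) would give a list that could be pasted into the ticket, with the most frequent alarms at the top as candidates for follow-up cards.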
