Refactor alerts to distinguish between infrastructure failure, capacity, or service limits

cds-snc / notification-api

GC Notify API | GC Notification API

https://notification.canada.ca

MIT License

51 stars 18 forks source link

Refactor alerts to distinguish between infrastructure failure, capacity, or service limits #1458

Open mohdnr opened 2 years ago

mohdnr commented 2 years ago

Candidates for refactoring:

[ ] logs-10-celery-error-1-minute-critical: This is currently tracking "?\"ERROR/Worker\" ?\"ERROR/ForkPoolWorker\" ?\"WorkerLostError\"" found in cloudwatch eks-cluster/application logs. This is too generic. Identify a way to distinguish between intentional thrown errors (message limits) and legitimate celery failures.

jimleroyer commented 2 years ago

I'd like to know more what you have in mind but I am unsure that these described filters are problematic or too generic, as they don't target infra, nor capacity or service limits but tells us when a worker fails. For example, if an email could not be sent due to a job failure.

mohdnr commented 2 years ago

@jimleroyer the alert generated by this alarm was deemed to be caused by a service exceeding their limit. I was curious to investigate the relationship between that happening and this alarm triggering.

jimleroyer commented 2 years ago

Oh true, I reviewed some of the alerts of past week and I realized the service rate limit did indeed triggered from a celery worker alert. Let's refine it so at least the rate limit goes into their own alert (warning level probably) and not trigger a Celery worker alert.

patheard commented 2 years ago

Another one that could be filtered out is the psycopg2.errors.UniqueViolation error since a small number of those are expected.

Here’s what it would look like to stop psycopg2.errors.UniqueViolation errors from being counted by the celery-error filter:

[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*") && err!="*psycopg2.errors.UniqueViolation*"]

You could use this same pattern to tune out any Celery errors that are expected in low quantities.

jimleroyer commented 2 years ago

@patheard We might want to keep an alarm on high volume of unique violation / duplicate key error. So we could have an alarm filtering these out completely as you suggested and a dedicated one for these with a proper threshold on a period of time.