Open mohdnr opened 2 years ago
I'd like to know more about what you have in mind, but I'm not sure these filters are problematic or too generic: they don't target infrastructure, capacity, or service limits, they tell us when a worker fails. For example, when an email could not be sent due to a job failure.
@jimleroyer the alert generated by this alarm was deemed to be caused by a service exceeding its limit. I was curious to investigate the relationship between that happening and this alarm triggering.
Oh true, I reviewed some of the alerts from the past week and realized that a service rate limit did indeed trigger a Celery worker alert. Let's refine this so that, at a minimum, rate limits go into their own alert (probably at warning level) and don't trigger a Celery worker alert.
Another error that could be filtered out is psycopg2.errors.UniqueViolation, since a small number of those are expected.
Here’s what it would look like to stop psycopg2.errors.UniqueViolation errors from being counted by the celery-error filter:
[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*") && err!="*psycopg2.errors.UniqueViolation*"]
You could use this same pattern to tune out any Celery errors that are expected in low quantities.
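Before changing the live filter, a pattern like this can be checked with the CloudWatch Logs TestMetricFilter API. A minimal boto3 sketch, where the sample log lines are illustrative stand-ins for real worker output rather than captured logs:

```python
import boto3

logs = boto3.client("logs")

# Refined pattern from above: count worker errors, but skip UniqueViolation.
pattern = (
    '[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*")'
    ' && err!="*psycopg2.errors.UniqueViolation*"]'
)

# Illustrative messages: one generic worker failure, one duplicate-key error.
samples = [
    "ERROR/ForkPoolWorker-1 Task deliver_email raised unexpected exception",
    "ERROR/ForkPoolWorker-1 psycopg2.errors.UniqueViolation: duplicate key value",
]

# Inspect which of the sample messages the candidate pattern matches.
response = logs.test_metric_filter(filterPattern=pattern, logEventMessages=samples)
for match in response["matches"]:
    print(match["eventNumber"], match["eventMessage"])
```

TestMetricFilter only takes the pattern and sample messages, so it can be run without touching the existing filter.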
@patheard We might want to keep an alarm on a high volume of unique violation / duplicate key errors. So we could have one alarm that filters these out completely, as you suggested, and a dedicated alarm for them with a proper threshold over a period of time.
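One way to get that dedicated alarm would be a separate metric filter that only counts the UniqueViolation errors, with a warning-level alarm on the resulting metric. A rough boto3 sketch; the log group name comes from this thread, while the filter name, namespace, threshold, and period are placeholders to be tuned:

```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

LOG_GROUP = "eks-cluster/application"  # log group mentioned in this thread

# Count only the duplicate-key errors into their own metric.
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="celery-unique-violation",  # placeholder name
    filterPattern='"psycopg2.errors.UniqueViolation"',
    metricTransformations=[
        {
            "metricName": "celery-unique-violation",  # placeholder name
            "metricNamespace": "Notify",              # placeholder namespace
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)

# Warn only when the volume is unusually high over a period of time;
# the threshold and period are placeholders, not agreed values.
cloudwatch.put_metric_alarm(
    AlarmName="celery-unique-violation-warning",
    MetricName="celery-unique-violation",
    Namespace="Notify",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```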
Candidates for refactoring:
"?\"ERROR/Worker\" ?\"ERROR/ForkPoolWorker\" ?\"WorkerLostError\""
found in the CloudWatch eks-cluster/application log group. This is too generic. Identify a way to distinguish between intentionally thrown errors (message limits) and legitimate Celery failures.
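One possible direction, building on the bracket-style pattern shown earlier: keep counting genuine worker failures but exclude the markers of intentionally raised errors. The "*TooManyRequestsError*" term below is a hypothetical placeholder, since the exact exception text logged for exceeded message limits would need to be confirmed from the logs; the metric namespace is also an assumption:

```python
import boto3

logs = boto3.client("logs")

# Candidate refactor of the generic pattern: still match worker failures,
# but exclude expected errors. "*TooManyRequestsError*" stands in for
# whatever marker the intentional message-limit errors actually log.
pattern = (
    '[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*")'
    ' && err!="*psycopg2.errors.UniqueViolation*"'
    ' && err!="*TooManyRequestsError*"]'
)

logs.put_metric_filter(
    logGroupName="eks-cluster/application",  # log group from this thread
    filterName="celery-error",               # filter named earlier in the thread
    filterPattern=pattern,
    metricTransformations=[
        {
            "metricName": "celery-error",    # assumed existing metric name
            "metricNamespace": "Notify",     # placeholder namespace
            "metricValue": "1",
        }
    ],
)
```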