cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Investigate New Relic Alarms in staging-ops channel #129

Closed: jimleroyer closed this issue 11 months ago

jimleroyer commented 1 year ago

Description

As a developer/operator of GC Notify, I would like to be alerted only when there are actual issues with our system, and not by false alarms, so that I do not get alert fatigue and can quickly identify real errors.

This card covers the New Relic alarms that fire regularly in the #notification-staging-ops channel.

WHY are we building?

We are receiving a lot of noise in our operations Slack channel that is not indicative of actual issues.

WHAT are we building?

Investigate the Redis alarms and determine whether they can be fixed or whether the alarms need adjustment

VALUE created by our solution

Fewer false alarms will increase developer agility and improve our response time to actual issues.

Acceptance Criteria

QA Steps

jimleroyer commented 1 year ago

I aligned the alarms in production with the ones in staging. The latter were more sensitive and didn't benefit from the recent changes we made in production.

jimleroyer commented 1 year ago

Jimmy to see if we can silence the AuthError and MethodNotAllowed errors, either by transforming these into info-level logs or by silencing them directly in New Relic.
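If we go the New Relic route, the filter would essentially be an exclusion on those error classes in the alert condition's query. A sketch only (the entity GUID is the one used in the queries further down), and essentially what the reworked queries below end up doing:

SELECT count(*) FROM AwsLambdaInvocationError WHERE `entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw' AND `error.class` NOT IN ('app.authentication.auth:AuthError', 'werkzeug.exceptions:MethodNotAllowed')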

jimleroyer commented 1 year ago

Jimmy to change the alarm in New Relic directly, after talking to the team.

jimleroyer commented 1 year ago

I am going with the following diagram for reworking the New Relic alarms related to error percentage. We previously had high- and low-sensitivity alarms, where the former excluded the errors covered by the low-sensitivity one and the latter included all errors.

I regrouped the categories into Unexpected Errors, API User Errors and Fuzzy Attack. The API User Errors and Fuzzy Attack alarms each target specific error classes and will only trigger on these, whereas Unexpected Errors targets all other errors not covered by the other two categories. Hence two categories should not normally trigger at the same time, and the alarm names should tell us what is going on.

[image: diagram of the reworked alarm categories]
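To sanity-check that the three categories cover every error class we actually see, a facet query along these lines can be run first (a sketch only; same entity GUID as the alarm queries below):

SELECT count(*) FROM AwsLambdaInvocationError WHERE `entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw' FACET `error.class` SINCE 1 week ago LIMIT 50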

I've also replicated the same setup with the error count alarms, which are anomaly based rather than threshold based like the percentage alarms above. That means we have two different means of reporting on these errors: anomalies and thresholds. I am not sure if we want to keep both mechanisms; I guess we wanted to try out what works best.
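To compare the two mechanisms, a timeseries view of the raw counts makes it easier to eyeball whether a fixed threshold or an anomaly baseline fits a given category better (again a sketch, using the same entity GUID as the alarm queries):

SELECT count(*) FROM AwsLambdaInvocationError WHERE `entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw' TIMESERIES 1 hour SINCE 1 week ago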

I've renamed the anomaly-detection alarms to include the word "anomaly" in the title so we can keep better track of these in the ops channel in the future.

Overall, we end up with these redefined alarms, along with their associated queries:

[Lambda API] Error percentage (Unexpected Errors)
SELECT percentage(count(*), WHERE `error.class` IS NOT null)*100 / percentage(count(*), WHERE duration IS NOT null) as 'Error rate (%); filtered' FROM AwsLambdaInvocation, AwsLambdaInvocationError WHERE `entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw' AND `error.class` NOT IN ('app.v2.errors:BadRequestError','jsonschema.exceptions:ValidationError', 'sqlalchemy.exc:NoResultFound', 'app.authentication.auth:AuthError') and error.message NOT LIKE '{\'result\': \'error\', \'message\': {\'password\': [\'Incorrect password\']}}'

[Lambda API] Error percentage (API User Errors)
SELECT percentage(count(*), WHERE `error.class` IS NOT null)*100 / percentage(count(*), WHERE duration IS NOT null) as 'Error rate (%); filtered' FROM AwsLambdaInvocation, AwsLambdaInvocationError WHERE `entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw' AND `error.class` IN ('jsonschema.exceptions:ValidationError', 'sqlalchemy.exc:NoResultFound')

[Lambda API] Error percentage (Fuzzy attack)
SELECT percentage(count(*), WHERE `error.class` IS NOT null)*100 / percentage(count(*), WHERE duration IS NOT null) as 'Error rate (%); filtered' FROM AwsLambdaInvocation, AwsLambdaInvocationError WHERE `entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw' AND `error.class` IN ('app.authentication.auth:AuthError', 'app.v2.errors:BadRequestError', 'werkzeug.exceptions:MethodNotAllowed')

[Lambda API] Errors count anomaly (Unexpected Errors)
SELECT count(*) FROM AwsLambdaInvocationError WHERE (`entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw') and error.class NOT IN ('app.v2.errors:BadRequestError','jsonschema.exceptions:ValidationError', 'sqlalchemy.exc:NoResultFound', 'app.authentication.auth:AuthError', 'werkzeug.exceptions:MethodNotAllowed') and error.message NOT LIKE '{\'result\': \'error\', \'message\': {\'password\': [\'Incorrect password\']}}'

[Lambda API] Errors count anomaly (API User Errors)
SELECT count(*) FROM AwsLambdaInvocationError WHERE (`entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw') AND `error.class` IN ('jsonschema.exceptions:ValidationError', 'sqlalchemy.exc:NoResultFound')

[Lambda API] Error count anomaly (Fuzzy attack)
SELECT count(*) FROM AwsLambdaInvocationError WHERE (`entityGuid`='MjY5MTk3NHxJTkZSQXxOQXwtNzgwNDUyNTc5NzAyODI1NTcyNw') AND `error.class` IN ('app.authentication.auth:AuthError', 'app.v2.errors:BadRequestError', 'werkzeug.exceptions:MethodNotAllowed')

I also reworked the standard deviation of the anomaly-based alarms to fit these categories, but it might require some tweaking.

I did not replicate this setup in production yet; I would like to test it out in staging first. But I think it would be better overall than what we have in production.

ben851 commented 1 year ago
sastels commented 1 year ago
jimleroyer commented 1 year ago

I moved the new rules from the staging environment to the production environment. We will wait a few days to see how sensitive the new rules are in production; they might require some further fine-tuning compared to staging.

sastels commented 1 year ago

Steve to QA (look at the rules in prod and staging)

sastels commented 1 year ago

Comparing New Relic production to staging:

same:

ben851 commented 12 months ago

We're watching this in production to ensure that the alerts are not too noisy. We will possibly adjust the database response time alert, as anomaly detection doesn't seem to be smart enough to take the time of day into account.
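One way to confirm the time-of-day pattern before retuning is to chart the metric hourly over a few days; a sketch only, assuming the signal comes from APM Transaction events and their databaseDuration attribute (which may not match the exact signal the alarm uses):

SELECT average(databaseDuration), percentile(databaseDuration, 95) FROM Transaction TIMESERIES 1 hour SINCE 3 days ago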

ben851 commented 12 months ago

Jimmy to adjust the alarm for database transaction time

jimleroyer commented 11 months ago

Modified the alarm around database transaction response time in production. This should be quieter in the following days. We'll see on Monday whether any event was triggered during the weekend.

jimleroyer commented 11 months ago

The database transaction response time alarm hasn't shown up in production since the last change. Moving to done.