[Observability] Alert on too few consumer hosts

edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx

GNU Affero General Public License v3.0

[Observability] Alert on too few consumer hosts #145

Closed: robrap closed this issue 1 year ago

robrap commented 1 year ago

If a single consumer stops consuming, we currently won't hear about it. Kafka may remove the consumer from its group, but the container could be hung and never replaced.

- [ ] ~~Add metric to New Relic for polling. (Requires wrapper in edx-django-utils.)~~
- [ ] ~~Replace lost signal alert to check that we always-ish have >=3 hosts that are polling.~~
- [x] Update runbook as needed

See comments for replacement idea using Datadog and Confluent Metrics for Consumer Lag.

Additional Notes:

- Should be able to do a unique count on hosts.
- Might need to fill data gaps with 0.
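
For reference, the unique-host count with zero-filled gaps could be sketched roughly as below. This is only an illustration of the idea in the notes above, not the New Relic query we would have written: the `(timestamp, host)` event shape, the function names, and the one-minute bucket size are all assumptions. The threshold of 3 comes from the struck-through checklist item.

```python
import math
from collections import defaultdict


def polling_hosts_per_bucket(events, start, end, bucket_seconds=60):
    """Count distinct polling hosts in each time bucket of [start, end).

    `events` is an iterable of (timestamp_seconds, host) pairs (hypothetical
    shape). Buckets with no events are reported as 0 instead of being
    skipped, so a total outage is visible as data rather than as a gap.
    """
    hosts = defaultdict(set)
    for ts, host in events:
        if start <= ts < end:
            hosts[int((ts - start) // bucket_seconds)].add(host)
    n_buckets = math.ceil((end - start) / bucket_seconds)
    # .get() avoids creating empty entries; missing buckets count as 0.
    return [len(hosts.get(i, ())) for i in range(n_buckets)]


def buckets_below_threshold(counts, min_hosts=3):
    """Return the indexes of buckets that had fewer than `min_hosts` hosts."""
    return [i for i, count in enumerate(counts) if count < min_hosts]


if __name__ == "__main__":
    sample = [(0, "a"), (10, "b"), (20, "c"), (30, "a"), (130, "a"), (140, "b")]
    counts = polling_hosts_per_bucket(sample, start=0, end=180)
    print(counts, buckets_below_threshold(counts))
```

Without the zero fill, the middle bucket in the sample would simply be absent, and an "average below N" style condition would never see it.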

robrap commented 1 year ago

Blocked waiting on a meeting. Before implementing additional metrics, I plan to meet with John Ford to see if we can get alerting from Datadog on the Confluent metrics instead. We could later replace this once New Relic can get the same Confluent metrics.

robrap commented 1 year ago

Met with John. Set up an alert to Opsgenie and Slack. Still need to write the runbook and complete the work, but this is no longer blocked.

robrap commented 1 year ago

In place of the AC as detailed in this ticket, we set up an alert for Consumer Lag from Confluent Metric data that is being sent to Datadog. See the new Runbook here: https://2u-internal.atlassian.net/wiki/spaces/AT/pages/288424140/Event+Bus+alert+runbook+Datadog

Is this replacing "Event consumption below average for more than 30m (all topics)/(prod-course-catalog-info-changed) [P3]"? Do we want to kill that alert? Any others? See https://2u-internal.atlassian.net/wiki/spaces/AT/pages/234258433/Event+Bus+alert+runbook+New+Relic

Reviewer: Please review runbook and the new plan for using Consumer Lag instead of implementing this workaround ticket in New Relic.

robrap commented 1 year ago

@timmc-edx: See my last comment. Are you up for reviewing this?

rgraber commented 1 year ago

Shouldn't we get rid of the Consumer Latency alert from New Relic as well if we think this one is more accurate?

robrap commented 1 year ago

@rgraber: It's not that Consumer Latency isn't accurate; it is measuring something different. Consumer Lag is the number of messages that the consumer is behind. Consumer Latency is how long it takes for a message to be consumed. One could imagine a situation where these alerts fire together, but also situations where each fires alone.
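
To make the distinction concrete, here is a rough sketch of the two calculations. It is illustrative only and is not how Confluent or New Relic compute their metrics: the function names are hypothetical, treating a partition with no committed offset as offset 0 is a simplification, and measuring latency from produce time to consume time is an assumed reading of "how long it takes for a message to be consumed".

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Total number of messages the consumer group is behind.

    Both arguments map partition -> offset. A partition with no committed
    offset is treated as offset 0 (a simplification for this sketch).
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    )


def consumer_latency(produced_at, consumed_at):
    """Seconds between a message being produced and being consumed."""
    return consumed_at - produced_at


if __name__ == "__main__":
    # Lag is a count of messages; latency is a duration for one message.
    print(consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50}))
    print(consumer_latency(produced_at=1000.0, consumed_at=1002.5))
```

Lag is derived from offsets alone, so it can be measured even when nothing is being consumed. Latency as sketched here needs a consumed message to produce a data point.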

robrap commented 1 year ago

I learned that we are still missing an alert (sort of) for the case where there are no consumers in a group, but we'll rely on the consumption lost signal alert for that special case.

robrap commented 1 year ago

Closing this ticket. To avoid custom polling, we went with Consumer Lag from Datadog, and we left the no-signal alert in place for the rare case where we lose all consumers. The runbook has been consolidated and updated: https://2u-internal.atlassian.net/wiki/spaces/AT/pages/234258433/Event+Bus+alert+runbook