Closed robrap closed 1 year ago
Blocked waiting on a meeting. Before implementing additional metrics, I plan to meet with John Ford to see if we can get alerting from Datadog on the Confluent metrics instead. We could later replace this once New Relic can get the same Confluent metrics.
Met with John. Set up alert to Opsgenie and Slack. Need to write runbook and complete work, but it is no longer blocked.
In place of the AC as detailed in this ticket, we set up an alert for Consumer Lag from Confluent Metric data that is being sent to Datadog. See the new Runbook here: https://2u-internal.atlassian.net/wiki/spaces/AT/pages/288424140/Event+Bus+alert+runbook+Datadog
Is this replacing "Event consumption below average for more than 30m (all topics)/(prod-course-catalog-info-changed) [P3]"? Do we want to kill that alert? Any others? See https://2u-internal.atlassian.net/wiki/spaces/AT/pages/234258433/Event+Bus+alert+runbook+New+Relic
Reviewer: Please review runbook and the new plan for using Consumer Lag instead of implementing this workaround ticket in New Relic.
@timmc-edx: See my last comment. Are you up for reviewing this?
Shouldn't we get rid of the Consumer Latency alert from New Relic as well if we think this one is more accurate?
@rgraber: It's not that Consumer Latency isn't accurate, it is just that it is measuring something different. Consumer Lag is the number of messages that the consumer is behind. Consumer Latency is how long it takes for a message to be consumed. One could imagine a situation where these alerts fire together, but also where each fires alone.
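To make the distinction concrete, here is a minimal sketch of how the two quantities differ. It is illustrative only: `PartitionState`, `consumer_lag`, and `consumer_latency_seconds` are hypothetical names, not the Confluent, Datadog, or New Relic metric definitions, and the offsets and timestamps are made up.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition for one consumer group."""
    log_end_offset: int     # offset the next produced message will get
    committed_offset: int   # next offset the consumer group will read


def consumer_lag(partitions):
    """Consumer Lag: how many messages the group is behind, summed over partitions."""
    return sum(max(p.log_end_offset - p.committed_offset, 0) for p in partitions)


def consumer_latency_seconds(produced_at, consumed_at):
    """Consumer Latency: how long one message waited between produce and consume."""
    return consumed_at - produced_at


# Lag is a count of messages and can be read even if nothing is being consumed.
partitions = [PartitionState(120, 100), PartitionState(50, 50)]
lag = consumer_lag(partitions)

# Latency is a duration and only exists for messages that were actually consumed.
latency = consumer_latency_seconds(produced_at=10.0, consumed_at=12.5)
```

Under these assumptions, a fully stalled consumer would show growing lag while producing no latency measurements at all, which is one way the two alerts can fire independently.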
I learned that we are still missing an alert (sort of) for the case where there are no consumers in a group, but we'll rely on the consumption lost signal alert for that special case.
Closing this ticket. To avoid custom polling, we went with Consumer Lag from Datadog, and we left in the no-signal alert for the rare case where we lose all consumers. The runbook has been consolidated and updated: https://2u-internal.atlassian.net/wiki/spaces/AT/pages/234258433/Event+Bus+alert+runbook
If a single consumer is no longer consuming, we currently won't hear about it. It is possible that Kafka will boot the consumer, but the container might be hung and not replaced.
Add metric to New Relic for polling. (Requires wrapper in edx-django-utils.) Replace lost signal alert to check that we always-ish have >=3 hosts that are polling. See comments for replacement idea using Datadog and Confluent Metrics for Consumer Lag.
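For reference, the ">=3 hosts that are polling" check could be sketched roughly as follows. This is only an illustration of the idea, not the wrapper proposed for edx-django-utils or a New Relic query: the function names, the `last_poll_by_host` mapping, and the 300-second window are all assumptions.

```python
def polling_host_count(last_poll_by_host, now, window_seconds):
    """Count hosts whose most recent poll falls within the window."""
    return sum(
        1 for last_poll in last_poll_by_host.values()
        if now - last_poll <= window_seconds
    )


def enough_hosts_polling(last_poll_by_host, now, window_seconds=300.0, min_hosts=3):
    """True when at least min_hosts hosts polled recently (3 comes from the ticket)."""
    return polling_host_count(last_poll_by_host, now, window_seconds) >= min_hosts


# Hypothetical data: host name -> timestamp (seconds) of its last poll.
last_polls = {"host-a": 990.0, "host-b": 995.0, "host-c": 400.0}
healthy = enough_hosts_polling(last_polls, now=1000.0)
```

Here `host-c` last polled 600 seconds ago, so only two hosts count as polling and the check would fail, which is the single-hung-consumer case described above.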
Additional Notes: