Alerting subscribers of failed report uploads or aggregations (including rate limit excession)

tgeoghegan commented 1 year ago

Report uploads or aggregations could fail due to misconfiguration of clients or a peer DAP aggregator. For example, uploads could fail because the wrong HPKE configuration was used, or report aggregations could fail if clients were configured with the wrong number of histogram buckets. These are scenarios that have to be remedied by someone external to Divvi Up.

To help subscribers resolve this kind of problem, Divvi Up's control and data planes should detect high rates of upload or aggregation failure and somehow inform the subscriber of it. For instance, emailing the subscriber and putting a warning banner on the Divvi Up website when the subscriber logs in or when they navigate to a particular task's page.

tgeoghegan commented 1 year ago

Off the top of my head, there's two ways we could address this:

Task metrics

The current task metrics setup wouldn't cover this, because it only reports counts of successful report aggregations. If a report upload fails, Janus will not write anything to its database and there will be no trace of the failed upload. However we could start recording in the Janus database a counter of failed uploads and then augment the task metrics query so that it would report this value, as well as the currently counts of client_reports and report_aggregations rows. Then, when the control plane fetches task metrics, it could note unusually high counts of failures and notify the subscriber.

Prometheus metrics

If it doesn't already, Janus should be emitting Prometheus metrics for failed uploads and report aggregations. We can then write alerts on those metrics in prometheus-alertmanager, and finally we could teach the control plane to check for active alerts and then notify the subscriber. Prometheus metrics won't have perfect fidelity, but they should suffice to detect statistically significant numbers of failures.

divergentdave commented 1 year ago

I don't think we currently label any Prometheus metrics with the task ID, in part to keep the cardinality down. I'd prefer to not rely on our observability stack for this functionality. Per-task counters in Postgres (or possibly Redis, in the future) should be sufficient.

tgeoghegan commented 1 year ago

Metrics cardinality is an excellent point. And since we already have a mechanism to convey per-task metrics from Janus to divviup-api, adding more data into it seems reasonable. I wrote up https://github.com/divviup/janus/issues/1460 for the Janus side of this.

tgeoghegan commented 7 months ago

Concretely this issue has discussed alerting subscribers if reports get dropped by Janus, but we also should consider the case of uploads or other DAP API requests being dropped due to rate limit excession. There isn't yet a way for divviup-api to find out about that happening (we might need to write a Caddy handler that could emit such a signal). While the work might not happen exclusively in this repository, we'll use this issue to track building that solution.

divviup / divviup-api

Alerting subscribers of failed report uploads or aggregations (including rate limit excession) #172

Task metrics

Prometheus metrics