divviup / janus

Experimental implementation of the Distributed Aggregation Protocol (DAP) specification.
Mozilla Public License 2.0

Expose report error counts in app Task metrics #2293

Closed moztcampbell closed 7 months ago

moztcampbell commented 11 months ago

[Feature request] For a given task, can the DivviUp task metrics show a count of requests to the /task/XXXX/reports endpoint that were rejected and therefore not included in the report count? This would primarily be for rough detection of configuration issues, among other problems.

At a minimum, this should include reports rejected due to a malformed payload (typically caused by a protocol or task configuration mismatch). Errors such as a wrong HTTP method or MIME type could be included or excluded based on whichever is simplest to implement, since these are less likely to be the issues we encounter.

jbr commented 11 months ago

This is something we've discussed. I've transferred the issue to janus, which would need to track the statistics for display, as divviup-api does not see reports directly.

inahga commented 10 months ago

Looking at our current report upload handling logic, there are several cases in which we return success to the client:

  1. Public share decoding failed
  2. Report encrypted with outdated or unknown HPKE config
  3. Report decryption failed
  4. Leader input share decoding failed

1 and 4 could be grouped together as their causes are similar (VDAF disagreement, bugs in client-side code).
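The grouping above could be sketched as an enum mapping each upload outcome to the counter it should increment. This is a hypothetical illustration; the names here are not Janus's actual types, and the column names assume the schema proposed below.

```rust
// Hypothetical sketch: mapping the upload outcomes above onto per-task
// counter columns. Cases 1 and 4 (public share / leader input share
// decoding failures) share one counter, as their causes are similar.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UploadOutcome {
    ReportDecodeFailure,  // cases 1 and 4: decoding failed
    ReportUnknownKey,     // case 2: outdated or unknown HPKE config
    ReportDecryptFailure, // case 3: decryption failed
    Success,
}

impl UploadOutcome {
    /// Column in the (hypothetical) task_upload_counters table to increment.
    fn counter_column(self) -> &'static str {
        match self {
            Self::ReportDecodeFailure => "report_decode_failure",
            Self::ReportUnknownKey => "report_unknown_key",
            Self::ReportDecryptFailure => "report_decrypt_failure",
            Self::Success => "report_success",
        }
    }
}
```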

I suggest we create monotonic counters in the database per task. Roughly this schema:

CREATE TABLE task_upload_counters (
    task_id                BYTEA UNIQUE NOT NULL, -- 32-byte TaskID as defined by the DAP specification
    report_decode_failure  BIGINT,
    report_decrypt_failure BIGINT,
    report_unknown_key     BIGINT,
    report_success         BIGINT,

    CONSTRAINT fk_task_id FOREIGN KEY(task_id) REFERENCES tasks(id) ON DELETE CASCADE
);

Rationale for a separate table: it prevents UPDATEs from becoming too costly, since (IIUC) an UPDATE has a cost similar to an INSERT, and the tasks table is large.

Then, we extend the ReportWriteBatcher to also batch writes to this table, depending on the error (or lack thereof) encountered by the report upload handler. We will incur an additional database write in situations where we currently short-circuit, but the queries should be low-cost and fit in with the existing query batching logic. We will also need to surface these counters through the aggregator API, likely on the existing task metrics endpoint.
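As a rough sketch of the batching idea, a write batcher could coalesce per-(task, column) deltas in memory and flush them as one batch of UPDATE statements inside the batch-write transaction. This is a hypothetical model, not Janus's actual ReportWriteBatcher; `TaskId` is simplified to a bare byte array.

```rust
use std::collections::HashMap;

// Hypothetical: 32-byte TaskID as defined by the DAP specification.
type TaskId = [u8; 32];

#[derive(Default)]
struct CounterBatch {
    // Accumulated increments, keyed by (task, counter column).
    deltas: HashMap<(TaskId, &'static str), u64>,
}

impl CounterBatch {
    /// Record one upload outcome; repeated outcomes coalesce into one delta.
    fn record(&mut self, task_id: TaskId, column: &'static str) {
        *self.deltas.entry((task_id, column)).or_insert(0) += 1;
    }

    /// Render one UPDATE per (task, column). A real implementation would
    /// execute these within the existing batch-write transaction rather
    /// than building strings.
    fn statements(&self) -> Vec<String> {
        self.deltas
            .iter()
            .map(|((task_id, column), delta)| {
                let hex: String = task_id.iter().map(|b| format!("{b:02x}")).collect();
                format!(
                    "UPDATE task_upload_counters SET {column} = {column} + {delta} \
                     WHERE task_id = '\\x{hex}'"
                )
            })
            .collect()
    }
}
```

Coalescing in memory means a batch containing many reports for the same task still costs only one UPDATE per counter column.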

These counters are not subject to garbage collection.

We need to be conscious of deadlocks when incrementing counters, see https://blog.pjam.me/posts/atomic-operations-in-sql/. When committing a batch, we should sort the counter increment statements by task ID.
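The lock-ordering rule can be sketched minimally: before a batch commits, sort its counter increments so any two transactions touching the same rows acquire row locks in the same order, which prevents the lock-cycle that causes a deadlock. Types here are illustrative, not Janus's.

```rust
// Hypothetical: 32-byte TaskID as defined by the DAP specification.
type TaskId = [u8; 32];

struct Increment {
    task_id: TaskId,
    column: &'static str,
}

/// Sort increments into a deterministic total order before issuing them.
/// Byte-wise comparison of the task ID orders the rows; ties (same task,
/// different columns) are broken by column name.
fn order_for_commit(mut increments: Vec<Increment>) -> Vec<Increment> {
    increments.sort_by(|a, b| {
        a.task_id
            .cmp(&b.task_id)
            .then_with(|| a.column.cmp(b.column))
    });
    increments
}
```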

The follow up work in divviup-api would be to expose these counters, and preferably provide help text as to the meaning of these counters (I can assist with the copywriting for this).

~It will be up to the subscriber to periodically scrape these counters and derive meaning from them (i.e. observe increases). I think we can facilitate this by exporting a prometheus-like interface in divviup-api for hooking into an existing subscriber's metrics setup, but I'll table that discussion for now as out-of-scope.~ Nope, see https://github.com/divviup/janus/issues/2293#issuecomment-1836580027

branlwyd commented 10 months ago

This looks good. I think we should include the ability to shard writes to these rows to reduce write contention -- otherwise, with this change, if two batches writing at the same time each include an update to a given task's upload counters, the batch-write transactions will contend and one will have to retry.

Specifically, instead of having one row in the new table per task, we'd have a configurable number of rows. Incrementing a counter would be picking a random row and incrementing it. Reading a counter would be reading all of the rows for the task & summing them.

See the batch_aggregations table for an existing implementation of this idea: each row is logically keyed by (task_id, batch_identifier, aggregation_param), but we include an artificial ord field in the key that allows us to generate multiple rows per task/batch/agg param.

N.B. normally, upload errors will probably be pretty rare. We might start with only one shard; I still think addressing write contention is important since introducing shards later would be a breaking change. Also, since the counter increments would be part of the report batch write (which includes writes for arbitrary tasks), I think it's important to mitigate any performance issues to maintain isolation between tasks from different subscribers.

inahga commented 10 months ago

Sharding is a good idea. Note that this proposal also tracks the count of successful reports in a report_success column (sorry, this was not clearly stated in the above comment). So I think more than one shard is justified.

Or perhaps we don't provide that information? I think it's useful: it provides proportionality for the accompanying error counts, and it may be more useful for assessing the rate of reports than the current counters, which behave more like gauges given that they're subject to GC.

branlwyd commented 10 months ago

Ah, yep, let's definitely shard in that case -- my quick readthrough didn't notice the report_success column. Nice, I think this would serve to produce the "monotonic counter of successful report uploads" we'll require for billing. (edit: that is, this table would provide the desired semantics for the existing reports-uploaded metric we expose)

inahga commented 10 months ago

Summary of some side-channel discussion:

inahga commented 9 months ago

A note on scope: this only captures metrics from the perspective of report uploading. There might be interesting insights available from user-facing metrics on aggregation and collection jobs, but that would be out of scope for this issue.

inahga commented 7 months ago

This feature is complete. FYI @moztcampbell. I've also plumbed it through to divviup-api, so the metrics should be visible on task pages.