Closed: moztcampbell closed this issue 7 months ago
This is something we've discussed. I've transferred the issue to janus, which would need to track the statistics for display, as divviup-api does not see reports directly
Looking at our current report upload handling logic, there are several cases in which we return success to the client:
1 and 4 could be grouped together as their causes are similar (VDAF disagreement, bugs in client-side code).
I suggest we create monotonic counters in the database per task. Roughly this schema:
```sql
CREATE TABLE task_upload_counters (
    task_id BYTEA UNIQUE NOT NULL, -- 32-byte TaskID as defined by the DAP specification
    report_decode_failure BIGINT,
    report_decrypt_failure BIGINT,
    report_unknown_key BIGINT,
    report_success BIGINT,

    CONSTRAINT fk_task_id FOREIGN KEY(task_id) REFERENCES tasks(id) ON DELETE CASCADE
);
```
Rationale for a separate table: it prevents the UPDATEs from being too costly, since IIUC an UPDATE has a similar cost to an INSERT, and the tasks table is large.
Then, we extend the ReportWriteBatcher to also batch writes to this table, depending on the error (or lack thereof) encountered by the report upload handler. We will incur an additional database write for situations where we originally short-circuit, but the queries should be low-cost and fit in with the existing query batching logic. We will also need to carry up these counters through the aggregator API, likely on the existing task metrics endpoint.
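To illustrate the batching idea, here is a toy Python sketch (not Janus code; the outcome labels and function name are hypothetical) showing how a batch of per-report upload outcomes could be coalesced into at most one increment per task and counter column, so the batcher issues a handful of cheap UPDATEs rather than one per report:

```python
from collections import Counter, defaultdict

# Hypothetical mapping from upload-handler outcomes to the counter
# columns in the proposed task_upload_counters table.
OUTCOME_TO_COLUMN = {
    "decode_failure": "report_decode_failure",
    "decrypt_failure": "report_decrypt_failure",
    "unknown_key": "report_unknown_key",
    "success": "report_success",
}

def coalesce_counter_updates(batch):
    """Collapse a batch of (task_id, outcome) pairs into one increment
    per (task_id, column), ready to be folded into the batched write."""
    updates = defaultdict(Counter)
    for task_id, outcome in batch:
        updates[task_id][OUTCOME_TO_COLUMN[outcome]] += 1
    return updates
```

The real implementation would live inside the ReportWriteBatcher's existing transaction, but the shape of the aggregation step should be similar.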
These counters are not subject to garbage collection.
We need to be conscious of deadlocks when incrementing counters; see https://blog.pjam.me/posts/atomic-operations-in-sql/. When committing a batch, we should sort the counter increment statements by task ID, so that concurrent transactions acquire row locks in a consistent order.
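A minimal sketch of the ordering step (Python, with a hypothetical function name; task IDs and counts stand in for the real batched increments). Because every transaction emits its UPDATEs in ascending task-ID order, two concurrent batches always lock rows in the same order and cannot deadlock against each other:

```python
def ordered_increment_statements(updates):
    """Given {task_id: {column: increment}}, emit one UPDATE per task
    in ascending task-ID order to enforce a consistent lock order."""
    statements = []
    for task_id in sorted(updates):
        sets = ", ".join(
            f"{col} = {col} + {n}"
            for col, n in sorted(updates[task_id].items())
        )
        statements.append(
            f"UPDATE task_upload_counters SET {sets} "
            f"WHERE task_id = {task_id!r}"
        )
    return statements
```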
The follow-up work in divviup-api would be to expose these counters, and preferably to provide help text explaining what they mean (I can assist with the copywriting for this).
~It will be up to the subscriber to periodically scrape these counters and derive meaning from them (i.e. observe increases). I think we can facilitate this by exporting a prometheus-like interface in divviup-api for hooking into an existing subscriber's metrics setup, but I'll table that discussion for now as out-of-scope.~ Nope, see https://github.com/divviup/janus/issues/2293#issuecomment-1836580027
This looks good. I think we should include the ability to shard writes to these rows to reduce write contention -- otherwise, with this change, if there are two batches writing at the same time that each include an update to a given task upload's counter, the batch-write transactions will contend and one will have to retry.
Specifically, instead of having one row in the new table per task, we'd have a configurable number of rows. Incrementing a counter would be picking a random row and incrementing it. Reading a counter would be reading all of the rows for the task & summing them.
See the `batch_aggregations` table for an existing implementation of this idea: each row is logically keyed by (task_id, batch_identifier, aggregation_param), but we include an artificial `ord` field in the key that allows us to generate multiple rows per task/batch/agg param.
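The write/read split described above can be modeled with a toy in-memory sketch (Python; class and field names are illustrative, not Janus's): each logical counter is spread across a configurable number of rows keyed by an `ord` field, an increment touches one randomly chosen shard, and a read sums all shards for the task:

```python
import random
from collections import defaultdict

class ShardedCounter:
    """Toy model of the sharded-row scheme: writes pick a random
    shard row to reduce contention; reads sum all shard rows."""

    def __init__(self, shards=8):
        self.shards = shards
        self.rows = defaultdict(int)  # (task_id, ord) -> count

    def increment(self, task_id, n=1):
        # Concurrent writers usually land on different rows, so their
        # transactions don't contend on the same row lock.
        ord_ = random.randrange(self.shards)
        self.rows[(task_id, ord_)] += n

    def read(self, task_id):
        return sum(self.rows[(task_id, o)] for o in range(self.shards))
```

In the database this corresponds to adding `ord` to the table's unique key and replacing the single-row UPDATE with an UPDATE against a randomly chosen `ord`, while the read path becomes a SUM over the task's rows.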
N.B. normally, upload errors will probably be pretty rare. We might start with only one shard; I still think addressing write contention is important since introducing shards later would be a breaking change. Also, since the counter increments would be part of the report batch write (which includes writes for arbitrary tasks), I think it's important to mitigate any performance issues to maintain isolation between tasks from different subscribers.
Sharding is a good idea. Note that this also proposes that we track the count of successful reports in a `report_success` column (sorry, this was not clearly stated in the above comment). So I think more than one shard is justified.
Or perhaps we don't provide that information? I think it's useful: it provides proportionality for the accompanying error counts, and it may be more useful for assessing the rate of reports than the current counters, which behave more like gauges given that they're subject to GC.
Ah, yep, let's definitely shard in that case -- my quick readthrough didn't notice the `report_success` column. Nice, I think this would serve to produce the "monotonic counter of successful report uploads" we'll require for billing. (edit: that is, this table would provide the desired semantics for the existing reports-uploaded metric we expose)
Summary of some side-channel discussion:
A note on scope: this only captures metrics from the perspective of report uploading. There might be interesting insights available from user-facing metrics on aggregation and collection jobs, but that would be out of scope for this issue.
This feature is complete. FYI @moztcampbell. I've also plumbed it through to divviup-api, so the metrics should be visible on task pages.
[Feature request] For a given task, can the DivviUp task metrics show a count of requests to the `/task/XXXX/reports` endpoint that were rejected and not included in the report count? This would be primarily for rough detection of configuration issues and more. At a minimum, include reports that result in a malformed payload (typically due to a protocol or task configuration mismatch). Errors around wrong HTTP method, MIME type, etc. could be included or excluded based on what is simplest to implement, since these are less likely to be the issues we encounter.