Collection job creation: batch size check counts client reports, rather than successful aggregations

branlwyd commented 1 year ago

Next, the Aggregator checks that batch contains a valid number of reports, as determined by the query type. If the size check fails, then the Aggregator MUST abort with error of type "invalidBatchSize".

This check is implemented in Janus here: https://github.com/divviup/janus/blob/eab56d09fb4273e1b24244745b5a5fd45c283098/aggregator/src/aggregator.rs#L1812-L1819

report_count comes from a query-type-specific function. In the time-interval case, this is a count on the client_reports table; but this count is incorrect, since it counts reports which have failed or not begun aggregation. In the fixed-size case, this is a count on the report_aggregations table; but this count is also incorrect, since it counts reports which have failed aggregation.

I suspect we want this count to be: look up all the relevant batch_aggregations rows based on the incoming collection identifier, and sum their report_counts. This would give an accurate lower bound on the count of reports which will be included in this collection.

branlwyd commented 1 year ago

Thoughts/caveats:

There are some downsides no matter how we define this check. If we don't accept collection requests until enough reports are successfully aggregated, issues in the aggregation process might delay a Collector's ability to issue a (successful) collection request, for reasons that are opaque to the Collector.
Fixed-size requests also have a max_batch_size, so determining a lower bound on the report count doesn't validate we fulfill the requirements. However, the fixed-size query type generates batches to the correct size with no ability for the Collector to control batch size, so this practically doesn't matter much.

Given these caveats (especially the first one), a better change might be to drop this check entirely from this section of the specification & instead perform a check once aggregation is complete, just before sending the AggregateShareReq to the helper -- at that point the Leader will know exactly how many reports are included in the batch & can accurately verify batch-size checks. But this would require a change in DAP.

branlwyd commented 1 year ago

Tim pointed out that if we don't accept collection requests until enough reports are successfully aggregated, VDAFs with nontrivial aggregation parameters (e.g. Poplar1) will be broken since aggregation jobs aren't created until a collection request is received for such VDAFs.

This causes me to lean more towards moving the Leader's batch-size check to "just before issuing an AggregateShareReq", which would allow the check to be precise (i.e. the exact report count is known) and not impede Poplar1.

branlwyd commented 1 month ago

This seems funky & has not been addressed, but the right approach requires more thought than a bug-scrub session allows for.

branlwyd commented 1 month ago

The requirement to check batch size at time of collection job creation was removed in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/484. IMO, we should remove this check altogether -- it is inaccurate, and the "real" check is in the collection job driver (which looks to the number of successfully-aggregated reports).

divviup / janus

Collection job creation: batch size check counts client reports, rather than successful aggregations #1414