Configure how many concurrent aggregation jobs are run per-peer-aggregator

tgeoghegan commented 8 months ago

Currently, Janus has configuration knobs that govern how often a Janus leader will schedule aggregation jobs, how often it will look for aggregation jobs to run, and how many aggregation jobs it will run at once. However, these settings are global to the entire aggregator. Each DAP task can be configured with a different peer aggregator URL, and different aggregators will have different scales and performance characteristics. Janus leaders need to be able to throttle how many requests they're making to a given helper so that smaller helpers won't get overwhelmed with traffic.

Currently, Janus doesn't really have a concept of another aggregator, beyond the peer_aggregator_endpoint in a task. So from Janus' perspective, this might mean setting per-task limits on aggregation job concurrency and rate. But we could also choose to introduce a peer aggregator concept to Janus, which would map more neatly to the object model in divviup-api (which at least helps us at Divvi Up if not necessarily everyone deploying Janus).
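
To make the peer aggregator idea concrete, here is a minimal sketch in Python (not Janus code) of a limiter that caps in-flight aggregation jobs per peer endpoint. The class name `PerPeerLimiter`, its methods, and the example URLs are all hypothetical; it assumes the peer is identified by the task's peer_aggregator_endpoint string and that a job driver would call `try_acquire` before stepping a job and `release` afterwards.

```python
import threading
from collections import defaultdict


class PerPeerLimiter:
    """Hypothetical helper: caps in-flight aggregation jobs per peer endpoint."""

    def __init__(self, default_limit, overrides=None):
        # default_limit applies to any peer without an explicit override.
        self._default_limit = default_limit
        # overrides maps a peer endpoint URL to its own concurrency limit.
        self._overrides = dict(overrides or {})
        self._in_flight = defaultdict(int)
        self._lock = threading.Lock()

    def limit_for(self, peer_endpoint):
        return self._overrides.get(peer_endpoint, self._default_limit)

    def try_acquire(self, peer_endpoint):
        """Reserve a slot for this peer; return False if it is at its limit."""
        with self._lock:
            if self._in_flight[peer_endpoint] >= self.limit_for(peer_endpoint):
                return False
            self._in_flight[peer_endpoint] += 1
            return True

    def release(self, peer_endpoint):
        """Give back a slot previously reserved with try_acquire."""
        with self._lock:
            if self._in_flight[peer_endpoint] <= 0:
                raise RuntimeError("release without matching acquire")
            self._in_flight[peer_endpoint] -= 1
```

A leader holding `PerPeerLimiter(default_limit=2, overrides={"https://small-helper.example/": 1})` would run at most one job at a time against the small helper while allowing two against any other peer. Keying on the endpoint is what a per-task setting cannot express on its own: several tasks that share a helper would share one budget.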

branlwyd commented 8 months ago

A related idea: we may even want to make this per-batch, i.e. make some attempt to limit the number of concurrently-running aggregation jobs per batch. This is because writing data back to a batch as part of handling an aggregation job is (currently) write-contentious; if we spread work across batches, we would be less likely to contend.
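
A rough sketch of what a per-batch cap could look like when choosing which jobs to run next, in Python rather than Janus's own code. The function `pick_jobs` and its parameters are hypothetical, and it assumes each aggregation job maps to exactly one batch, which is a simplification.

```python
from collections import Counter


def pick_jobs(candidates, running_per_batch, per_batch_limit, max_jobs):
    """Select up to max_jobs job IDs without exceeding per_batch_limit per batch.

    candidates: list of (job_id, batch_id) pairs in priority order.
    running_per_batch: mapping of batch_id to the number of jobs already running.
    """
    counts = Counter(running_per_batch)
    picked = []
    for job_id, batch_id in candidates:
        if len(picked) >= max_jobs:
            break
        if counts[batch_id] >= per_batch_limit:
            # This batch is saturated; skip it so work spreads to other batches.
            continue
        counts[batch_id] += 1
        picked.append(job_id)
    return picked
```

With one job already running against batch "A" and a per-batch limit of 2, only one more "A" job is picked and the remaining capacity goes to other batches, which is the spreading effect described above.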

inahga commented 8 months ago

This sounds like an inversion of a rate limiting model, where the leader aggregator is responsible for rate limiting itself to prevent overloading the helper.

Should we instead introduce rate limiting for the helper? It doesn't necessarily have to be code in Janus, just something that is easily deployable and tunable.

tgeoghegan commented 8 months ago

Divvi Up made a decision to implement rate limiting outside of Janus (instead doing so in our surrounding infrastructure), and I think that was the right call, because (1) it allows Janus to focus on what Janus does and (2) it enables deployments to choose something that suits their needs, such as a rate limiting story they already have. The downside, as you note, is that we don't get to assume that helpers have sensible rate limits in place. I think that if we compare the level of effort of implementing limits on how many requests we send on the leader side against designing, building and maintaining a one-size-fits-all rate limiting story, the former will be less work. Plus, even if we did commit to making a nice, canned rate limiting solution for helpers, we'd still have to account for helpers that can't or won't deploy that solution, which puts us back in the position of having to implement send-side limits.

tgeoghegan commented 8 months ago

Let me rephrase the above: the question is not "should we build rate limiting for everyone?" The question is, "can Janus assume that all helpers always have a rate limiter on them?" And I think the answer is no.

tgeoghegan commented 7 months ago

Assigning to @inahga as I recall you saying you wanted to dequeue this soon.

inahga commented 1 week ago

This is still necessary, but I'm not actively working on it right now.

divviup / janus

Configure how many concurrent aggregation jobs are run per-peer-aggregator #2482