divviup / janus

Experimental implementation of the Distributed Aggregation Protocol (DAP) specification.
Mozilla Public License 2.0
53 stars 15 forks source link

Optimize SQL queries #933

Closed tgeoghegan closed 1 year ago

tgeoghegan commented 1 year ago

@divergentdave did some performance testing and found that:

We should look into optimizing these queries.

https://isrg.slack.com/archives/C0167LT4C73/p1674490626272259

tgeoghegan commented 1 year ago

Previously: https://github.com/divviup/janus/issues/519

branlwyd commented 1 year ago

Conflicts aren't necessarily bad (they are an expected detail of the "serializable" transaction isolation level), but the 30% conflicts on helper transactions is definitely unexpected to me. The helper mainly performs reads & writes based on requests from the leader, which should be sending only one request for a given aggregation job/collection job at a time. If we are indeed receiving only one request per job at a time, we should investigate why the conflicts are occurring -- I would not expect requests for different jobs to inherently overlap. (Maybe something along the lines of a too-wide predicate lock?)

divergentdave commented 1 year ago

In the helper's database, statements from the following methods appear most frequently in a sample of conflict error messages.

The error message was almost always "could not serialize access due to read/write dependencies among transactions", though there were a small handful of "could not serialize access due to concurrent update". Some errors with "current transaction is aborted, commands ignored until end of transaction block" also showed up, and may skew the above numbers for individual statements. The detail field had the following reason code messages, in decreasing order of frequency.

I tried shutting off the helper's aggregation job creator, aggregation job driver, and collect job driver components. This had no effect on the aggregator's transaction error rate, confirming that we're just dealing with intra-component transaction conflicts.

The statements I saw in transaction conflict errors are clearly skewed towards those that get executed the most often, once per report rather than once per aggregation job. As conflicts are a decidedly non-local phenomenon, this leaves out a lot of information. I'd like to see a sample of conflict graphs that led to these errors, but I don't know how practical that is. Here are a few hypotheses of what could be causing elevated conflict errors.

I tried increasing the min_aggregation_job_size on the leader, to get bigger aggregation jobs at a lower rate, and that did improve the conflict rate significantly. I have 30 reports per second being uploaded. I increased min_aggregation_job_size from 1 to 100, and this dropped the rate of aggregation jobs from 1 per second to 0.3 per second. The leader was not negatively impacted by this, as job step time approximately doubled. The helper's transaction error rate dropped from 33% to 0%. It's hard to pin this effect on one mechanism, since it has an across-the-board effect on the work the helper is doing, but the first hypothesis certainly fits. The leader's aggregation job creator transaction error rate also benefited, this was 26% and rising before, and it dropped to 7% (too noisy to tell if it's still climbing yet).

For now, the load test's configuration could clearly use some tuning, and we've got some directions for further investigation.

branlwyd commented 1 year ago

This was effectively solved by #1037 & #1038.