Previously: https://github.com/divviup/janus/issues/519
Conflicts aren't necessarily bad (they are an expected detail of the "serializable" transaction isolation level), but the 30% conflict rate on helper transactions is definitely unexpected to me. The helper mainly performs reads & writes based on requests from the leader, which should be sending only one request for a given aggregation job/collection job at a time. If we are indeed receiving only one request per job at a time, we should investigate why the conflicts are occurring -- I would not expect requests for different jobs to inherently overlap. (Maybe something along the lines of a too-wide predicate lock?)
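As background for the "too-wide predicate lock" guess: under SERIALIZABLE, a read satisfied by a sequential scan takes an SIRead predicate lock on the entire relation, so two transactions that never touch the same row can still be reported as conflicting. A minimal sketch against a made-up two-row table (not Janus's schema) that produces this class of failure in two psql sessions:

```sql
-- Minimal sketch against a made-up two-row table; not Janus's schema.
CREATE TABLE jobs (id bigint PRIMARY KEY, state text);
INSERT INTO jobs VALUES (1, 'new'), (2, 'new');

-- Session A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM jobs WHERE state = 'new';  -- no index on state: seq scan,
                                                -- so the SIRead lock covers the whole relation

-- Session B (concurrently):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM jobs WHERE state = 'new';  -- same relation-wide SIRead lock

-- Session A:
UPDATE jobs SET state = 'running' WHERE id = 1;
COMMIT;

-- Session B writes a *different* row, but the relation-level read locks still
-- form a rw-dependency cycle, so this transaction fails (at the UPDATE or at
-- COMMIT, depending on timing) with:
--   ERROR: could not serialize access due to read/write dependencies among transactions
UPDATE jobs SET state = 'running' WHERE id = 2;
COMMIT;
```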
In the helper's database, statements from the following methods appear most frequently in a sample of conflict error messages:

- `put_report_aggregation()`, makes narrow lookups in `tasks`, `client_reports`, and `aggregation_jobs`, and writes a row to `report_aggregations`
- `get_aggregate_share_jobs_including_time()`, does a lookup by `bigint` equality and `tsrange` containment on `aggregate_share_jobs`, which we know gets suboptimal query plans (see the EXPLAIN sketch below this list)
- `put_report_share()`, makes a narrow lookup in `tasks`, and writes a row to `client_reports`
- `check_report_share_exists()`, makes narrow lookups on `tasks` and `client_reports`
- `update_aggregation_job()`, makes a narrow lookup on `tasks`, and updates a row in `aggregation_jobs`
- `COMMIT`
- `get_batch_aggregation()`, makes narrow lookups on `tasks` and `batch_aggregations`
- `update_batch_aggregation()`, makes a narrow lookup on `tasks`, and updates a row of `batch_aggregations`
- `update_report_aggregation()`, makes narrow lookups on `tasks`, `client_reports`, and `aggregation_jobs`, and updates a row of `report_aggregations`

The error message was almost always "could not serialize access due to read/write dependencies among transactions", though there were a small handful of "could not serialize access due to concurrent update". Some errors with "current transaction is aborted, commands ignored until end of transaction block" also showed up, and may skew the above numbers for individual statements. The detail field had the following reason code messages, in decreasing order of frequency.
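Separately, on the note above that the `aggregate_share_jobs` lookup gets suboptimal plans: a quick way to confirm is to run that query shape through EXPLAIN and check whether the `tsrange` containment is served by an index. This is only a sketch; the column names (`task_id`, `batch_interval`) are assumptions, not the actual Janus schema.

```sql
-- Column names are assumed (task_id, batch_interval); the real schema may differ.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM aggregate_share_jobs
WHERE task_id = 42                                         -- bigint equality
  AND batch_interval @> '2023-01-23 12:00:00'::timestamp;  -- tsrange containment
-- A plain btree index can satisfy the task_id equality but not the @> containment,
-- so the plan may filter many rows (or fall back to a wider scan), which also
-- means coarser SIRead predicate locks under SERIALIZABLE.
```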
I tried shutting off the helper's aggregation job creator, aggregation job driver, and collect job driver components. This had no effect on the aggregator's transaction error rate, confirming that we're just dealing with intra-component transaction conflicts.
The statements I saw in transaction conflict errors are clearly skewed towards those that get executed the most often, once per report rather than once per aggregation job. And since conflicts are a decidedly non-local phenomenon, looking at the failing statement alone leaves out a lot of information. I'd like to see a sample of the conflict graphs that led to these errors, but I don't know how practical that is. Here are a few hypotheses of what could be causing elevated conflict errors:
- `get_aggregate_share_jobs_including_time()` may be doing a full index scan or table scan on `aggregate_share_jobs`, and thus taking out a relation-level predicate lock. This could conflict with `put_aggregate_share_job()`. However, that would only happen every five minutes with this setup, so this is likely of low impact.
- Sampling the `pg_locks` view in a tight loop might give us visibility into what predicate locks are actually being taken (see the sketch after this list).
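A sketch of the kind of query that could be polled against `pg_locks` during the load test; `SIReadLock` rows are the SERIALIZABLE predicate locks, and which of `page`/`tuple` is populated tells you the lock granularity:

```sql
-- Run repeatedly while the load test is going (e.g. with psql's \watch 1).
-- SIReadLock entries are SERIALIZABLE predicate locks; which of page/tuple is
-- populated tells you whether the lock is relation-, page-, or tuple-level.
SELECT relation::regclass AS relation,
       CASE
         WHEN tuple IS NOT NULL THEN 'tuple'
         WHEN page  IS NOT NULL THEN 'page'
         ELSE 'relation'
       END AS granularity,
       count(*) AS locks
FROM pg_locks
WHERE mode = 'SIReadLock'
GROUP BY 1, 2
ORDER BY locks DESC;
```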
I tried increasing the `min_aggregation_job_size` on the leader, to get bigger aggregation jobs at a lower rate, and that did improve the conflict rate significantly. I have 30 reports per second being uploaded. I increased `min_aggregation_job_size` from 1 to 100, and this dropped the rate of aggregation jobs from 1 per second to 0.3 per second. The leader was not negatively impacted by this: job step time only approximately doubled, even though each job is now much larger. The helper's transaction error rate dropped from 33% to 0%. It's hard to pin this effect on one mechanism, since it has an across-the-board effect on the work the helper is doing, but the first hypothesis certainly fits. The leader's aggregation job creator transaction error rate also benefited: it was 26% and rising before, and it dropped to 7% (too noisy to tell yet whether it's still climbing).
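As a cross-check on the application-side error-rate numbers above, the rollback ratio visible in `pg_stat_database` is a cheap database-side proxy (this is just standard Postgres statistics, not how the numbers above were gathered):

```sql
-- Rollback ratio per database since stats were last reset. Rollbacks include
-- every aborted transaction (not only serialization failures), so treat this
-- as an upper bound on the conflict rate.
SELECT datname,
       xact_commit,
       xact_rollback,
       round(100.0 * xact_rollback / nullif(xact_commit + xact_rollback, 0), 1)
         AS rollback_pct
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY rollback_pct DESC NULLS LAST;
```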
For now, the load test's configuration could clearly use some tuning, and we've got some directions for further investigation.
This was effectively solved by #1037 & #1038.
@divergentdave did some performance testing and found that:
We should look into optimizing these queries.
https://isrg.slack.com/archives/C0167LT4C73/p1674490626272259
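If the queries to optimize are the `bigint`-equality plus `tsrange`-containment lookups on `aggregate_share_jobs` discussed above, one standard Postgres option (my assumption, not something decided in this issue) is a GiST index covering both columns; column names here are again hypothetical:

```sql
-- Column names are hypothetical. btree_gist lets the bigint task column share
-- a GiST index with the tsrange column.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX aggregate_share_jobs_task_interval_idx
    ON aggregate_share_jobs
 USING gist (task_id, batch_interval);

-- A lookup of the form
--   WHERE task_id = $1 AND batch_interval @> $2::timestamp
-- can then be answered by an index scan instead of a wider scan, which also
-- narrows the predicate locks taken under SERIALIZABLE.
```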