Previously: https://github.com/divviup/janus/issues/519
Conflicts aren't necessarily bad (they are an expected detail of the "serializable" transaction isolation level), but the 30% conflict rate on helper transactions is definitely unexpected to me. The helper mainly performs reads & writes based on requests from the leader, which should be sending only one request for a given aggregation job/collection job at a time. If we are indeed receiving only one request per job at a time, we should investigate why the conflicts are occurring -- I would not expect requests for different jobs to inherently overlap. (Maybe something along the lines of a too-wide predicate lock?)
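As background for the "too-wide predicate lock" guess: under SERIALIZABLE, a read satisfied by a sequential scan takes an SIRead predicate lock on the entire relation, so two transactions that never touch the same row can still be reported as conflicting. A minimal sketch against a made-up two-row table (not Janus's schema) that produces this class of failure in two psql sessions:

```sql
-- Minimal sketch against a made-up two-row table; not Janus's schema.
CREATE TABLE jobs (id bigint PRIMARY KEY, state text);
INSERT INTO jobs VALUES (1, 'new'), (2, 'new');

-- Session A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM jobs WHERE state = 'new';  -- no index on state: seq scan,
                                                -- so the SIRead lock covers the whole relation

-- Session B (concurrently):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM jobs WHERE state = 'new';  -- same relation-wide SIRead lock

-- Session A:
UPDATE jobs SET state = 'running' WHERE id = 1;
COMMIT;

-- Session B writes a *different* row, but the relation-level read locks still
-- form a rw-dependency cycle, so this transaction fails (at the UPDATE or at
-- COMMIT, depending on timing) with:
--   ERROR: could not serialize access due to read/write dependencies among transactions
UPDATE jobs SET state = 'running' WHERE id = 2;
COMMIT;
```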
In the helper's database, statements from the following methods appear most frequently in a sample of conflict error messages:

- `put_report_aggregation()`, makes narrow lookups in `tasks`, `client_reports`, and `aggregation_jobs`, and writes a row to `report_aggregations`
- `get_aggregate_share_jobs_including_time()`, does a lookup by `bigint` equality and `tsrange` containment on `aggregate_share_jobs`, which we know gets suboptimal query plans (see the EXPLAIN sketch below this list)
- `put_report_share()`, makes a narrow lookup in `tasks`, and writes a row to `client_reports`
- `check_report_share_exists()`, makes narrow lookups on `tasks` and `client_reports`
- `update_aggregation_job()`, makes a narrow lookup on `tasks`, and updates a row in `aggregation_jobs`
- `COMMIT`
- `get_batch_aggregation()`, makes narrow lookups on `tasks` and `batch_aggregations`
- `update_batch_aggregation()`, makes a narrow lookup on `tasks`, and updates a row of `batch_aggregations`
- `update_report_aggregation()`, makes narrow lookups on `tasks`, `client_reports`, and `aggregation_jobs`, and updates a row of `report_aggregations`

The error message was almost always "could not serialize access due to read/write dependencies among transactions", though there were a small handful of "could not serialize access due to concurrent update". Some errors with "current transaction is aborted, commands ignored until end of transaction block" also showed up, and may skew the above numbers for individual statements. The detail field had the following reason code messages, in decreasing order of frequency.
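Separately, on the note above that the `aggregate_share_jobs` lookup gets suboptimal plans: a quick way to confirm is to run that query shape through EXPLAIN and check whether the `tsrange` containment is served by an index. This is only a sketch; the column names (`task_id`, `batch_interval`) are assumptions, not the actual Janus schema.

```sql
-- Column names are assumed (task_id, batch_interval); the real schema may differ.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM aggregate_share_jobs
WHERE task_id = 42                                         -- bigint equality
  AND batch_interval @> '2023-01-23 12:00:00'::timestamp;  -- tsrange containment
-- A plain btree index can satisfy the task_id equality but not the @> containment,
-- so the plan may filter many rows (or fall back to a wider scan), which also
-- means coarser SIRead predicate locks under SERIALIZABLE.
```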
I tried shutting off the helper's aggregation job creator, aggregation job driver, and collect job driver components. This had no effect on the aggregator's transaction error rate, confirming that we're just dealing with intra-component transaction conflicts.
The statements I saw in transaction conflict errors are clearly skewed towards those that get executed the most often, once per report rather than once per aggregation job. And since conflicts are a decidedly non-local phenomenon, looking at the failing statement alone leaves out a lot of information. I'd like to see a sample of the conflict graphs that led to these errors, but I don't know how practical that is. Here are a few hypotheses of what could be causing elevated conflict errors:
- `get_aggregate_share_jobs_including_time()` may be doing a full index scan or table scan on `aggregate_share_jobs`, and thus taking out a relation-level predicate lock. This could conflict with `put_aggregate_share_job()`. However, that would only happen every five minutes with this setup, so this is likely of low impact.
- Sampling the `pg_locks` view in a tight loop might give us visibility into what predicate locks are actually being taken (see the sketch after this list).
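A sketch of the kind of query that could be polled against `pg_locks` during the load test; `SIReadLock` rows are the SERIALIZABLE predicate locks, and which of `page`/`tuple` is populated tells you the lock granularity:

```sql
-- Run repeatedly while the load test is going (e.g. with psql's \watch 1).
-- SIReadLock entries are SERIALIZABLE predicate locks; which of page/tuple is
-- populated tells you whether the lock is relation-, page-, or tuple-level.
SELECT relation::regclass AS relation,
       CASE
         WHEN tuple IS NOT NULL THEN 'tuple'
         WHEN page  IS NOT NULL THEN 'page'
         ELSE 'relation'
       END AS granularity,
       count(*) AS locks
FROM pg_locks
WHERE mode = 'SIReadLock'
GROUP BY 1, 2
ORDER BY locks DESC;
```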
I tried increasing the `min_aggregation_job_size` on the leader, to get bigger aggregation jobs at a lower rate, and that did improve the conflict rate significantly. I have 30 reports per second being uploaded. I increased `min_aggregation_job_size` from 1 to 100, and this dropped the rate of aggregation jobs from 1 per second to 0.3 per second. The leader was not negatively impacted by this: job step time only approximately doubled, even though each job is now much larger. The helper's transaction error rate dropped from 33% to 0%. It's hard to pin this effect on one mechanism, since it has an across-the-board effect on the work the helper is doing, but the first hypothesis certainly fits. The leader's aggregation job creator transaction error rate also benefited: it was 26% and rising before, and it dropped to 7% (too noisy to tell yet whether it's still climbing).
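As a cross-check on the application-side error-rate numbers above, the rollback ratio visible in `pg_stat_database` is a cheap database-side proxy (this is just standard Postgres statistics, not how the numbers above were gathered):

```sql
-- Rollback ratio per database since stats were last reset. Rollbacks include
-- every aborted transaction (not only serialization failures), so treat this
-- as an upper bound on the conflict rate.
SELECT datname,
       xact_commit,
       xact_rollback,
       round(100.0 * xact_rollback / nullif(xact_commit + xact_rollback, 0), 1)
         AS rollback_pct
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY rollback_pct DESC NULLS LAST;
```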
For now, the load test's configuration could clearly use some tuning, and we've got some directions for further investigation.
This was effectively solved by #1037 & #1038.
@divergentdave did some performance testing and found that:
We should look into optimizing these queries.
https://isrg.slack.com/archives/C0167LT4C73/p1674490626272259
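If the queries to optimize are the `bigint`-equality plus `tsrange`-containment lookups on `aggregate_share_jobs` discussed above, one standard Postgres option (my assumption, not something decided in this issue) is a GiST index covering both columns; column names here are again hypothetical:

```sql
-- Column names are hypothetical. btree_gist lets the bigint task column share
-- a GiST index with the tsrange column.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX aggregate_share_jobs_task_interval_idx
    ON aggregate_share_jobs
 USING gist (task_id, batch_interval);

-- A lookup of the form
--   WHERE task_id = $1 AND batch_interval @> $2::timestamp
-- can then be answered by an index scan instead of a wider scan, which also
-- narrows the predicate locks taken under SERIALIZABLE.
```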