cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.96k stars 3.79k forks source link

sql: optimize query for find-running-jobs-of-type #107405

Open mgartner opened 1 year ago

mgartner commented 1 year ago

The RunningJobExists function issues this query to check for running auto-stats jobs:

SELECT id FROM public.jobs
WHERE job_type IN ('CREATE STATS', 'AUTO CREATE STATS')
  AND status IN ('running', 'pending', 'cancel-requested', 'pause-requested', 'reverting', 'paused')
ORDER BY created
LIMIT 1;

The query plan includes an inefficient index join:

  planning time: 750µs
  execution time: 617ms
  distribution: local
  vectorized: true
  rows read from KV: 101,520 (9.9 MiB, 25 gRPC calls)
  cumulative time spent in KV: 614ms
  maximum memory usage: 60 MiB
  network usage: 0 B (0 messages)
  sql cpu time: 84ms
  regions: aws-us-east-1

  • top-k
  │ nodes: n6
  │ regions: aws-us-east-1
  │ actual row count: 0
  │ estimated max memory allocated: 0 B
  │ estimated max sql temp disk usage: 0 B
  │ sql cpu time: 18µs
  │ order: +created
  │ k: 1
  │
  └── • filter
      │ nodes: n6
      │ regions: aws-us-east-1
      │ actual row count: 0
      │ sql cpu time: 2ms
      │ filter: status IN ('cancel-requested', 'pause-requested', 'paused', 'pending', 'reverting', 'running')
      │
      └── • index join
          │ nodes: n6
          │ regions: aws-us-east-1
          │ actual row count: 50,760
          │ KV time: 413ms
          │ KV contention time: 0µs
          │ KV rows read: 50,760
          │ KV bytes read: 7.3 MiB
          │ KV gRPC calls: 24
          │ estimated max memory allocated: 57 MiB
          │ estimated max sql temp disk usage: 0 B
          │ sql cpu time: 72ms
          │ table: jobs@primary
          │
          └── • scan
                nodes: n6
                regions: aws-us-east-1
                actual row count: 50,760
                KV time: 201ms
                KV contention time: 0µs
                KV rows read: 50,760
                KV bytes read: 2.7 MiB
                KV gRPC calls: 1
                estimated max memory allocated: 2.7 MiB
                sql cpu time: 11ms
                missing stats
                table: jobs@jobs_job_type_idx
                spans: [/'AUTO CREATE STATS' - /'AUTO CREATE STATS'] [/'CREATE STATS' - /'CREATE STATS']

This can cause excessive resource utilization when there are many jobs in the jobs table and the query is run frequently. There's a few potential ways we can make this more efficient:

  1. Omit ORDER BY created from the query plan. This should make the optimizer prefer a full table scan, which, as long as the table isn't HUGE, should be faster than the plan with the index join.
  2. Add an index on (job_type, status) and omit ORDER BY created from the query.
  3. Add an index on (job_type, status) STORING (created) to avoid the index join.

Jira issue: CRDB-30024

mgartner commented 1 year ago

Removing the ORDER BY created may not be possible due to this: https://github.com/cockroachdb/cockroach/blob/92b05ab40ecf023b9809593e65f0f20302d1a82e/pkg/jobs/utils.go#L86C1-L90.

But it might be ok to do so when ignoreJobID is jobspb.InvalidJobID.

mgartner commented 1 year ago

I'm applying the release-blocker label so this won't get lost. I really think we need to do something here before the release.

blathers-crl[bot] commented 1 year ago

Hi @mgartner, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

michae2 commented 12 months ago

[triage] we discussed during our team meeting today, plan is to merge this PR with an additional query hint, and then make the index change in 24.1.

mgartner commented 11 months ago

107589 has been merged so I'm removing the release-blocker label and moving this to the 24.1 bucket.

mgartner commented 3 months ago

We think that the improvements in #107589 could be sufficient. Moving to the backlog until we find evidence that more work is needed here.