cockroachdb / cockroach


Jobs: Slow jobs table query performance leads to cluster overload #130440

Open Jeremyyang920 opened 1 month ago

Jeremyyang920 commented 1 month ago

Describe the problem

When MOLT Fetch runs against a large number of partitioned tables in the PG -> CRDB case, each partition acts as an individual table and can generate N IMPORT jobs depending on the size of the partition. With, say, 300 partitions, you could have anywhere from 300 to 300*N job records.

Because of the way Fetch operates, it issues the IMPORT statement without the detached option and therefore never knows the job ID. To look up the progress of the import job, it runs a query like:

WITH x AS (SHOW JOBS)
SELECT description, started, fraction_completed
FROM x
WHERE job_type='IMPORT'
    AND description LIKE '%%%s.%s(%%'
    AND started > '%s'
ORDER BY created DESC

to find the most recent matching record. However, as the jobs table grows, this query starts to cause an immense amount of cluster overhead. In the customer's case, node 4, which was handling the jobs table queries, was overloaded to 95% CPU (first screenshot). The second screenshot compares running the jobs query versus not running it; you can see the large difference in cluster utilization as the jobs table continues to grow.
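For illustration, here is a minimal Go sketch of how a polling query of this shape might be assembled. The helper, schema, and table names are hypothetical (not the actual MOLT Fetch code); the point is to show what the %% placeholders in the LIKE pattern resolve to.

package main

import (
    "fmt"
    "time"
)

// buildJobsQuery assembles a polling query of the shape shown above.
// schema, table, and startedAfter are hypothetical example parameters.
func buildJobsQuery(schema, table string, startedAfter time.Time) string {
    const tmpl = `WITH x AS (SHOW JOBS)
SELECT description, started, fraction_completed
FROM x
WHERE job_type = 'IMPORT'
    AND description LIKE '%%%s.%s(%%'
    AND started > '%s'
ORDER BY created DESC`
    return fmt.Sprintf(tmpl, schema, table, startedAfter.Format("2006-01-02 15:04:05"))
}

func main() {
    // The LIKE pattern resolves to e.g. '%public.orders(%', matching the
    // statement text stored in the job description for that table's IMPORT.
    fmt.Println(buildJobsQuery("public", "orders", time.Now().Add(-time.Hour)))
}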

The jobs table query for the customer took up to 50s and in my test reproduction it went as high as 25 seconds.

To Reproduce

Create a large number of jobs table records, poll the table for the status of certain running jobs, and observe the cluster's CPU usage spike.
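As a rough sketch of one way to reproduce this (my assumption of a setup, not the exact one used above): inflate the jobs table with many cheap schema-change jobs instead of real IMPORT jobs, then time a SHOW JOBS-based polling query. Connection string, table name, and counts are placeholders.

package main

import (
    "database/sql"
    "fmt"
    "log"
    "time"

    _ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
    // Placeholder connection string; point this at a test cluster.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS repro (id INT PRIMARY KEY, v INT)`); err != nil {
        log.Fatal(err)
    }

    // Every CREATE INDEX / DROP INDEX is itself a job, so this loop grows the
    // jobs table without needing source data for real IMPORT jobs.
    for i := 0; i < 1000; i++ {
        if _, err := db.Exec(`CREATE INDEX repro_v_idx ON repro (v)`); err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(`DROP INDEX repro@repro_v_idx`); err != nil {
            log.Fatal(err)
        }
    }

    // Time a polling query of the same shape as the one in this issue.
    const poll = `WITH x AS (SHOW JOBS) SELECT count(*) FROM x WHERE job_type = 'IMPORT'`
    start := time.Now()
    var n int
    if err := db.QueryRow(poll).Scan(&n); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("found %d IMPORT jobs in %s\n", n, time.Since(start))
}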

Expected behavior

Jobs table queries should return quickly, or at least not impose significant load on the cluster.

Additional data / screenshots

Attached is a statement bundle for a query that took ~3s; for what the query is doing, and given the load on the cluster, it should have been faster.

stmt-bundle-1000509387754045441.zip

What was the impact?

The node handling the jobs table query became overloaded and spiked in CPU usage, which can cause issues for production applications.

Jira issue: CRDB-42053

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/disaster-recovery

blathers-crl[bot] commented 1 month ago

Hi @Jeremyyang920, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 month ago

Hi @benbardin, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

benbardin commented 1 month ago

Which CRDB version?

Jeremyyang920 commented 1 month ago

@benbardin Whoops. The customer was on 23.2.3. I replicated this on 24.2.

msbutler commented 1 month ago

@Jeremyyang920 could you provide more evidence that the job system is at fault here? Or is your workload (i.e., the polling query) doing an unnecessary amount of work? For example,

Jeremyyang920 commented 1 month ago

@msbutler Let me see what I can do. I still have my source DB with 1000 partitions on PG, which should let me replicate this.

In the code, the polling happens once every 10 seconds using the cockroach retry package: it waits for the last query to finish, sees that fraction_completed is not 1, then waits 10 seconds before the next polling check (see the sketch below). There could typically be at most table-concurrency queries asking the jobs table for data, since each table being migrated is essentially its own independent process. I think I noticed the slowdown once we got into the hundreds of records, but I didn't pay close attention to the count and was mostly looking at the DB Console metrics. However, in this case, because we were migrating N partitions into a single CRDB table, only one IMPORT job was active at a time (otherwise the other partitions would have run into the "table is offline" error), so there was only one query asking the jobs table for information.
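A minimal sketch of that loop, using a plain 10-second sleep in place of the actual retry-package wiring; the query string and connection details are placeholders rather than the real Fetch code:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq"
)

// pollImportProgress approximates the behavior described above: run the jobs
// query, and if the import is not yet done, wait 10 seconds and check again.
func pollImportProgress(db *sql.DB, query string) error {
    for {
        var description string
        var started time.Time
        var fraction float64
        if err := db.QueryRow(query).Scan(&description, &started, &fraction); err != nil {
            return err
        }
        if fraction >= 1 {
            log.Printf("import finished: %s", description)
            return nil
        }
        log.Printf("import %.0f%% complete; next check in 10s", fraction*100)
        time.Sleep(10 * time.Second)
    }
}

func main() {
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Placeholder query of the same shape as the one shown earlier.
    q := `WITH x AS (SHOW JOBS)
SELECT description, started, fraction_completed
FROM x
WHERE job_type = 'IMPORT' AND description LIKE '%public.orders(%'
ORDER BY created DESC
LIMIT 1`
    if err := pollImportProgress(db, q); err != nil {
        log.Fatal(err)
    }
}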

If we had the job ID it should help, since I think the lookup can use the job ID much more effectively in a scan. The main issue is that we don't run IMPORT with the detached option, because we wait to receive errors over the wire in order to perform our retries. If we did run detached, we would still need to poll by job ID, but if something like SHOW JOB <id> is more efficient, this problem could be alleviated. From talking with TSE, though, I don't think this is the first time jobs table queries have been slow.
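For what it's worth, a rough sketch of what detached-plus-job-ID polling could look like. This is only an illustration under my assumptions (IMPORT ... WITH detached returning the job ID, then a lookup against crdb_internal.jobs by job_id); the table, file URL, and connection string are placeholders, and I'm not claiming this is how Fetch should be restructured.

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq"
)

func main() {
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // With detached, IMPORT returns immediately with the job ID instead of
    // blocking until the job finishes.
    var jobID int64
    err = db.QueryRow(
        `IMPORT INTO public.orders CSV DATA ('userfile:///orders.csv') WITH detached`,
    ).Scan(&jobID)
    if err != nil {
        log.Fatal(err)
    }

    // Poll by job ID rather than filtering every SHOW JOBS row on description.
    for {
        var status string
        var fraction float64
        err := db.QueryRow(
            `SELECT status, COALESCE(fraction_completed, 0) FROM crdb_internal.jobs WHERE job_id = $1`,
            jobID,
        ).Scan(&status, &fraction)
        if err != nil {
            log.Fatal(err)
        }
        if status == "succeeded" || status == "failed" || status == "canceled" {
            log.Printf("job %d finished with status %s", jobID, status)
            break
        }
        log.Printf("job %d: %s, %.0f%% complete", jobID, status, fraction*100)
        time.Sleep(10 * time.Second)
    }
}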