cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Backup job stays in 'running' status when using 'detached' option and timezone is ahead of UTC #123207

Closed. smcvey closed this issue 1 month ago.

smcvey commented 6 months ago

Describe the problem

A backup job remains in 'running' status indefinitely.

> show job 964456310328033281;                                                                                                                                        
        job_id       | job_type |                                      description                                      | statement | user_name | status  | running_status |        created         | started | finished |        modified        | fraction_completed | error | coordinator_id | trace_id | execution_errors
---------------------+----------+---------------------------------------------------------------------------------------+-----------+-----------+---------+----------------+------------------------+---------+----------+------------------------+--------------------+-------+----------------+----------+-------------------
  964456310328033281 | BACKUP   | BACKUP INTO '/2024/04/29-135827.97' IN 'nodelocal://1/backup' WITH OPTIONS (detached) |           | root      | running | NULL           | 2024-04-29 14:58:27+00 | NULL    | NULL     | 2024-04-29 13:58:27+00 |                  0 |       |              1 |        0 | NULL

To Reproduce

Start a CockroachDB cluster; I reproduced this with both v23.2.latest and v24.1.0-beta.2. Connect to the cluster with DBeaver and then run:

BACKUP INTO 'nodelocal://1/backup' WITH detached;
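
To inspect the most recent backup job right after submitting it, a query along these lines can be used (a sketch; only the columns relevant here are shown):

SELECT job_id, status, created, started, finished
  FROM [SHOW JOBS]
 WHERE job_type = 'BACKUP'
 ORDER BY created DESC
 LIMIT 1;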

The job is created in the jobs table: it has an id and is in 'running' status, but there appears to be no further activity on it. No files are written to the filesystem. On my cluster the backup should be near-instantaneous, as it's a new, empty cluster.

This behaviour occurs whether or not DBeaver's 'Auto-commit' connection option is enabled.

The number of nodes in the cluster also doesn't seem to matter; I tested both a single-node cluster and a multi-node cluster.

Pausing and then resuming the job just puts it back in 'running', but it still makes no progress. Restarting the coordinator node also has no effect.
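
For reference, the pause/resume cycle was roughly the following (a sketch; substitute the actual job id):

PAUSE JOB 964456310328033281;
-- wait until SHOW JOB reports 'paused', then:
RESUME JOB 964456310328033281;
-- status goes back to 'running', but started stays NULL and fraction_completed stays 0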

Running cockroach debug job-trace <job id> produces an empty file.

Expected behavior

I'd expect the job to begin writing backup files almost immediately.

Jira issue: CRDB-38245

dt commented 6 months ago

How long, low-bound, is "indefinitely" here?

I ask since detached jobs can take up to a minute to be started, unlike non-detached jobs which are started as soon as the transaction that creates them commits.
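
For a detached backup, a blocking check along these lines (a sketch) distinguishes "not yet adopted" from genuinely stuck, since it only returns once the job reaches a terminal state:

-- blocks until the job is succeeded, failed, or canceled
SHOW JOB WHEN COMPLETE 964456310328033281;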

smcvey commented 6 months ago

How long, low-bound, is "indefinitely" here?

I waited around 30 minutes before getting bored and giving up. This wasn't like a regular detached job, where the coordinator id is initially NULL but then gets assigned and the job completes. The job had a coordinator id but wasn't doing anything.

blathers-crl[bot] commented 6 months ago

cc @cockroachdb/disaster-recovery

smcvey commented 6 months ago

Further testing shows this is timezone-related. Here are two jobs: the first was created in cockroach sql and completed immediately; the second is from DBeaver. The jobs were created within the same minute, but their created timestamps are an hour apart:

root@localhost:26257/defaultdb> show job 964730286257078273;                                                                                                                                        
        job_id       | job_type |                                      description                                      | statement | user_name |  status   | running_status |            created            |            started            |           finished            |           modified            | fraction_completed | error | coordinator_id |      trace_id       |           last_run            |           next_run            | num_runs | execution_errors
---------------------+----------+---------------------------------------------------------------------------------------+-----------+-----------+-----------+----------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+--------------------+-------+----------------+---------------------+-------------------------------+-------------------------------+----------+-------------------
  964730286257078273 | BACKUP   | BACKUP INTO '/2024/04/30-131158.79' IN 'nodelocal://1/backup' WITH OPTIONS (detached) |           | root      | succeeded | NULL           | 2024-04-30 13:11:58.790343+00 | 2024-04-30 13:12:00.238832+00 | 2024-04-30 13:12:00.993128+00 | 2024-04-30 13:12:00.993136+00 |                  1 |       |              2 | 4386999847817096165 | 2024-04-30 13:12:00.238833+00 | 2024-04-30 13:12:30.238833+00 |        1 | NULL
(1 row)

Time: 20ms total (execution 20ms / network 0ms)

root@localhost:26257/defaultdb> show job 964730319720349697;                                                                                                                                        
        job_id       | job_type |                                      description                                      | statement | user_name | status  | running_status |            created            | started | finished |           modified            | fraction_completed | error | coordinator_id | trace_id | last_run |           next_run            | num_runs | execution_errors
---------------------+----------+---------------------------------------------------------------------------------------+-----------+-----------+---------+----------------+-------------------------------+---------+----------+-------------------------------+--------------------+-------+----------------+----------+----------+-------------------------------+----------+-------------------
  964730319720349697 | BACKUP   | BACKUP INTO '/2024/04/30-131209.00' IN 'nodelocal://1/backup' WITH OPTIONS (detached) |           | root      | running | NULL           | 2024-04-30 14:12:09.001681+00 | NULL    | NULL     | 2024-04-30 13:12:09.001681+00 |                  0 |       |           NULL |        0 | NULL     | 2024-04-30 14:12:09.001681+00 |        0 | NULL
(1 row)

Waiting for that one extra hour does allow the job to eventually run and succeed.

Internal discussion shows that this is reproducible without DBeaver:

SET TIME ZONE '+1';
BACKUP TO 'nodelocal://1/nodelocal' WITH detached;
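
A minimal sketch of the kind of mix-up this points at (an illustration only, not necessarily the actual code path in the jobs system): with a non-UTC session time zone, casting now() to a plain TIMESTAMP keeps the local wall-clock value and drops the offset, so if that value is later treated as UTC it lands an hour in the future, matching the hour-ahead created/next_run values above.

SET TIME ZONE '+1';
SELECT now() AS with_zone,        -- absolute time, displayed in the session's +1 offset
       now()::timestamp AS naive; -- same wall-clock digits with the offset dropped;
                                  -- read back as UTC this is one hour in the future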

Since it's not specific to DBeaver, I'll remove 'DBeaver' from the title.

A workaround in the meantime is to set DBeaver's time zone from 'Default' to UTC.
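
The equivalent at the SQL level (a sketch) is to pin the session to UTC before creating the detached job:

SET TIME ZONE 'UTC';
BACKUP INTO 'nodelocal://1/backup' WITH detached;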

benbardin commented 3 months ago

A workaround exists (changing the session time zone offset in DBeaver), and the fix is more complex than we expected. Downgrading to P3.