Closed smcvey closed 1 month ago
How long, low-bound, is "indefinitely" here?
I ask since detached
jobs can take up to a minute to be started, unlike non-detached jobs which are started as soon as the transaction that creates them commits.
How long, low-bound, is "indefinitely" here?
I waited around 30 minutes before getting bored and giving up. This wasn't like a regularly detached job where the coordinator id is initially NULL but then gets assigned and completes. The job had a coordinator id but wasn't doing anything.
cc @cockroachdb/disaster-recovery
Further testing shows this is timezone related. These are two jobs, first was created in cockroach sql
and completed immediately. The second is from DBeaver. The jobs were created within the same minute but have an hour difference in start time:
root@localhost:26257/defaultdb> show job 964730286257078273;
job_id | job_type | description | statement | user_name | status | running_status | created | started | finished | modified | fraction_completed | error | coordinator_id | trace_id | last_run | next_run | num_runs | execution_errors
---------------------+----------+---------------------------------------------------------------------------------------+-----------+-----------+-----------+----------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+--------------------+-------+----------------+---------------------+-------------------------------+-------------------------------+----------+-------------------
964730286257078273 | BACKUP | BACKUP INTO '/2024/04/30-131158.79' IN 'nodelocal://1/backup' WITH OPTIONS (detached) | | root | succeeded | NULL | 2024-04-30 13:11:58.790343+00 | 2024-04-30 13:12:00.238832+00 | 2024-04-30 13:12:00.993128+00 | 2024-04-30 13:12:00.993136+00 | 1 | | 2 | 4386999847817096165 | 2024-04-30 13:12:00.238833+00 | 2024-04-30 13:12:30.238833+00 | 1 | NULL
(1 row)
Time: 20ms total (execution 20ms / network 0ms)
root@localhost:26257/defaultdb> show job 964730319720349697;
job_id | job_type | description | statement | user_name | status | running_status | created | started | finished | modified | fraction_completed | error | coordinator_id | trace_id | last_run | next_run | num_runs | execution_errors
---------------------+----------+---------------------------------------------------------------------------------------+-----------+-----------+---------+----------------+-------------------------------+---------+----------+-------------------------------+--------------------+-------+----------------+----------+----------+-------------------------------+----------+-------------------
964730319720349697 | BACKUP | BACKUP INTO '/2024/04/30-131209.00' IN 'nodelocal://1/backup' WITH OPTIONS (detached) | | root | running | NULL | 2024-04-30 14:12:09.001681+00 | NULL | NULL | 2024-04-30 13:12:09.001681+00 | 0 | | NULL | 0 | NULL | 2024-04-30 14:12:09.001681+00 | 0 | NULL
(1 row)
Waiting for that one extra hour does allow the job to eventually run and succeed.
Internal discussion shows that this is reproducable without DBeaver:
set time zone '+1';
backup to 'nodelocal://1/nodelocal' with detached;
Based on the fact it's not specific to DBeaver, I'll remove 'DBeaver' from the title.
A workaround in the meantime is to set DBeaver's time zone from 'Default' to UTC.
A workaround exists - changing the session offset in dbeaver - and the fix is more complex than we expected. Downgrading to P3.
Describe the problem
A backup job remains in 'running' status indefinitely.
To Reproduce
Started a Cockroach cluster, I used both v23.2.latest and v24.1.0-beta.2. Connect to the cluster with DBeaver and then run:
The job is created in the table, it has an id and is in 'running'. There appears to be no further activity with the job. There are no files written to the filesystem. On my cluster, the backup should be near-instantaneous as it's a new, empty cluster.
This behaviour happens whether or not the 'Auto-commit' DBeaver connection option is enabled or not.
The number of nodes in the cluster also doesn't seem to matter, I tested both a single node cluter and a multi-node cluster.
Pausing and then resuming the job just puts it back to 'running' but it still makes no progress. Restarting the coordinate node also has no effect.
Running
cockroach debug job-trace <job id>
produces an empty file.Expected behavior I'd expect the job to begin writing backup files almost immediately.
Jira issue: CRDB-38245