cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

RESTORE never ends #85180

Open FeiYangtze opened 2 years ago

FeiYangtze commented 2 years ago

I have a database, KDB, with about 14 GB of data. Its RESTORE has been running for 14 hours without finishing, and it looks like it will never end. Please give me some advice.

I have tried both a local backup and an S3 backup, and the result is the same.

Jira issue: CRDB-1808

6060

This time I used userfile storage for the backup and restore, and the restore still had not completed after more than 60 hours.

blathers-crl[bot] commented 2 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/bulk-io

adityamaru commented 2 years ago

Hi @FeiYangtze, thanks for opening an issue!

Can you run SHOW JOBS and see if the restore progress is stuck at a particular number or is progressing but slowly?
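One way to tell "stuck" from "slow" is to note the `fraction_completed` value that SHOW JOBS reports for the restore a few times, some minutes apart, and compare the samples. The Python sketch below does that comparison; the `Sample` type, the `classify_progress` function and the `min_delta` threshold are hypothetical illustrations, not part of CockroachDB.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a job's progress (hypothetical helper type).

    seconds:  wall-clock time of the observation, in seconds.
    fraction: the fraction_completed value shown by SHOW JOBS (0.0 to 1.0).
    """
    seconds: float
    fraction: float


def classify_progress(samples, min_delta=1e-4):
    """Return ("stuck" | "progressing" | "unknown", eta_seconds_or_None).

    The job counts as stuck if fraction_completed moved by less than
    min_delta between the first and last sample (a drop, e.g. after a
    job restart, also counts as no progress).
    """
    if len(samples) < 2:
        return "unknown", None
    first, last = samples[0], samples[-1]
    elapsed = last.seconds - first.seconds
    if elapsed <= 0:
        return "unknown", None
    progress = last.fraction - first.fraction
    if progress < min_delta:
        return "stuck", None
    rate = progress / elapsed                # fraction completed per second
    eta = (1.0 - last.fraction) / rate       # seconds left at the current rate
    return "progressing", eta
```

If the value stays flat across samples taken well apart, the job is more likely stuck than slow; if it moves, the ETA gives a rough idea of how slow.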

Additionally, I'd need:

To debug this further. If you'd prefer not to share the zip publicly, you can send it via https://support.cockroachlabs.com/hc/en-us or hop into our community Slack and message it to me.

FeiYangtze commented 2 years ago

@adityamaru Thanks for your reply!

  1. Cockroach version:

    [root@localhost ~]# cockroach version
    Build Tag:        v21.1.10
    Build Time:       2021/10/07 02:39:03
    Distribution:     CCL
    Platform:         linux amd64 (x86_64-unknown-linux-gnu)
    Go Version:       go1.15.14
    C Compiler:       gcc 6.5.0
    Build Commit ID:  a6daab16ee8a1c15abc4c4a8a425e46d12033b5c
    Build Type:       release
  2. Cluster configuration:

    [root@localhost ~]# cat /etc/systemd/system/cockroach.service
    [Unit]
    Description=Cockroach Database
    Requires=network.target

    [Service]
    Type=notify
    WorkingDirectory=/var/data/cockroach
    ExecStart=/usr/local/bin/cockroach start-single-node --certs-dir=certs --listen-addr=192.168.56.104 --http-addr=:8864 --external-io-dir=/var/data/cockroach/cockroach-backup
    TimeoutStopSec=60
    Restart=always
    RestartSec=10
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=cockroach
    User=cockroach

    [Install]
    WantedBy=default.target

FeiYangtze commented 2 years ago
  1. SHOW JOBS: I ran the same operations on TPCC and KDB. The TPCC restore succeeded, but the KDB restore has still not completed after more than 15 hours, and the SHOW JOBS command itself hangs:

    [root@localhost 09:11:20 ~]# cockroach sql --host=192.168.56.104
    #
    # Welcome to the CockroachDB SQL shell.
    # All statements must be terminated by a semicolon.
    # To exit, type: \q.
    #
    # Server version: CockroachDB CCL v21.1.10 (x86_64-unknown-linux-gnu, built 2021/10/07 02:39:03, go1.15.14) (same version as client)
    # Cluster ID: 675a0cf4-4ce1-4116-bd3c-5f210761eb60
    #
    # Enter \? for a brief introduction.
    #
    root@192.168.56.104:26257/defaultdb> show databases;
      database_name | owner | primary_region | regions | survival_goal
    ----------------+-------+----------------+---------+----------------
      defaultdb     | root  | NULL           | {}      | NULL
      system        | node  | NULL           | {}      | NULL
      tpcc          | root  | NULL           | {}      | NULL
    (3 rows)

    Time: 8ms total (execution 7ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> show jobs;

FeiYangtze commented 2 years ago
  1. Operation record:

    [root@localhost 17:25:06 /var/data/cockroach/cockroach-backup]# cockroach sql --host=192.168.56.104
    #
    # Welcome to the CockroachDB SQL shell.
    # All statements must be terminated by a semicolon.
    # To exit, type: \q.
    #
    # Server version: CockroachDB CCL v21.1.10 (x86_64-unknown-linux-gnu, built 2021/10/07 02:39:03, go1.15.14) (same version as client)
    # Cluster ID: 675a0cf4-4ce1-4116-bd3c-5f210761eb60
    #
    # Enter \? for a brief introduction.
    #
    root@192.168.56.104:26257/defaultdb> show databases;
      database_name | owner  | primary_region | regions | survival_goal
    ----------------+--------+----------------+---------+----------------
      defaultdb     | root   | NULL           | {}      | NULL
      kdb           | kadmin | NULL           | {}      | NULL
      system        | node   | NULL           | {}      | NULL
      tpcc          | root   | NULL           | {}      | NULL
    (4 rows)

    Time: 1ms total (execution 1ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> show backups in 'nodelocal://self/tpcc-bak';
      path
    --------
    (0 rows)

    Time: 0ms total (execution 0ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> show backups in 'nodelocal://self/kdb-bak';
      path
    --------
    (0 rows)

    Time: 1ms total (execution 0ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> backup database tpcc into 'nodelocal://self/tpcc-bak' as of system time '-10s';
            job_id       |  status   | fraction_completed |  rows  | index_entries |   bytes
    ---------------------+-----------+--------------------+--------+---------------+-----------
      787172899703881729 | succeeded |                  1 | 599354 |         60000 |  81240084
    (1 row)

    Time: 427ms total (execution 427ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> show backups in 'nodelocal://self/tpcc-bak';
              path
    ------------------------
      /2022/08/12-092814.62
    (1 row)

    Time: 1ms total (execution 0ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> backup database kdb into 'nodelocal://self/kdb-bak' as of system time '-10s';
            job_id       |  status   | fraction_completed |   rows   | index_entries |    bytes
    ---------------------+-----------+--------------------+----------+---------------+-------------
      787172971711561729 | succeeded |                  1 | 16354519 |      20774128 |  7463248036
    (1 row)

    Time: 33.505s total (execution 33.505s / network 0.000s)

    root@192.168.56.104:26257/defaultdb> show backups in 'nodelocal://self/kdb-bak';
              path
    ------------------------
      /2022/08/12-092836.58
    (1 row)

    Time: 2ms total (execution 2ms / network 0ms)
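The paths returned by SHOW BACKUPS IN are the subdirectories that the later RESTORE ... FROM '<subdir>' IN '<collection>' statements name. They look like a date-based layout, /YYYY/MM/DD-HHMMSS.ff, and in this transcript they appear to be UTC (the shell prompt reads 17:25 local time while the backup is named 09:28). Assuming that layout and UTC, a small Python sketch can parse such a name, for example to pick the most recent backup in a collection; `parse_backup_subdir` and `newest_backup` are hypothetical helpers, not CockroachDB APIs.

```python
from datetime import datetime, timezone


def parse_backup_subdir(path):
    """Parse a backup subdirectory name such as '/2022/08/12-092814.62'.

    Assumes the layout '/YYYY/MM/DD-HHMMSS.ff' and treats the time as UTC
    (both are assumptions inferred from this transcript). %f accepts the
    two-digit fraction, so '.62' becomes 620000 microseconds.
    """
    parsed = datetime.strptime(path, "/%Y/%m/%d-%H%M%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def newest_backup(paths):
    """Return the subdirectory whose embedded timestamp is the latest."""
    return max(paths, key=parse_backup_subdir)


# Example with the two subdirectories from this transcript.
print(newest_backup(["/2022/08/12-092814.62", "/2022/08/12-092836.58"]))
```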

    root@192.168.56.104:26257/defaultdb> drop database tpcc cascade;
    DROP DATABASE

    Time: 423ms total (execution 423ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> drop database kdb cascade;
    DROP DATABASE

    Time: 6.618s total (execution 6.618s / network 0.000s)

    root@192.168.56.104:26257/defaultdb> show databases;
      database_name | owner | primary_region | regions | survival_goal
    ----------------+-------+----------------+---------+----------------
      defaultdb     | root  | NULL           | {}      | NULL
      system        | node  | NULL           | {}      | NULL
    (2 rows)

    Time: 1ms total (execution 1ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> restore database tpcc from '/2022/08/12-092814.62' in 'nodelocal://self/tpcc-bak';
            job_id       |  status   | fraction_completed |  rows  | index_entries |   bytes
    ---------------------+-----------+--------------------+--------+---------------+-----------
      787175157658157057 | succeeded |                  1 | 599354 |         60000 |  81240084
    (1 row)

    Time: 1.431s total (execution 1.431s / network 0.000s)

    root@192.168.56.104:26257/defaultdb> show databases;
      database_name | owner | primary_region | regions | survival_goal
    ----------------+-------+----------------+---------+----------------
      defaultdb     | root  | NULL           | {}      | NULL
      system        | node  | NULL           | {}      | NULL
      tpcc          | root  | NULL           | {}      | NULL
    (3 rows)

    Time: 1ms total (execution 1ms / network 0ms)

    root@192.168.56.104:26257/defaultdb> restore database kdb from '/2022/08/12-092836.58' in 'nodelocal://self/kdb-bak';
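A rough sanity check using only figures from this transcript: the TPCC restore moved 81,240,084 bytes in 1.431 s, and the KDB backup moved 7,463,248,036 bytes in 33.505 s. The Python sketch below extrapolates the TPCC restore rate linearly to the KDB backup size and compares the result with the 15+ hours reported for the stuck KDB restore. It assumes restore time scales with bytes, which is crude: KDB has about 20.8 million index entries against 60,000 for TPCC, so per-key costs could make it slower than this. Treat the output as an order-of-magnitude check, not a prediction; the function names are illustrative only.

```python
# Figures copied from the transcript (result rows and "Time:" lines).
TPCC_RESTORE_BYTES = 81_240_084
TPCC_RESTORE_SECONDS = 1.431
KDB_BACKUP_BYTES = 7_463_248_036
KDB_BACKUP_SECONDS = 33.505
# "More than 15 hours" without finishing, so this is only a lower bound.
OBSERVED_KDB_RESTORE_SECONDS = 15 * 3600


def throughput(num_bytes, seconds):
    """Average throughput in bytes per second."""
    return num_bytes / seconds


def naive_restore_estimate(num_bytes, bytes_per_second):
    """Seconds needed to process num_bytes at a fixed rate (linear model)."""
    return num_bytes / bytes_per_second


tpcc_rate = throughput(TPCC_RESTORE_BYTES, TPCC_RESTORE_SECONDS)
kdb_backup_rate = throughput(KDB_BACKUP_BYTES, KDB_BACKUP_SECONDS)
kdb_estimate = naive_restore_estimate(KDB_BACKUP_BYTES, tpcc_rate)
slowdown_lower_bound = OBSERVED_KDB_RESTORE_SECONDS / kdb_estimate

print(f"TPCC restore rate:           {tpcc_rate / 1e6:.1f} MB/s")
print(f"KDB backup rate:             {kdb_backup_rate / 1e6:.1f} MB/s")
print(f"Naive KDB restore estimate:  {kdb_estimate:.0f} s")
print(f"Observed / estimate (min):   {slowdown_lower_bound:.0f}x")
```

Under this linear model the KDB restore would take on the order of two minutes, several hundred times less than what was observed.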

FeiYangtze commented 2 years ago

debug.zip