cockroachdb / cockroach

Logical database is `offline: restoring` even after waiting for restore job to complete #125788

Open ecordell opened 3 months ago

ecordell commented 3 months ago

Describe the problem

For a few seconds after a (detached) restore job completes, the database being restored won't accept any queries, returning errors like:

ERROR: database "mydbname" is offline: restoring (SQLSTATE XXUUU)

My expectation was that if the restore job reports success, the database would be available for querying.

To Reproduce

I run a job in Kubernetes that performs the following steps to take a snapshot of one database and restore it into another (a Go sketch of the same sequence follows the list):

  1. BACKUP DATABASE sourcedb INTO 's3://...' AS OF SYSTEM TIME '-10s' WITH DETACHED;
  2. SHOW JOB WHEN COMPLETE (WITH x as (SHOW JOBS) SELECT job_id FROM x WHERE job_type = 'BACKUP' ORDER BY created DESC LIMIT 1);
  3. RESTORE DATABASE sourcedb FROM LATEST IN 's3://...' WITH new_db_name = 'targetdb-new', DETACHED;
  4. SHOW JOB WHEN COMPLETE (WITH x as (SHOW JOBS) SELECT job_id FROM x WHERE job_type = 'RESTORE' ORDER BY created DESC LIMIT 1);
    • This blocks before returning and shows the job in the succeeded state: 978361979530477569,RESTORE,"RESTORE DATABASE targetdb FROM 's3://... WITH OPTIONS (detached, new_db_name = 'targetdb-new')",,...,succeeded,NULL,2024-06-17 16:46:21.075433+00,2024-06-17 16:46:45.504433+00,2024-06-17 16:46:46.789028+00,2024-06-17 16:46:46.789071+00,1,,1,3775933280525936069,2024-06-17 16:46:45.504433+00,2024-06-17 16:47:15.504433+00,1,NULL
  5. ALTER DATABASE targetdb RENAME TO "targetdb-old";
  6. ALTER DATABASE "targetdb-new" RENAME TO targetdb;
  7. DROP DATABASE "targetdb-old" CASCADE;
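
For concreteness, here is a Go sketch of the same sequence (as referenced above), assuming CockroachDB's pgx stdlib driver; the DSN and bucket URI are placeholders and the SQL is copied from the steps:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
    )

    func main() {
        db, err := sql.Open("pgx", "postgresql://user@host:26257/defaultdb")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Run each statement in order; the SHOW JOB WHEN COMPLETE statements
        // block until the corresponding detached job reaches a terminal state.
        steps := []string{
            `BACKUP DATABASE sourcedb INTO 's3://...' AS OF SYSTEM TIME '-10s' WITH DETACHED`,
            `SHOW JOB WHEN COMPLETE (WITH x as (SHOW JOBS) SELECT job_id FROM x WHERE job_type = 'BACKUP' ORDER BY created DESC LIMIT 1)`,
            `RESTORE DATABASE sourcedb FROM LATEST IN 's3://...' WITH new_db_name = 'targetdb-new', DETACHED`,
            `SHOW JOB WHEN COMPLETE (WITH x as (SHOW JOBS) SELECT job_id FROM x WHERE job_type = 'RESTORE' ORDER BY created DESC LIMIT 1)`,
            `ALTER DATABASE targetdb RENAME TO "targetdb-old"`,
            `ALTER DATABASE "targetdb-new" RENAME TO targetdb`,
            `DROP DATABASE "targetdb-old" CASCADE`,
        }
        for _, stmt := range steps {
            if _, err := db.Exec(stmt); err != nil {
                log.Fatalf("%s: %v", stmt, err)
            }
        }
    }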

Separately, I have an application running continuously against targetdb with a low level of traffic. I can reproducibly see a few seconds of errors that look like this:

ERROR: database "targetdb" is offline: restoring (SQLSTATE XXUUU)
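
That application's query loop is roughly the following - a minimal sketch using pgx, with the connection string and table name as placeholders:

    package main

    import (
        "context"
        "errors"
        "log"
        "time"

        "github.com/jackc/pgx/v5"
        "github.com/jackc/pgx/v5/pgconn"
    )

    func main() {
        ctx := context.Background()
        conn, err := pgx.Connect(ctx, "postgresql://user@host:26257/targetdb")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close(ctx)

        for {
            var n int
            err := conn.QueryRow(ctx, `SELECT count(*) FROM mytable`).Scan(&n)
            var pgErr *pgconn.PgError
            if errors.As(err, &pgErr) {
                // During the post-restore window this logs:
                //   XXUUU: database "targetdb" is offline: restoring
                log.Printf("%s: %s", pgErr.Code, pgErr.Message)
            } else if err != nil {
                log.Printf("query error: %v", err)
            }
            time.Sleep(100 * time.Millisecond)
        }
    }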

Expected behavior

I expected this operation to make the database unavailable for a bit, but I didn't expect errors saying the restore was still in progress: since I block on restore job completion with SHOW JOB WHEN COMPLETE, there should be no window in which queries hit the database while it is still being restored. Could this be a bug in how the succeeded status of a restore job is reported?

Environment:

Additional Info

As an aside, I would also love a one-liner that both creates a detached backup/restore job and waits for its completion (I want to wait in the normal case, but I want the job to keep running in the background if the frontend dies for some reason).
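
One approach that comes close (a sketch, not a documented one-liner): a DETACHED statement returns its job ID as a result row, which can then be passed to SHOW JOB WHEN COMPLETE; if the client dies mid-wait, the job keeps running and the wait can be retried by ID. Driver choice, DSN, and bucket URI below are placeholders:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/jackc/pgx/v5/stdlib"
    )

    // backupAndWait starts a detached backup and blocks until the job finishes.
    // The job survives a client disconnect; only the wait would need retrying.
    func backupAndWait(db *sql.DB) (int64, error) {
        var jobID int64
        // A detached BACKUP returns its job ID as a single result row.
        if err := db.QueryRow(
            `BACKUP DATABASE sourcedb INTO 's3://...' WITH DETACHED`,
        ).Scan(&jobID); err != nil {
            return 0, err
        }
        // jobID came from the server, so inlining it is safe.
        _, err := db.Exec(fmt.Sprintf(`SHOW JOB WHEN COMPLETE %d`, jobID))
        return jobID, err
    }

    func main() {
        db, err := sql.Open("pgx", "postgresql://user@host:26257/defaultdb")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        if jobID, err := backupAndWait(db); err != nil {
            log.Fatalf("backup job %d: %v", jobID, err)
        }
    }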

Jira issue: CRDB-39610

blathers-crl[bot] commented 3 months ago

Hi @ecordell, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 3 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 3 months ago

cc @cockroachdb/disaster-recovery

msbutler commented 3 months ago

@ecordell thanks for filing this report. I'm unable to reproduce your bug in this unit test I modified. Could you verify that my test accurately reflects your repro steps? Also, could you verify that the queries are failing after the restore completes?

ecordell commented 3 months ago

Thanks for looking into it @msbutler!

I think the main difference in the test case you shared is the lack of parallelism.

I think the equivalent would be running this part of the test in one goroutine:

    sqlDB.Exec(t, "RESTORE DATABASE fkdb FROM $1 WITH new_db_name = 'new_fkdb', detached", localFoo)
    sqlDB.Exec(t, `SHOW JOB WHEN COMPLETE (WITH x as (SHOW JOBS) SELECT job_id FROM x WHERE job_type = 'RESTORE' ORDER BY created DESC LIMIT 1)`)
    sqlDB.Exec(t, `ALTER DATABASE fkdb RENAME TO "old_fkdb"`)
    sqlDB.Exec(t, `ALTER DATABASE "new_fkdb" RENAME TO fkdb`)

But then having another goroutine that's running against the target database continuously:

    sqlDB.Exec(t, `SELECT * FROM fkdb.fk`) // e.g. any table in fkdb

Since we're waiting for the restore job to complete before renaming anything, I wouldn't expect to ever see the restoring state in the error message (I understand that some of the parallel selects may fail, but that should be related to the ALTER statements, not the restore).
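
Put together, the shape I have in mind is roughly the following (the fkdb.fk table name and the raw *sql.DB for the reader are assumptions on my part; the reader avoids the SQLRunner because t.Fatal must not be called from a non-test goroutine):

    // done stops the reader goroutine once the rename dance is finished.
    done := make(chan struct{})
    go func() {
        for {
            select {
            case <-done:
                return
            default:
            }
            // Some selects may fail around the ALTER ... RENAME steps, but none
            // should say "offline: restoring" once the restore job has succeeded.
            if _, err := rawDB.Exec(`SELECT * FROM fkdb.fk`); err != nil {
                t.Logf("select failed: %v", err)
            }
        }
    }()

    sqlDB.Exec(t, `RESTORE DATABASE fkdb FROM $1 WITH new_db_name = 'new_fkdb', detached`, localFoo)
    sqlDB.Exec(t, `SHOW JOB WHEN COMPLETE (WITH x as (SHOW JOBS) SELECT job_id FROM x WHERE job_type = 'RESTORE' ORDER BY created DESC LIMIT 1)`)
    sqlDB.Exec(t, `ALTER DATABASE fkdb RENAME TO "old_fkdb"`)
    sqlDB.Exec(t, `ALTER DATABASE "new_fkdb" RENAME TO fkdb`)
    close(done)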

msbutler commented 3 months ago

@ecordell thanks for the quick response! I begin a goroutine that continuously selects before and during the database restore here, so I think I'm capturing the parallelism you're referring to.

I'm going to try reproing this in an actual production cluster instead of a unit test. I might not get to this until early next week.

ecordell commented 3 months ago

Ah, missed that bit - sorry! Yes, that looks similar to what I'm doing. I was able to consistently repro with serverless though.

msbutler commented 3 months ago

@ecordell I likely will not get to this until later in the week. In the meantime, could you attempt to repro the bug on a dedicated cluster if you have one available?

ecordell commented 3 months ago

@msbutler I was able to run this again on a dedicated cluster.

With the same steps as above, but running between two dedicated clusters, I still see errors like this for a couple of seconds:

ERROR: database "targetdb-new" is offline: restoring (SQLSTATE XXUUU)

Note that the application never makes a request to targetdb-new - only ever to targetdb - so it's as if the query is hitting the DB under its old name while it is still in a restoring state.

For completeness, here are all of the differences for this second test:

Otherwise, all of the steps were the same as in the initial description.

Also just to be clear: I 100% expect some downtime with this operation (and I'm happy with the ~5s that the DB is unavailable when I do this). The thing that seems odd is the reporting of the state; I wasn't sure if this indicated some deeper consistency issue around backup/restore that would be worth investigating.

stevendanna commented 3 months ago

@ecordell Just as an update, I'll be trying to reproduce locally to see if a second set of eyes helps.

> I wasn't sure if this indicated some deeper consistency issue around backup/restore that would be worth investigating.

Just to confirm: We believe this is worth investigating.