After Full Restore - warning: error retrieving the database name: pq: database "defaultdb" does not exist

wzrdtales commented 3 years ago

Describe the problem

After a full backup from one cluster and a full restore to a fresh new cluster. The tables get restored, but a lot of calls are broken. Such as show database will produce

ERROR: referenced database ID 51: descriptor is a *tabledesc.immutable: unexpected descriptor type
warning: error retrieving the database name: pq: database "defaultdb" does not exist

I have to suspect that feature is somehow broken? I am still investigating...

Ok looked further into it, this seems most likely a bug, or straightforward not working at all. Tried multiple backups on a freshly initialized cockroachdb instance. So full backup with full restore, doesn't work as advertised.

Environment:

CockroachDB version 21.1.2

gz#8952

Epic CRDB-8816

Jira issue: CRDB-8165

blathers-crl[bot] commented 3 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

@cockroachdb/bulk-io (found keywords: backup,restore)

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try find someone from here if you know they worked closely on the area and CC them.

_{:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

wzrdtales commented 3 years ago

just had this another time, different database, same scenario, just full backup and full restore. backup restore seems to be fully fully broken...

pbardea commented 3 years ago

Hi @wzrdtales -- thanks for filing this issue -- sorry for the late reply here!

I spent some time trying to reproduce the error message that you're seeing without any success. Based on the error message, there looks to be a bug around bad interaction with the 2 default databases initialized in the cluster (defaultdb and postgres, which have IDs 50 and 51 respectively) and performing a cluster restore.

A few questions:

Could you confirm the command that you're using to perform the backup and restore? Backups of the entire cluster can be restored either in full (RESTORE FROM 's3://my/backup/location') or by database (RESTORE my_database FROM 's3://my_backup_location'). It sounds like you're using the former, but I wanted to check if restoring individual databases also runs into this issue.
Were any changes performed to the 2 default databases initialized on either the cluster (defaultdb and postgres) that is taking the backups or performing the restores? Events like dropping these databases, or renaming them would be useful to note.
Are both the backing up and restoring cluster running 21.1.2?

If you have a particular backup that can reproduce this issue that would also help significantly get to a root cause. You can send data securely by opening a ticket at https://support.cockroachlabs.com/hc/en-us. Please mention my username so that it can get routed to me. If you would prefer not including any of the table data, feel free to delete the .sst files before uploading since I am mainly interested in the manifest files included in the backups. Otherwise, if you are able to easily reproduce this with a set of SQL statements that would also be very useful.

Thank you!

wzrdtales commented 3 years ago

hey @pbardea

I did a full restore and did fall back after that failed to single db restores to not have broken database which is missing the defaultdb which is impossible to recreate afterwards. So yes, the former FULL is broken.
no no such things as touching those dbs at all. they did exist on the former dbs. the only thing both dbs shared, I had to alter old tables that made use of interleaving before taking the backup to be able to take the backup.
they were both at the latest version at their time. so the last one that i tried a restore was at 21.1.3 the one before at 21.1.2.

I do have a backup available that is broken, but I will have to make sure that no data is included. So I will have to check once if deleting the .sst files is enough to accomplish that.

igremmerlb commented 3 years ago

We are experiencing an issue that may or may not be related. We can run BACKUP, but then RESTORE errors with: importing 20 ranges: pebble/table: invalid table (file size is too small)

This is on a new database with maybe 1-2 tables and less than 1MB total.

This works fine on the 20.2 series, but is failing in 21.1. Not sure if something broke. I notice in the 21.1 series that HTTP Storage is not supported anymore. Is this support being permanently removed?

wzrdtales commented 3 years ago

@pbardea not sure it reached you, so additionally here https://support.cockroachlabs.com/hc/en-us/requests/8952

wzrdtales commented 3 years ago

Follow up @pbardea

adityamaru commented 3 years ago

@wzrdtales apologies for the delay, I'll be taking this over from Paul. I'll update the issue with an update soon!

adityamaru commented 3 years ago

@wzrdtales would you be able to run the following command against the cluster that you took the backup from if it is still running, and paste the output here?

./cockroach debug doctor examine cluster --insecure --url <postgres://...>

wzrdtales commented 3 years ago

No sorry you're bit late to the party. This cluster is long dead and gone. Only the new corrupted one exists with that failure. I actually send over one of those backups (without data) to you guys over your support channel.

wzrdtales commented 3 years ago

But anyways. I had this problem on three different clusters, so you can just spawn any cluster you want with an older version than the current minor release and reproduce it.

wzrdtales commented 3 years ago

so basically I never had any cluster that did not produce a backup that leaves a cluster restored from it in this corrupted state, and all of those are very different environments. Only similarity: They backed up to a S3 protocol driven store.

adityamaru commented 3 years ago

I inspected the backup yesterday and it has some abnormalities which is why I was hoping to run the doctor against the backing up cluster. Most notably it seems to have user databases in two IDs that are reserved for non-user databases which is throwing the restore off.

We have two databases defaultdb and postgres that are created on node startup. We expect these to occupy static IDs 50 and 51in a cluster. The backup suggests that the cluster had other user DBs at these IDs which is very unexpected. It seems like defaultdb and postgres were dropped on the backing up cluster and then recreated too. Were there any commands to alter descriptors in system.descriptors that were run against the backing up cluster?

wzrdtales commented 3 years ago

no never. and those dbs did existed on the old cluster. When I used the restore I did when moving the cluster from one kubernetes cluster to a new one. Only the new cluster had this phenomenon.

wzrdtales commented 3 years ago

and as said multiple clusters, all behave the same. (have all been full backups and full restores)

adityamaru commented 3 years ago

To give you some more context about what I am seeing, it is not the cluster being restored into that is the issue, but is in fact the backup file we are restoring from that appears corrupted. So if you were to perform a cluster restore into any new cluster, from the backup which you gave us, we expect it to end up in a corrupted state. Some of the descriptors in the backup files are breaking invariants we rely on in CRDB.

Now for some workarounds:

Would you be willing to spawn a new cluster, and perform a database restore of all the user databases in this backup? I can provide you with a list of all the database names from the provided backup via a private channel. Once we get all the databases and tables into the new cluster, we can take a fresh, full cluster backup that should be restorable to any new cluster in the future.

wzrdtales commented 3 years ago

with user you mean the user specific ones? in general sure can do that. but I don't think this will help to avoid this problem in the future. Since I was able to reproduce it so easily in many very different environments, there seems to be something wrong, or at least was wrong in some of the earlier versions.

adityamaru commented 3 years ago

Was the reproduction with restoring the same BACKUP file as you provided us, or did you take a fresh cluster backup on the new cluster, and try to restore that? We test our backup and restore functionality quite extensively so it is surprising if you are seeing this behavior in a new cluster with a fresh backup.

adityamaru commented 1 year ago

Closing since the issue is quite old, please feel free to reopen if this is still an issue on later versions of cockroach.

wzrdtales commented 1 year ago

reopen. this was never resolved. and everytime we restore a db we still see lots of problems. Probably the whole restore is bunked and not really battle tested yet...

wzrdtales commented 1 year ago

to extend on this comment:

i.e. permissions are fluked always (no matter if full restore, or just a db restore). when a db gets restored functions like grant all do not work on this new restored database without explicitly targeting schemas as well.

adityamaru commented 1 year ago

Thanks for pinging the issue, I'm happy to take another stab at this. I realize this has been open for a while but can I ask for:

Cockroach Version
Reproductions steps
A debug zip if you have a cluster that is in an unexpected condition - https://www.cockroachlabs.com/docs/stable/cockroach-debug-zip.html
The backup manifest if you have an unrestoreable backup.

This can all be shared via https://support.cockroachlabs.com/hc/en-us.

Reading the back scroll it looks like there were a few issues you were running into so I'm hoping we can work through them one at a time.

cockroachdb / cockroach

After Full Restore - warning: error retrieving the database name: pq: database "defaultdb" does not exist #66655