Reverting schema change job result in cluster inconsistent state

colprog commented 2 years ago

Describe the problem

Cluster in a inconsistent state with no clear way to recover, backup/restore commands reports descriptor missing:

ERROR: failed to resolve targets specified in the BACKUP stmt: relation "Users" (77): invalid foreign key backreference: missing table=123: referenced table ID 123: descriptor not found

tried dropping the database, also fails:

ERROR: internal error: relation "Users" (77): invalid foreign key backreference: missing table=123: referenced table ID 123: descriptor not found SQLSTATE: XX000 DETAIL: stack trace: /go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/errors.go:84: init() /usr/local/go/src/runtime/proc.go:6309: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:6286: doInit() /usr/local/go/src/runtime/proc.go:208: main() /usr/local/go/src/runtime/asm_amd64.s:1371: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is already tracked. If you cannot find it there, please report the error with details by creating a new issue.

If you would rather not post publicly, please contact us directly using the support form.

We appreciate your feedback.

Now we're stuck with malfunctioning cluster, new sql connection would fail since type discovery related queries like SELECT pg_type.oid, enumlabel FROM pg_enum JOIN pg_type ON pg_type.oid=enumtypid; would also fail

To Reproduce

apply schema change, deleted a column that's in a partial index in database A
cancel the job after a few hours of waiting
reverting takes another a few hours
drop the aforementioned partial index
24 hour after step 1, the cluster started reporting issue

Expected behavior A way to recover from this state. A way to forcefully remove this database seems good enough

Additional data / screenshots If the problem is SQL-related, include a copy of the SQL query and the schema of the supporting tables.

If a node in your cluster encountered a fatal error, supply the contents of the log directories (at minimum of the affected node(s), but preferably all nodes).

Note that log files can contain confidential information. Please continue creating this issue, but contact support@cockroachlabs.com to submit the log files in private.

If applicable, add screenshots to help explain your problem.

Environment:

CockroachDB 21.2.3
Server OS: Linux
Client app: C# with dotnet ef

Additional context What was the impact?

Add any other context about the problem here.

Jira issue: CRDB-12033

blathers-crl[bot] commented 2 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

@cockroachdb/bulk-io (found keywords: backup,restore)
@cockroachdb/sql-experience (found keywords: pg_)
@cockroachdb/sql-schema (found keywords: schema change)

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try find someone from here if you know they worked closely on the area and CC them.

_{:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/bulk-io

postamar commented 2 years ago

This misbehaviour, unfortunately, is caused by known limitations of the legacy schema changer. Reverting schema change jobs can leave the table descriptors in invalid states, effectively making the table inaccessible. That being said, I was surprised to find out that it's not possible to drop the table. This is most definitely not expected. We're looking into this.

In the long term, we're addressing this limitation by doing a massive overhaul of how schema changes are implemented. This effort is already underway but it will take a while before it bears fruit.

ajwerner commented 2 years ago

This is loosely related to https://github.com/cockroachdb/cockroach/issues/50651.

To do this we'd need some way to tell the resolution in the drop case to allow some invalid outgoing references. This is hard. For now, we'll say that we need to repair the graph.

In some cases there's a desire to drop a whole database which might be corrupted. In that case, once there are no cross-database references, we can just destroy all the descriptors and data without thinking too hard.

This will depend on cross-database reference removal. CC @postamar

ajwerner commented 2 years ago

The assignment here is to remember this discussion and the cross database reference work.

cockroachdb / cockroach

Reverting schema change job result in cluster inconsistent state #74324