cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.15k stars 3.81k forks source link

sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

Closed knz closed 7 years ago

knz commented 8 years ago

Steps:

  1. create fresh cluster rm -rf cockroach-data; ./cockroach start --background
  2. load the example db: ./cockroach gen example-data startrek | ./cockroach sql
  3. confirm that the example db can be reloaded multiple times (can DROP then CREATE multiple times): repeat step 2 a few times.
  4. stop the server with killall cockroach - wait for the server to finish shutting down
  5. restart the server ./cockroach start --background
  6. attempt step 2 again; observe: the server hangs after SET, upon encountering the first DROP
  7. interrupt the client with Ctrl+C
  8. attempt step 2 again; observe: the first DROP fails with an error, then the client hangs (error: pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary)
  9. interrupt the client again with Ctrl+C
  10. attempt step 2 again; observe: major confusion
CREATE DATABASE
SET
pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary
pq: table "episodes" has been deleted
pq: relation "episodes" already exists
pq: table is being deleted
pq: relation "quotes" already exists
pq: table is being deleted
Error: pq: table is being deleted
Failed running "sql"

From that point forward the db cannot be used but also cannot be deleted.

knz commented 8 years ago

cc @andreimatei @dt (the issue does not occur if the table being dropped does not contain FKs)

knz commented 8 years ago

NB: after a while (a few minutes), the error disappears.

knz commented 8 years ago

After discussing with Andrei:

In any case, once the DROP has been initiated, we should care that a subsequent CREATE TABLE with the same name succeeds without error. Right now the user would see "table has been deleted" upon DROP, and then deduce that it's safe to create, but then CREATE fails with "table already exists.".

This is a catastrophic UX failure.

Related: #8497, #9318.

knz commented 8 years ago

@dt it would be great if you could expound on the explanation above to clarify where the FK message comes from? pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary this happens during a 2nd DROP.

andreimatei commented 8 years ago

I'll work on improving sql draining to release those leases... I'm real scared.

vivekmenezes commented 8 years ago

@andreimatei are you really working on releasing those leases. I don't think we should do that for 1.0

knz commented 8 years ago

Vivek I'm not sure whether releasing the leases are the best way to address the issue at hand but the UX problem initially reported here must absolutely be fixed before 1.0 IMO.

andreimatei commented 8 years ago

(I've replied this by email 2 days ago, but for some reason it didn't make it in the issue... weird...)

I'm not working on it; I gave up after investigating draining a bit and getting in the weeds and realizing it's not very easy and I started bringing it up with Peter and others for getting someone to work on it :) But I think we definitely need to do something about it soon. It has come up multiple times from external people in the context of not being able to delete tables.

petermattis commented 8 years ago

Draining table leases when a server is stopped doesn't sound too bad, but then I haven't looked at these weeds recently. Alternately, can we clear any previously held table leases when a server restarts?

Re: double DROP of a table. All of the proposed UX options sound reasonable. This sounds like the easiest short term fix:

the error message could be clarified, e.g. "cannot DROP this table, another DROP is already in progress"

andreimatei commented 8 years ago

I'm not working on it; I gave up after investigating draining a bit and realizing it's not very easy and I started bringing it up with Peter and others for getting someone to work on it :) But I think we definitely need to do something about it soon. It has come up multiple times from external people in the context of not being able to delete tables.

On Fri, Oct 21, 2016 at 5:36 PM, vivekmenezes notifications@github.com wrote:

@andreimatei https://github.com/andreimatei are you really working on releasing those leases. I don't think we should do that for 1.0

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/9493#issuecomment-255471250, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXBcYlUUwI-IgcT0xqJHO606ESecugVks5q2TBogaJpZM4KDYRw .

vivekmenezes commented 8 years ago

With all our schema changes being chunked now we certainly can reduce the timeout to perhaps 1 minute. That along with better error messages should suffice for 1.0

On Sun, Oct 23, 2016, 2:44 PM Andrei Matei notifications@github.com wrote:

(I've replied this by email 2 days ago, but for some reason it didn't make it in the issue... weird...)

I'm not working on it; I gave up after investigating draining a bit and getting in the weeds and realizing it's not very easy and I started bringing it up with Peter and others for getting someone to work on it :) But I think we definitely need to do something about it soon. It has come up multiple times from external people in the context of not being able to delete tables.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/9493#issuecomment-255606122, or mute the thread https://github.com/notifications/unsubscribe-auth/ALOpBJAtYiNcPJbdiUxBUQmPGPQyA8nZks5q26sQgaJpZM4KDYRw .

vivekmenezes commented 8 years ago

@knz do let me know what you'd like to see besides what I've done in https://github.com/cockroachdb/cockroach/pull/10063

On Sat, Oct 22, 2016 at 4:41 AM kena notifications@github.com wrote:

Vivek I'm not sure whether releasing the leases are the best way to address the issue at hand but the UX problem initially reported here must absolutely be fixed before 1.0 IMO.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/9493#issuecomment-255515524, or mute the thread https://github.com/notifications/unsubscribe-auth/ALOpBGwT8nSUeEpTJXMPbvIU2WFoOOEiks5q2cw-gaJpZM4KDYRw .

dt commented 7 years ago

I think we want an early return in Validate, before checking the cross-table references, if the table is being dropped (as it is then expected that back-references do not exist).