cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

IMPORT: failure due to gc threshold subsequently fails to rollback #122351

Open dt opened 2 months ago

dt commented 2 months ago

Observed on drt-ua2, running A0AF5664: the initial import failed after several days, and now appears stuck and unable to revert:

addsstable [/Tenant/3/Table/114/1/127876/7/-3001/1/0,/Tenant/3/Table/114/1/127877/5/-2150/11/0/NULL): batch timestamp 1713131887.795927192,0 must be after replica GC threshold 1713138461.824411737,0 | 7 | 591956281271723629 | {"reverting execution from '2024-04-15 04:31:34.478337' to '2024-04-15 04:31:38.13894' on 7 failed: rolling back IMPORT INTO in empty table via DeleteRange: delete range /Tenant/3/Table/114/1/190836/1/-2999/1 - /Tenant/3/Table/114/1/216277/8/-2823/6: replica unavailable: (n2,s2):5 unable to serve request to r76626:/Tenant/3/Table/114/1/{190836/1/-2999/1-214285/8/-3001/1} [(n2,s2):5, (n1,s1):2, (n3,s3):3VOTER_DEMOTING_LEARNER, (n5,s5):6VOTER_INCOMING, next=7, gen=17148, sticky=1713076548.036832953,0]: closed timestamp: 1713076086.298572324,0 (2024-04-14 06:28:06); raft status: {\"id\":\"5\",\"term\":7,\"vote\":\"5\",\"commit\":39,\"lead\":\"5\",\"raftState\":\"StateLeader\",\"applied\":39,\"progress\":{\"3\":{\"match\":66425,\"next\":66426,\"state\":\"StateReplicate\"},\"5\":{\"match\":66425,\"next\":66426,\"state\":\"StateReplicate\"},\"6\":{\"match\":0,\"next\":37,\"state\":\"StateSnapshot\"},\"2\":{\"match\":0,\"next\":37,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: encountered poisoned latch /M{in-ax}@0,0","reverting execution from '2024-04-15 05:03:15.042559' to '2024-04-15 05:03:18.809365' on 2 failed: rolling back IMPORT INTO in empty table via DeleteRange: delete range /Tenant/3/Table/114/1/190836/1/-2999/1 - /Tenant/3/Table/114/1/216277/8/-2823/6: replica unavailable: (n2,s2):5 unable to serve request to r76626:/Tenant/3/Table/114/1/{190836/1/-2999/1-214285/8/-3001/1} [(n2,s2):5, (n1,s1):2, (n3,s3):3VOTER_DEMOTING_LEARNER, (n5,s5):6VOTER_INCOMING, next=7, gen=17148, sticky=1713076548.036832953,0]: closed timestamp: 1713076086.298572324,0 (2024-04-14 06:28:06); raft status: 
{\"id\":\"5\",\"term\":7,\"vote\":\"5\",\"commit\":39,\"lead\":\"5\",\"raftState\":\"StateLeader\",\"applied\":39,\"progress\":{\"2\":{\"match\":0,\"next\":37,\"state\":\"StateSnapshot\"},\"3\":{\"match\":66425,\"next\":66426,\"state\":\"StateReplicate\"},\"5\":{\"match\":66425,\"next\":66426,\"state\":\"StateReplicate\"},\"6\":{\"match\":0,\"next\":37,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: encountered poisoned latch /M{in-ax}@0,0","reverting execution from '2024-04-15 06:07:14.455743' to '2024-04-15 06:07:20.81839' on 3 failed: rolling back IMPORT INTO in empty table via DeleteRange: delete range /Tenant/3/Table/114/1/189154/4/-3000/1 - /Tenant/3/Table/114/1/214594/5/-1359/5: replica unavailable: (n2,s2):5 unable to serve request to r76626:/Tenant/3/Table/114/1/{190836/1/-2999/1-214285/8/-3001/1} [(n2,s2):5, (n1,s1):2, (n3,s3):3VOTER_DEMOTING_LEARNER, (n5,s5):6VOTER_INCOMING, next=7, gen=17148, sticky=1713076548.036832953,0]: closed timestamp: 1713076086.298572324,0 (2024-04-14 06:28:06); raft status: {\"id\":\"5\",\"term\":7,\"vote\":\"5\",\"commit\":39,\"lead\":\"5\",\"raftState\":\"StateLeader\",\"applied\":39,\"progress\":{\"5\":{\"match\":66425,\"next\":66426,\"state\":\"StateReplicate\"},\"6\":{\"match\":0,\"next\":37,\"state\":\"StateSnapshot\"},\"2\":{\"match\":0,\"next\":37,\"state\":\"StateSnapshot\"},\"3\":{\"match\":66425,\"next\":66426,\"state\":\"StateReplicate\"}},\"leadtransferee\":\"0\"}: encountered poisoned latch /M{in-ax}@0,0"}

The DB console reports no unavailable ranges.
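
As a quick sanity check on the first `addsstable` error above, the two timestamps it quotes can be compared directly. This is only a sketch: it assumes the `seconds.nanos,logical` layout seen in the log, and the helper names (`parse_hlc`, `to_utc`) are made up for illustration rather than taken from the CockroachDB codebase.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def parse_hlc(ts: str) -> tuple[int, int]:
    """Parse 'seconds.nanos,logical' (as printed in the log) into (wall_nanos, logical)."""
    wall, logical = ts.split(",")
    secs, nanos = wall.split(".")
    # Assumption: the fractional part is nanoseconds, so pad it to 9 digits.
    return int(secs) * NANOS_PER_SECOND + int(nanos.ljust(9, "0")), int(logical)


def to_utc(wall_nanos: int) -> str:
    """Render the wall-clock part as a UTC ISO timestamp (second precision)."""
    secs = wall_nanos // NANOS_PER_SECOND
    return datetime.fromtimestamp(secs, tz=timezone.utc).isoformat()


# Values copied from the addsstable error in the log.
batch = parse_hlc("1713131887.795927192,0")
threshold = parse_hlc("1713138461.824411737,0")

# Tuples compare wall time first, then the logical counter.
is_after_threshold = batch > threshold
lag_nanos = threshold[0] - batch[0]

print("batch after threshold:", is_after_threshold)
print("batch timestamp:", to_utc(batch[0]))
print("GC threshold:   ", to_utc(threshold[0]))
print("batch is behind by", lag_nanos / NANOS_PER_SECOND, "seconds")
```

With these inputs the batch timestamp is behind the GC threshold by roughly 1h50m, which matches the error text.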

Jira issue: CRDB-37824

rytaft commented 2 months ago

@yuzefovich I didn't get a chance to look at this during my on-call rotation. If you get a chance, could you please take a look? Thank you!

michae2 commented 1 month ago

(quoting @yuzefovich during triage)

We think a way to reproduce this is to:

  1. Set a short GC TTL on the range.
  2. Make the import take longer than the TTL.
  3. Cause the import to fail; it seems like that might hit this assertion.
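
The timing condition behind these steps can be sketched as a toy model. It assumes the GC threshold trails the current time by exactly the TTL and that the import keeps writing at its original timestamp. Both are simplifications for illustration, and `write_rejected` is a hypothetical helper, not a CockroachDB function.

```python
def write_rejected(write_ts: float, now: float, gc_ttl: float) -> bool:
    """Return True if a write at write_ts would fall at or below the GC threshold.

    Assumption: the threshold is simply now - gc_ttl. In a real cluster it only
    advances when GC actually runs on the range.
    """
    gc_threshold = now - gc_ttl
    # The error says the batch timestamp "must be after" the threshold,
    # so a timestamp equal to the threshold is rejected as well.
    return write_ts <= gc_threshold


# Hypothetical numbers: a 60s TTL and an import writing at timestamp 0.
print(write_rejected(write_ts=0, now=30, gc_ttl=60))   # import still within the TTL
print(write_rejected(write_ts=0, now=120, gc_ttl=60))  # import has outlived the TTL
```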

DrewKimball commented 1 month ago

@dt does it seem possible that this issue is caused by #91151?