cockroachdb / cockroach


kv: index backfill never completes with ReplAC apply_to_all #135343

Open andrewbaptist opened 2 hours ago

andrewbaptist commented 2 hours ago

Describe the problem

While running the backfill test, the backfill hung and nodes became stuck waiting for a snapshot.

To Reproduce

Run the following test:

PERTURBATION_OVERRIDE=acMode=fullBoth roachtest run perturbation/full/backfill

The test in this PR also reproduces the issue: https://github.com/cockroachdb/cockroach/pull/135339

Additional data / screenshots

The error in the logs is:

E241115 22:00:40.609821 23415709 kv/kvserver/queue.go:1198 ⋮ [T1,Vsystem,n3,raftsnapshot,s6,r7801/4:‹/Table/109/1/-781{9715…-7870…}›] 535505  error sending couldn't accept ‹range_id:7801 coordinator_replica:<node_id:3 store_id:6 replica_id:4 type:VOTER_FULL > recipient_replica:<node_id:12 store_id:24 replica_id:1 type:VOTER_FULL > delegated_sender:<node_id:3 store_id:6 replica_id:4 type:VOTER_FULL > term:7 first_index:11993 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:95 queue_on_delegate_len:-1 snap_id:9e4c2549-8a9c-4d99-8d92-99594f668bd8 ›: (n12,s24):1: remote couldn't accept snapshot 9e4c2549 at applied index 11993: ‹snapshot intersects existing range; initiated GC:› [n12,s24,r7924/4:‹/Table/109/1/-78{2340…-1418…}›] (incoming ‹/Table/109/1/-781{9715178531312532-7870688572937416}›)

This error repeats at a high rate (~100/s).
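
For readers unfamiliar with the "snapshot intersects existing range" rejection: the message indicates that the key span of the incoming snapshot (for r7801) intersects the span of a replica of another range (r7924) that still exists on the receiving store. Here is a minimal sketch of that kind of overlap check, assuming half-open key spans and using small integers in place of real keys. The `Span` type and `overlaps` helper are hypothetical illustrations, not CockroachDB code, and the values are not taken from the log.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Half-open key span [start, end). Integers stand in for real keys."""
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two half-open spans share at least one key."""
    return a.start < b.end and b.start < a.end


# Illustrative values only; these are not the (truncated) keys from the log.
existing = Span(10, 50)  # stands in for the replica already on the store
incoming = Span(20, 30)  # stands in for the span of the incoming snapshot

print(overlaps(existing, incoming))          # True: spans intersect
print(overlaps(Span(10, 20), Span(20, 30)))  # False: adjacent spans do not intersect
```

Under this model, the snapshot keeps being refused for as long as the overlapping replica remains on the store, which is consistent with the error repeating.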

Cluster link

Jira issue: CRDB-44457

blathers-crl[bot] commented 2 hours ago

Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.