cockroachdb / cockroach


roachtest: restore/online/workload=true/tpcc/8TB/gce/inc-count=0/nodes=10/cpus=8 failed [job lease transfer returned error to user] #127058

Open cockroach-teamcity opened 1 month ago

cockroach-teamcity commented 1 month ago

roachtest.restore/online/workload=true/tpcc/8TB/gce/inc-count=0/nodes=10/cpus=8 failed with artifacts on master @ 3afb6935d0ef3de7c0d44cfb3cd54f312752c186:

(monitor.go:154).Wait: monitor failure: pq: job 985379219784892417: could not mark as reverting: job 985379219784892417: with status "running": expected session "01018090ad60e2559c41e5bdc46e9cc9d64c6a" but found "010180aff75cff8ad04b25a2d50c96edf1bd8c"
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/8TB/gce/inc-count=0/nodes=10/cpus=8/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-40260

msbutler commented 1 month ago

The online restore job started at 11:37 on node 1 and took 2 minutes to issue all of the splits and scatters; after that, the virtual SST ingestion was doomed from the start. It seems node 1, at least, may have been dealing with some liveness issues:

I240712 11:39:53.763644 14810 ccl/backupccl/restore_online.go:248 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2273  ingesting remote files
E240712 11:39:55.726767 331 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n1,merge,s1,r62/1:‹/Table/6{0-1}›] 2274  needs lease, not adding: repl=(n4,s4):2 seq=2 start=1720784271.662927639,0 epo=1 min-exp=1720784277.660814131,0 pro=1720784271.665515631,0
I240712 11:39:55.726797 331 kv/kvserver/queue.go:918 ⋮ [T1,Vsystem,n1,merge,s1,r62/1:‹/Table/6{0-1}›] 2275  ‹queue.merge: processing replica [outer]›: skipping 1 since replica can't be processed needs lease, not adding: repl=(n4,s4):2 seq=2 start=1720784271.662927639,0 epo=1 min-exp=1720784277.660814131,0 pro=1720784271.665515631,0
E240712 11:40:15.728351 331 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n1,merge,s1,r75/2:‹/Table/108/1/227{23/1…-46/1…}›] 2276  needs lease, not adding: repl=(n4,s4):3 seq=6 start=1720784354.812339921,0 epo=1 min-exp=1720784360.812271848,0 pro=1720784354.813682523,0
I240712 11:40:15.728383 331 kv/kvserver/queue.go:918 ⋮ [T1,Vsystem,n1,merge,s1,r75/2:‹/Table/108/1/227{23/1…-46/1…}›] 2277  ‹queue.merge: processing replica [outer]›: skipping 2 since replica can't be processed needs lease, not adding: repl=(n4,s4):3 seq=6 start=1720784354.812339921,0 epo=1 min-exp=1720784360.812271848,0 pro=1720784354.813682523,0
I240712 11:40:16.294544 302 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n1,s1,r1/1:‹/{Min-System/NodeL…}›,raft] 2278  removing replica r1/1
I240712 11:40:16.338596 405 kv/kvserver/replica_command.go:2454 ⋮ [T1,Vsystem,n1,s1,r1/1:‹/{Min-System/NodeL…}›,*kvpb.AdminChangeReplicasRequest] 2279  skipping learner removal because it was already removed
W240712 11:40:23.967992 348 2@kv/kvserver/replica_proposal.go:638 ⋮ [T1,Vsystem,n1,s1,r5696/3:‹/Table/108/1/1352{64/3…-87/3…}›,raft] 2280  lease expired before epoch lease upgrade, client traffic may have been delayed [lease=repl=(n1,s1):3 seq=7 start=1720784420.837957585,1 epo=1 min-exp=0,0 pro=1720784423.964632265,0 prev=repl=(n10,s10):2 seq=6 start=1720784414.838028920,0 exp=1720784420.837957585,0 pro=1720784414.837957585,0 acquisition-type=‹Request›]
W240712 11:40:37.419344 203429 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2281  slow replica RPC: have been waiting 10.88s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹121579›/‹4›/‹2294›,/Table/108/1/‹121581›/‹9›/‹208›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:37.474066 203451 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2282  slow replica RPC: have been waiting 10.63s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹115590›/‹2›/‹2169›,/Table/108/1/‹115594›/‹1›/‹311›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:39.511583 203441 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2283  slow replica RPC: have been waiting 12.78s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹116290›/‹9›/‹804›,/Table/108/1/‹116294›/‹7›/‹1909›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:39.876789 203430 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2284  slow replica RPC: have been waiting 13.59s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹105817›/‹9›/‹697›,/Table/108/1/‹105819›/‹8›/‹1274›) to replica (n10,s10):4; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:40.583208 203437 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2285  slow replica RPC: have been waiting 12.62s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹127243›/‹6›/‹609›,/Table/108/1/‹127247›/‹4›/‹1712›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:40.670473 203456 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2286  slow replica RPC: have been waiting 12.01s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹128616›/‹1›/‹1857›,/Table/108/1/‹128619›/‹9›/‹2987›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:42.579965 203449 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2287  slow replica RPC: have been waiting 14.47s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹110425›/‹6›/‹1567›,/Table/108/1/‹110429›/‹4›/‹2668›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:46.580727 203427 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2288  slow replica RPC: have been waiting 19.91s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹130848›/‹2›/‹34›,/Table/108/1/‹130851›/‹10›/‹1096›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:47.451923 203438 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2289  slow replica RPC: have been waiting 14.31s (0 attempts) for RPC LinkExternalSSTable [/Table/108/1/‹68837›/‹8›/‹158›,/Table/108/1/‹68839›/‹7›/‹680›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:47.536639 203450 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2290  slow replica RPC: have been waiting 15.41s (0 attempts) for RPC LinkExternalSSTable [/Table/110/2/‹68400›/‹6›/‹923›/‹-845›,/Table/110/2/‹68462›/‹9›/‹2771›/‹-396›) to replica (n10,s10):2; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:47.562295 203440 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2291  slow replica RPC: have been waiting 12.17s (0 attempts) for RPC LinkExternalSSTable [/Table/113/1/‹723›/‹29042›,/Table/113/1/‹725›/‹30720›) to replica (n10,s10):4; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›
W240712 11:40:47.593209 203458 kv/kvclient/kvcoord/dist_sender.go:2766 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=985379219784892417›] 2292  slow replica RPC: have been waiting 11.65s (0 attempts) for RPC LinkExternalSSTable [/Table/113/1/‹1641›/‹32378›,/Table/113/1/‹1643›/‹37433›) to replica (n10,s10):4; resp: ‹(err: <nil>), *kvpb.LinkExternalSSTableResponse›

This seems to have occurred after KV removed 800+ replicas from node 1... perhaps that's just a product of the split and scatter phase. Furthermore, removing empty replicas should be quite cheap, so I don't see how that could cause the slow LinkExternalSSTable RPC warnings.

❯ grep "removing replica" cockroach.log | wc -l
     861
msbutler commented 1 month ago

The restore job on node 1 errored when its progress loop attempted to write to the jobs system on behalf of the job, because node 1's claim session was no longer the one recorded on the job:

E240712 11:41:44.995718 14600 jobs/registry.go:1585 ⋮ [T1,Vsystem,n1] 2363  RESTORE job 985379219784892417: stepping through state reverting with unexpected error: importing 694 ranges: job progress loop: job 985379219784892417: with status "running": expected session "01018090ad60e2559c41e5bdc46e9cc9d64c6a" but found "010180aff75cff8ad04b25a2d50c96edf1bd8c"

and indeed node 8 picked up the job after node 1's session expired.

8.unredacted/cockroach.log:W240712 11:41:47.043883 168069 ccl/backupccl/restore_online.go:106 ⋮ [T1,Vsystem,n8,job=‹RESTORE id=985379219784892417›] 1984  failed to scatter during experimental restore: existing range size 169405228 exceeds specified limit 4194304
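For intuition, here is a minimal, hypothetical sketch of the kind of session-fenced write that produces this error (the names `jobRecord` and `updateProgress` are made up; the real check lives in `pkg/jobs`): progress updates are gated on the claim session stored on the job row, so once node 8 re-claims the job, node 1's write fails with the mismatch above.

```go
package main

import (
	"bytes"
	"fmt"
)

// jobRecord is a stand-in for the job row, including the claim session of the
// node currently coordinating the job.
type jobRecord struct {
	id           int64
	status       string
	claimSession []byte
}

// updateProgress refuses to write on behalf of a coordinator whose claim
// session no longer matches the one recorded on the job.
func updateProgress(rec *jobRecord, callerSession []byte) error {
	if !bytes.Equal(rec.claimSession, callerSession) {
		return fmt.Errorf("job %d: with status %q: expected session %x but found %x",
			rec.id, rec.status, callerSession, rec.claimSession)
	}
	// ... write the progress update here ...
	return nil
}

func main() {
	// Node 8 has re-claimed the job, so its session is on the record now.
	rec := &jobRecord{id: 985379219784892417, status: "running", claimSession: []byte{0xaf, 0xf7}}
	// Node 1 still presents its old session and gets the mismatch error.
	fmt.Println(updateProgress(rec, []byte{0x90, 0xad}))
}
```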

According to the jobs table snapshot in the debug zip, the job is indeed running!

So this seems to be a real bug: when the job coordinator loses its lease, it should not send the error:

job 985379219784892417: could not mark as reverting: job 985379219784892417: with status "running": expected session "01018090ad60e2559c41e5bdc46e9cc9d64c6a" but found "010180aff75cff8ad04b25a2d50c96edf1bd8c"

back to the client; rather, it should be logged, since it implies that a different node has resumed the job.

Here's the final error in the jobs system before it is returned to the client: https://github.com/msbutler/cockroach/blob/butler-gate-retry-reason/pkg/jobs/adopt.go#L456
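To illustrate the gating being proposed, here is a rough sketch under the assumption that the lost-claim condition can be recognized from the error; `errLostClaim`, `isLostClaimError`, and `reportJobOutcome` are hypothetical names, not the real jobs API, and the actual change would presumably hook in around the adopt.go line linked above.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errLostClaim stands in for the "expected session ... but found ..." condition.
var errLostClaim = errors.New("job claim session no longer held by this node")

// isLostClaimError would recognize a claim-session mismatch; here it's a stub.
func isLostClaimError(err error) bool {
	return errors.Is(err, errLostClaim)
}

// reportJobOutcome decides what, if anything, to surface to the SQL client.
func reportJobOutcome(jobID int64, resumeErr error) error {
	if resumeErr == nil {
		return nil
	}
	if isLostClaimError(resumeErr) {
		// Another node has re-claimed the job and is still running it, so log
		// the condition instead of marking the job reverting or returning this
		// error as the job's outcome.
		log.Printf("job %d: coordinator lost its claim; another node has resumed the job: %v", jobID, resumeErr)
		return nil
	}
	return fmt.Errorf("job %d failed: %w", jobID, resumeErr)
}

func main() {
	fmt.Println(reportJobOutcome(985379219784892417, errLostClaim)) // prints <nil>
}
```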