Closed cockroach-teamcity closed 4 months ago
I'm going to see if I can reproduce this with https://github.com/cockroachdb/cockroach/pull/122056.
This is reproducing 1/10,000 failures. Interestingly, it's not hitting a validation failure even with lock promotion validation in there. I'll switch debugging tacts here.
It's readily reproducible under deadlock. Debugging using https://github.com/cockroachdb/cockroach/pull/122009.
It's readily reproducible under deadlock.
This turned out to be a red herring. Turns out, this test is extremely susceptible to overload. If I don't constrain the number of parallel runs (using --cpu
), we see failures on release-23.2
as well.
I haven't been able to piece the entire story together with the logging I've added. Most failures take the form where all requests are queued behind a single lock holder. For example:
We see that all requests are blocked behind req 388, which itself is waiting on the lock held by txn 000000000194
. In all my reproductions, the transaction in question that holds the lock has been finalized previously. Moreover, its locks have also been resolved, and resolved correctly, as judging by the lock table's state. How is this lock appearing in the lock table again?
One possibility would be if the test is slow and there's a "lagging" lock acquisition that lands after the locks have been resolved. But that doesn't seem plausible given we won't kick off intent resolution if there are in-flight requests still pending. See:
The test also uses monotonic transaction IDs, so it's not possible for us to be creating a new transaction with the same ID.
Right now, I'm simply failing the test if I find a request is waiting for > 4 minutes. One thread to pull on would be to check introspect the waitingState
on the request. I also want to add some sanity checks around known-to-be-finalized transactions.
All this being said, it seems like we're ending up in a state where the lock is held by a transaction that has been abandoned. If we were to end up in such a situation in reality, we'd detect and handle it correctly when the lock holder is pushed. Given this, and how rare[*] this is (outside of an overloaded test), I'm inclined to downgrade this to a GA-blocker. I still want to understand this a bit better before we're fully confident here. (cc @nvanbenschoten)
[*] For context, I've given this over 150K runs on my GCE worker today, without running under --deadlock
, and I've only seen it fail once.
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ f117eea22dd7be380c7141cdf6cd7aba92dd9c70:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_8 (24m54s)
Stack:
goroutine 833581 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests2351147515 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_2 items: 1692, aborted: 0, concurrency violations: 758, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_8 ```
Parameters:
attempt=1
deadlock=true
run=3
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 67547d7724f8a52646e2e8ecb3ca48b923957d90:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_8 (24m55s)
Stack:
goroutine 1033150 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests3274046890 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_2 items: 1671, aborted: 0, concurrency violations: 661, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_8 ```
Parameters:
attempt=1
deadlock=true
run=2
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on release-24.1 @ 6e3e9e6012dc04dabc64333a87e77a80d9f9d46b:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_16 (24m39s)
Stack:
goroutine 2588996 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests1983242438 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_16 ```
Parameters:
attempt=1
deadlock=true
run=2
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 8daa6933836c8ce1108fb22bf6663117fc71599b:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_16 (24m46s)
Stack:
goroutine 2996049 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests4126696952 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_2 items: 1677, aborted: 0, concurrency violations: 861, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_8 items: 1677, aborted: 0, concurrency violations: 857, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_16 ```
Parameters:
attempt=1
deadlock=true
run=1
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 737df7fe75c5698d9ba384ad322f30963582cc42:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_2 (24m57s)
Stack:
goroutine 196958 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests3562098866 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_2 ```
Parameters:
attempt=1
deadlock=true
run=2
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 6e94cb3b7992de4dec7688750cfd13971dd2930c:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_4 (9m32s)
Stack:
goroutine 265982376 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests127189903 test_log_scope.go:81: use -show-logs to present logs inline lock_table_test.go:1609: numRequests: 10000; numActiveTxns: 4; probability(txn-al reqs): 0.75; probability(new txns): 0.25; probability(duplicate access): 0.50 lock_table_test.go:1734: txns creted: 1874 === RUN TestLockTableConcurrentRequests/concurrency_1 items: 11874, aborted: 0, concurrency violations: 11131, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_4 ```
Parameters:
attempt=1
deadlock=true
run=1
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on release-24.1 @ 3111a51bdb2cbf889c92f4440bf88b0eb682cb96:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_16 (24m44s)
Stack:
goroutine 2930059 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests2180491673 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_16 ```
Parameters:
attempt=1
deadlock=true
run=3
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ e0068814dfcb4f975a53b79b5546b5fb85c0f927:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_32 (21s)
Stack:
goroutine 315646842 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests103809738 test_log_scope.go:81: use -show-logs to present logs inline lock_table_test.go:1609: numRequests: 10000; numActiveTxns: 16; probability(txn-al reqs): 0.75; probability(new txns): 0.25; probability(duplicate access): 0.50 lock_table_test.go:1734: txns creted: 1837 === RUN TestLockTableConcurrentRequests/concurrency_16 items: 11837, aborted: 0, concurrency violations: 11485, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_32 ```
Parameters:
attempt=1
deadlock=true
run=1
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 347cdc76d4c5abb2e872f325e944337a46b5883f:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_8 (12m22s)
Stack:
goroutine 350542742 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests4178933035 test_log_scope.go:81: use -show-logs to present logs inline lock_table_test.go:1609: numRequests: 10000; numActiveTxns: 8; probability(txn-al reqs): 0.90; probability(new txns): 0.10; probability(duplicate access): 0.50 lock_table_test.go:1734: txns creted: 917 === RUN TestLockTableConcurrentRequests/concurrency_2 items: 10917, aborted: 0, concurrency violations: 10537, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_8 ```
Parameters:
attempt=1
deadlock=true
run=1
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ e215cf435851cb9490fe4494fe73d02041c4d2c0:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_4 (12m9s)
Stack:
goroutine 368018592 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests613030006 test_log_scope.go:81: use -show-logs to present logs inline lock_table_test.go:1609: numRequests: 10000; numActiveTxns: 4; probability(txn-al reqs): 0.75; probability(new txns): 0.10; probability(duplicate access): 0.50 lock_table_test.go:1734: txns creted: 753 === RUN TestLockTableConcurrentRequests/concurrency_1 items: 10753, aborted: 0, concurrency violations: 10204, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_4 ```
Parameters:
attempt=1
deadlock=true
run=2
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 0fbef53e48c2a3784631a09a42236e1d0636a5a7:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_4 (11m8s)
Stack:
goroutine 358365045 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests2386894273 test_log_scope.go:81: use -show-logs to present logs inline lock_table_test.go:1609: numRequests: 10000; numActiveTxns: 2; probability(txn-al reqs): 0.75; probability(new txns): 0.25; probability(duplicate access): 0.50 lock_table_test.go:1734: txns creted: 1888 === RUN TestLockTableConcurrentRequests/concurrency_2 items: 11888, aborted: 0, concurrency violations: 11115, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_4 ```
Parameters:
attempt=1
deadlock=true
run=1
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ fe6496ccabf2d4b2234cf8418004e5863847c0bd:
Fatal error:
panic: test timed out after 24m57s
running tests:
TestLockTableConcurrentRequests (24m57s)
TestLockTableConcurrentRequests/concurrency_4 (10m55s)
Stack:
goroutine 329327927 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests829779615 test_log_scope.go:81: use -show-logs to present logs inline lock_table_test.go:1609: numRequests: 10000; numActiveTxns: 2; probability(txn-al reqs): 0.75; probability(new txns): 0.25; probability(duplicate access): 0.50 lock_table_test.go:1734: txns creted: 1845 === RUN TestLockTableConcurrentRequests/concurrency_2 items: 11845, aborted: 0, concurrency violations: 10743, lock table: num=0 === RUN TestLockTableConcurrentRequests/concurrency_4 ```
Parameters:
attempt=1
deadlock=true
run=1
shard=6
kv/kvserver/concurrency.TestLockTableConcurrentRequests failed on master @ 911f1ce389342459592d89fe7b78bcba3a2d265a:
Fatal error:
Stack:
Log preceding fatal error
``` === RUN TestLockTableConcurrentRequests test_log_scope.go:170: test logs captured to: outputs.zip/logTestLockTableConcurrentRequests4202612793 test_log_scope.go:81: use -show-logs to present logs inline === RUN TestLockTableConcurrentRequests/concurrency_2 ```
Parameters:
attempt=1
deadlock=true
run=3
shard=6
Help
See also: How To Investigate a Go Test Failure (internal)
/cc @cockroachdb/kvThis test on roachdash | Improve this report!
Jira issue: CRDB-37695