cockroach-teamcity opened 5 years ago · release-2.1
I'm unable to reproduce on master using `make stress PKG=./storage/ TESTS=TestPushTxnHeartbeatTimeout`. Trying release-2.1 now.
Reproduced in a little over a minute on release-2.1:
```
~ make stress PKG=./storage/ TESTS=TestPushTxnHeartbeatTimeout
...
--- FAIL: TestPushTxnHeartbeatTimeout (0.02s)
replica_test.go:5613: 4: expSuccess=false; got pErr <nil>, args={RequestHeader:{Key:"key-4" EndKey:/Min Sequence:0} PusherTxn:"pusher" id=1e0eef74 key="key-4" rw=false pri=0.01352606 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000123,74 orig=0.000000123,74 max=0.000000124,74 wto=false rop=false seq=0 PusheeTxn:{ID:a364f77e-d29f-4b0a-92de-aca445e76d0b Isolation:SERIALIZABLE Key:[107 101 121 45 52] Epoch:0 Timestamp:0.000000123,73 Priority:0 Sequence:1 DeprecatedBatchIndex:0} PushTo:1.000000123,73 Now:1.000000123,73 PushType:PUSH_ABORT Force:false}, reply=header:<> pushee_txn:<meta:<id:<a364f77e-d29f-4b0a-92de-aca445e76d0b> key:"key-4" timestamp:<wall_time:123 logical:77 > sequence:1 > name:"test-4" status:ABORTED last_heartbeat:<wall_time:123 logical:77 > orig_timestamp:<wall_time:123 logical:73 > max_timestamp:<wall_time:124 logical:73 > refreshed_timestamp:<> epoch_zero_timestamp:<> >
FAIL
ERROR: exit status 1
4205 runs completed, 1 failures, over 1m16s
```
I'm not terribly familiar with this code. Seems a bit worrisome for this to be failing. @andreimatei and @tschottdorf can you also take a look?
Took another crack at reproducing this on master this morning and had success:
```
--- FAIL: TestPushTxnHeartbeatTimeout (0.02s)
replica_test.go:5711: 7: expSuccess=false; got pErr <nil>, args={RequestHeader:{Key:"key-7" EndKey:/Min Sequence:0} PusherTxn:"pusher" id=846652a6 key="key-7" rw=false pri=0.01264969 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000123,108 orig=0.000000123,108 max=0.000000124,108 wto=false rop=false seq=0 PusheeTxn:{ID:091630d6-1485-4166-88c0-8c4b2fe8e545 Isolation:SERIALIZABLE Key:[107 101 121 45 55] Epoch:0 Timestamp:0.000000123,107 Priority:0 Sequence:1 DeprecatedBatchIndex:0} PushTo:2.000000122,107 Now:2.000000122,107 PushType:PUSH_ABORT Force:false}, reply=header:<> pushee_txn:<meta:<id:<091630d6-1485-4166-88c0-8c4b2fe8e545> key:"key-7" timestamp:<wall_time:123 logical:111 > sequence:1 > name:"test-7" status:ABORTED last_heartbeat:<wall_time:123 logical:111 > orig_timestamp:<wall_time:123 logical:107 > max_timestamp:<wall_time:124 logical:107 > refreshed_timestamp:<> epoch_zero_timestamp:<> >
FAIL
```
So whatever is going on, it wasn't fixed on master recently.
Taking a look.
Looks like the code below does it. We're probably setting up priorities that have some randomness in them.
```go
case CanPushWithPriority(&args.PusherTxn, &reply.PusheeTxn):
	reason = "pusher has priority"
	pusherWins = true
```
Reopening to investigate based on @andreimatei's confusion in https://github.com/cockroachdb/cockroach/pull/31965.
Following up on @andreimatei's https://github.com/cockroachdb/cockroach/pull/31965#issuecomment-433992581:
`MakePriority(1)` returns a randomized priority. It's only when you specify `{Min,Max}UserPriority` that you get "fixed" ones.
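To illustrate the behavior described above, here's a minimal sketch of how such a mapping from user priority to internal priority could look. The constant names and values, and the scaling formula, are assumptions for illustration only; this is a simplification, not the real `MakePriority`:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Illustrative stand-ins for the priority bounds; these names and
// values are assumptions for this sketch, not the real constants.
const (
	minUserPriority       = 0.001
	maxUserPriority       = 1000.0
	minTxnPriority  int32 = 0
	maxTxnPriority  int32 = math.MaxInt32
)

// makePriority sketches the behavior described above: only the extreme
// user priorities yield fixed internal priorities. Everything in
// between, including the default of 1, is randomized and merely scaled
// by the user priority.
func makePriority(userPriority float64) int32 {
	switch {
	case userPriority <= minUserPriority:
		return minTxnPriority // fixed: always loses
	case userPriority >= maxUserPriority:
		return maxTxnPriority // fixed: always wins
	}
	// Randomized: the distribution scales with userPriority, so two
	// txns with the same user priority can land in either order.
	frac := rand.Float64() * userPriority / maxUserPriority
	return 1 + int32(frac*float64(maxTxnPriority-2))
}

func main() {
	fmt.Println(makePriority(minUserPriority)) // always the fixed minimum
	fmt.Println(makePriority(maxUserPriority)) // always the fixed maximum
	// Default user priority of 1: randomized, so repeated calls differ.
	fmt.Println(makePriority(1), makePriority(1))
}
```

This is exactly why a test that compares two default-priority transactions can flake: the winner of the priority comparison is a coin flip.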
This seems wrong, but it also appears to be what you get in the SQL layer:
What saves us, I think, is this code:
We're going to go into the queue unless the priority is either of `{Min,Max}UserPriority`. But if there's nothing in the queue, we'll go to push immediately:
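The gating described above can be sketched as follows. This is a hypothetical simplification of `ShouldPushImmediately` under the assumption that only the fixed extreme priorities bypass the queue; the constant values and field names are illustrative:

```go
package main

import "fmt"

// Hypothetical fixed extremes, standing in for the priorities produced
// by {Min,Max}UserPriority; not the real constants.
const (
	minTxnPriority int32 = 0
	maxTxnPriority int32 = 1<<31 - 1
)

type txn struct {
	name     string
	priority int32
}

// shouldPushImmediately sketches the gating described above: a pusher
// bypasses the txn wait queue only when the outcome is certain, i.e.
// the pusher has the fixed max priority or the pushee the fixed min.
// Randomized (default) priorities always go through the queue first.
func shouldPushImmediately(pusher, pushee txn) bool {
	return pusher.priority == maxTxnPriority || pushee.priority == minTxnPriority
}

func main() {
	hi := txn{"txn2", 2_000_000} // higher randomized priority
	lo := txn{"txn1", 1_000_000} // lower randomized priority
	fmt.Println(shouldPushImmediately(hi, lo)) // false: still queues
	fmt.Println(shouldPushImmediately(txn{"sys", maxTxnPriority}, lo)) // true
}
```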
This smells. Say you have txns txn1 and txn2, and txn2 has higher priority (but they're both "equally good" SQL transactions). If txn2 runs into an intent of txn1, it'll just push it and abort txn1. But if some other txn txn3 with a priority lower than txn1's tries to push txn1 first, it'll get a failure and put txn1 in the txnWaitQueue, and now if txn2 comes along it'll enqueue and wait for txn1 to complete. This behavior seems pretty unsubstantiated.
I wonder if my reading of this is just wrong (but I spent ~20min looking at it and am fairly sure it isn't). I also noticed that the testing doesn't exercise this at all because it enqueues the pushee manually before sending the pusher along. If it didn't do that, it'd also flake left and right, because it also uses a user priority of 1.
cc @nvanbenschoten since this also has potential performance implications as I expect it to cause more spurious aborts than necessary.
> If txn2 runs into an intent of txn1, it'll just push it and abort txn1.
That's not my understanding. Unless txn1 has `MinPriority` or txn2 has `MaxPriority`, `ShouldPushImmediately` will return false and txn1 will be put in the txnWaitQueue. If the txn is not already in the txnWaitQueue then the `PushTxnRequest` will evaluate. However, what we're missing is `CanPushWithPriority`:

This logic ensures that evaluating the `PushTxnRequest` results in a `TransactionPushError`, which will result in txn1 being put in the txnWaitQueue.
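The flow described above can be sketched like this. The function bodies are simplified assumptions (the real `CanPushWithPriority` also considers push type and transaction state), and the error type is a stand-in for the real `TransactionPushError`:

```go
package main

import (
	"errors"
	"fmt"
)

type txn struct {
	name     string
	priority int32
}

// Stand-in for the real TransactionPushError.
var errTransactionPush = errors.New("TransactionPushError")

// canPushWithPriority sketches the comparison referenced above: the
// pusher wins only with strictly higher priority. Simplified.
func canPushWithPriority(pusher, pushee txn) bool {
	return pusher.priority > pushee.priority
}

// evaluatePushTxn sketches the flow described above: if the priority
// check fails, evaluation returns a push error, which is what lands
// the pusher in the txnWaitQueue to wait on the pushee, rather than
// aborting the pushee outright.
func evaluatePushTxn(pusher, pushee txn) error {
	if canPushWithPriority(pusher, pushee) {
		fmt.Printf("%s aborts %s\n", pusher.name, pushee.name)
		return nil
	}
	return errTransactionPush // caller enqueues the push in the txnWaitQueue
}

func main() {
	txn1 := txn{"txn1", 10}
	txn3 := txn{"txn3", 5} // lower priority than txn1
	if err := evaluatePushTxn(txn3, txn1); err != nil {
		fmt.Println("txn3 queued:", err) // txn3 queued: TransactionPushError
	}
}
```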
Thanks for clarifying, @nvanbenschoten. This seems awfully roundabout, but at least it's working "as intended". Is there any particular reason we don't just assign some fixed number and make sure that it doesn't change? We're not using priorities for anything but breaking the deadlock (where we could also use anything else), are we?
> Is there any particular reason we don't just assign some fixed number and make sure that it doesn't change?
I've wondered this as well. I don't know the history well enough to give an answer. Perhaps @andreimatei does? Seems like we could just as easily use txn orig timestamps to break deadlocks.
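A deterministic tiebreak along the lines suggested above could look like this. This is a hypothetical policy sketch, not what CockroachDB implements; the timestamp type is a minimal stand-in for an HLC timestamp:

```go
package main

import "fmt"

// hlcTimestamp is a minimal stand-in for a hybrid logical clock
// timestamp (wall time plus logical tick), used only for this sketch.
type hlcTimestamp struct {
	wallTime int64
	logical  int32
}

// less orders timestamps by wall time, then by logical tick.
func (t hlcTimestamp) less(o hlcTimestamp) bool {
	return t.wallTime < o.wallTime ||
		(t.wallTime == o.wallTime && t.logical < o.logical)
}

// olderWinsPush sketches the deterministic tiebreak suggested above:
// instead of randomized priorities, break push deadlocks by letting
// the transaction with the earlier original timestamp win. The outcome
// is stable across retries, unlike a re-randomized priority.
func olderWinsPush(pusherOrig, pusheeOrig hlcTimestamp) bool {
	return pusherOrig.less(pusheeOrig)
}

func main() {
	a := hlcTimestamp{wallTime: 123, logical: 73}
	b := hlcTimestamp{wallTime: 123, logical: 74}
	fmt.Println(olderWinsPush(a, b)) // true: a started first, so a pushes b
	fmt.Println(olderWinsPush(b, a)) // false: b must wait on a
}
```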
What I know about the history of the priorities is that they used to matter, and Radu put some fancy math into generating them, and then they stopped mattering and now we have more code than we need.
Separately, one thing that I always saw as a wart is the fact that we have `roachpb.UserPriority` and also lower-level priorities (naked `int`s), and various APIs convert between the two. I think there was always chaos because some levels of the code that were dealing with restarts only knew the naked `int` and not the original `UserPriority`, although they would have liked to know it (specifically when dealing with restarts, I think). I tried to make things a bit saner in various refactorings, but perhaps I've made it worse.
I definitely think a cleanup in this area is in order.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
SHA: https://github.com/cockroachdb/cockroach/commits/0dba537ae88e495ddf29b4c347b4c30ee99bd046
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985354&tab=buildLog
Jira issue: CRDB-4777