Closed: danielmai closed this issue 4 years ago.
Right now the code is explicitly blocking until the checksums match in two places:
1. In sort.go, at the beginning of processSort.
2. In task.go, at the beginning of processTask.

I am still not entirely sure why uncommitted transactions are causing the checksums to change, but I suppose blocking makes sense for normal queries. I don't think read-only queries should get blocked. This could be the fix, and it would be a very easy one. I'll discuss this with Manish.
Found the root cause, but I still don't know what the proper fix is.
The bug is in this part of processOracleDeltaStream:
    SLURP:
        for {
            select {
            case more := <-deltaCh:
                if more == nil {
                    return
                }
                batch++
                delta.Txns = append(delta.Txns, more.Txns...)
                delta.MaxAssigned = x.Max(delta.MaxAssigned, more.MaxAssigned)
            default:
                break SLURP
            }
        }
The latest delta gets appended to the delta that was previously received from the channel (my guess is that this was done to reduce the number of proposals). However, the GroupChecksum is lost at this step so the queries get stuck waiting for the checksums to match.
Adding another tablet via the mutation in step 3 unblocks the process. If I can safely apply the mutations in the order they are received from the stream, then the fix is simply to overwrite the group checksums with the ones from the latest delta. I am not sure if this is the case, so I'll keep looking.
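As a rough illustration of the "overwrite the group checksums" idea, the sketch below merges a newer delta into an accumulated one and carries the checksums along. The `OracleDelta` struct here is a simplified stand-in, and `mergeDelta` and `describe` are hypothetical helpers; the field named `GroupChecksums` and its map type are assumptions, not the real protobuf definition. It also assumes deltas can be applied in the order they arrive, which is exactly the open question above.

```go
package main

import "fmt"

// OracleDelta is a simplified, hypothetical stand-in for the delta
// message received from the stream. Field types are assumptions.
type OracleDelta struct {
	Txns           []uint64          // placeholder for transaction statuses
	MaxAssigned    uint64            // highest timestamp assigned so far
	GroupChecksums map[uint32]uint64 // group ID -> checksum (assumed shape)
}

// mergeDelta folds a newer delta into dst. Unlike the loop quoted above,
// it also copies the checksums, letting newer values overwrite older ones.
func mergeDelta(dst, more *OracleDelta) {
	dst.Txns = append(dst.Txns, more.Txns...)
	if more.MaxAssigned > dst.MaxAssigned {
		dst.MaxAssigned = more.MaxAssigned
	}
	if dst.GroupChecksums == nil {
		dst.GroupChecksums = make(map[uint32]uint64)
	}
	for gid, sum := range more.GroupChecksums {
		dst.GroupChecksums[gid] = sum
	}
}

// describe returns a short summary of a delta for printing.
func describe(d *OracleDelta) string {
	return fmt.Sprintf("txns=%d maxAssigned=%d checksums=%v",
		len(d.Txns), d.MaxAssigned, d.GroupChecksums)
}

func main() {
	delta := &OracleDelta{
		Txns:           []uint64{1},
		MaxAssigned:    5,
		GroupChecksums: map[uint32]uint64{1: 100},
	}
	more := &OracleDelta{
		Txns:           []uint64{2},
		MaxAssigned:    9,
		GroupChecksums: map[uint32]uint64{1: 200},
	}
	mergeDelta(delta, more)
	fmt.Println(describe(delta))
}
```

Overwriting per group ID, rather than replacing the whole map, keeps checksums for groups that the newer delta does not mention. Whether that is the right behavior depends on how the checksums are produced upstream.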
What version of Dgraph are you using?
v20.03.1
Have you tried reproducing the issue with the latest release?
Yes
What is the hardware spec (RAM, OS)?
Ubuntu Linux
Steps to reproduce the issue (command/config used to run Dgraph).
Create a 3 Alpha replica cluster, run a whole series of read-only queries. At the same time, open a new transaction to send a mutation that writes a new predicate. The new predicate changes the group checksum, and the read-only queries fail to respond.
These are the steps to reproduce (and here's an asciinema recording):
1. Create a 3 Alpha replica cluster.
2. Run dgraph increment to create a predicate, and then run many read-only queries as quickly as possible (no --wait flag, or --wait=0.1s should work too).
3. Send a mutation to /mutate to open a new txn, and do not commit it.
4. Repeat Step 3 until the read-only queries in Step 2 get blocked.

In Jaeger/zPages, you'll see a trace error for api.Dgraph.Query with the error message "Group checksum mismatch for id: 1":
http://localhost:8180/z/tracez?zspanname=api.Dgraph.Query&ztype=2&zsubtype=0
Eventually, when the open transaction gets aborted, the queries become unblocked. By default, open transactions are aborted after 5 minutes of inactivity.
Expected behaviour and actual result.
Queries should not get blocked by a pending transaction.