dgraph-io / dgraph

The high-performance database for modern applications
https://dgraph.io

Group checksum mismatch and pending mutation blocks read-only queries indefinitely #5368

Closed danielmai closed 4 years ago

danielmai commented 4 years ago

What version of Dgraph are you using?

v20.03.1

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

Ubuntu Linux

Steps to reproduce the issue (command/config used to run Dgraph).

Create a 3 Alpha replica cluster, run a whole series of read-only queries. At the same time, open a new transaction to send a mutation that writes a new predicate. The new predicate changes the group checksum, and the read-only queries fail to respond.

These are the steps to reproduce (and here's an asciinema recording):

  1. Run Dgraph cluster with 3 Alpha replicas:
    mkdir data
    compose --num_zeros 1 --num_alphas 3 --data_dir ./data --local=false --tag=v20.03.1
  2. Run dgraph increment to create a predicate and then run many read-only queries as quickly as possible (no --wait flag, or --wait=0.1s should work too):
    dgraph increment --alpha localhost:9180
    dgraph increment --alpha localhost:9180 --num=1000000000 --ro
  3. While running Step 2, run a mutation with a new predicate, but don't commit it. Here's a one-liner to run a mutation with a randomly named predicate that would count as a new predicate:
    i="$(openssl rand -hex 10 | cut -c-20)"; curl -H Content-Type:application/json 'localhost:8180/mutate' --data-binary "{\"set\": [{\"name$i\": \"\"}]}"
    • This creates a predicate name with a random 20-character hex string.
    • The call to /mutate opens a new txn and does not commit it.

Repeat Step 3 until the read-only queries in Step 2 get blocked:

0505 12:01:41.934 Counter VAL: 1   [ Ts: 56 ] Latency: Q 2ms M 0s S 2ms C 3ms D 1ms
0505 12:01:41.936 Counter VAL: 1   [ Ts: 56 ] Latency: Q 2ms M 0s S 2ms C 3ms D 1ms
0505 12:01:46.937 While trying to process counter: while doing query: rpc error: code = DeadlineExceeded desc = context deadline exceeded. Retrying...
0505 12:01:52.938 While trying to process counter: while doing query: rpc error: code = DeadlineExceeded desc = context deadline exceeded. Retrying...

In Jaeger/zPages, you'll see a trace error for api.Dgraph.Query with the error message Group checksum mismatch for id: 1:

http://localhost:8180/z/tracez?zspanname=api.Dgraph.Query&ztype=2&zsubtype=0

2020/05/04-23:43:51.899578     5.000438 trace_id: 39a6de5cc3e9ccc232dc45b864b7c441 span_id: 3f7c906e11b127c2 parent_span_id: 69ad9eef2b2ed877
                                        Status{canonicalCode=UNKNOWN, description=": Group checksum mismatch for id: 1"}
                                        Attributes:{Client=false, FailFast=false}
           23:43:51.899632      .    53 received message [58 bytes, 58 compressed bytes]

Eventually, when the open transaction gets aborted, the queries become unblocked. By default, open transactions are aborted after 5 minutes of inactivity.

Expected behaviour and actual result.

Queries should not get blocked by a pending transaction.

martinmr commented 4 years ago

Right now the code explicitly blocks until the checksums match in two places:

  1. In sort.go, at the beginning of processSort.
  2. In task.go, at the beginning of processTask.

I am still not entirely sure why uncommitted transactions cause the checksums to change. Blocking makes sense for normal queries, but I don't think read-only queries should be blocked. Not blocking them could be the fix, and it would be a very easy one. I'll discuss this with Manish.

martinmr commented 4 years ago

Found the root cause, but I still don't know what the proper fix is.

The bug is in this part of processOracleDeltaStream:


        SLURP:
            for {
                select {
                case more := <-deltaCh:
                    if more == nil {
                        return
                    }
                    batch++
                    delta.Txns = append(delta.Txns, more.Txns...)
                    delta.MaxAssigned = x.Max(delta.MaxAssigned, more.MaxAssigned)
                default:
                    break SLURP
                }
            }

The latest delta gets appended to the delta previously received from the channel (my guess is that this was done to reduce the number of proposals). However, the GroupChecksums field is lost at this step, so queries get stuck waiting for the checksums to match.

Adding another tablet via the mutation in Step 3 unblocks the process. If the mutations can safely be applied in the order they are received from the stream, then the fix is simply to overwrite the group checksums. I am not sure that's the case, so I'll keep looking.