All reads time out indefinitely, possibly due to stalled predicate migration?

aphyr commented 6 years ago

On the build @manishrjain submitted for testing on Friday, May 18, 2018, single-record set tests can stall indefinitely during reading. Although the cluster is stable, all nodes are running, and the network is totally connected, and all test-initiated predicate migrations appear to have completed, every read request will time out. This condition can last for at least an hour.

For instance, see 20180522T125649.000-0500.zip, where in the middle of the final reads, transactions just... start timing out.

2018-05-22 13:04:41,203{GMT}    INFO    [jepsen worker 7] jepsen.dgraph.set: Forcing conflict by inserting 7517
2018-05-22 13:04:41,297{GMT}    INFO    [jepsen worker 6] jepsen.dgraph.set: Forcing conflict by inserting 5551
2018-05-22 13:04:41,297{GMT}    INFO    [jepsen worker 5] jepsen.dgraph.set: Forcing conflict by inserting 6854
2018-05-22 13:04:41,297{GMT}    INFO    [jepsen worker 0] jepsen.dgraph.set: Forcing conflict by inserting 7368
2018-05-22 13:04:46,198{GMT}    INFO    [jepsen worker 4] jepsen.util: 4    :info   :read   nil :timeout
2018-05-22 13:04:46,202{GMT}    INFO    [jepsen worker 9] jepsen.util: 9    :info   :read   nil :timeout
2018-05-22 13:04:46,230{GMT}    INFO    [jepsen worker 7] jepsen.util: 7    :info   :read   nil :timeout
2018-05-22 13:04:46,234{GMT}    INFO    [jepsen worker 2] jepsen.util: 2    :info   :read   nil :timeout
2018-05-22 13:04:46,271{GMT}    INFO    [jepsen worker 1] jepsen.util: 1    :info   :read   nil :timeout

Although each process goes on to retry the read process, no subsequent query ever returns:

8-05-22 13:52:27,172{GMT}   INFO    [jepsen worker 9] jepsen.util: 839  :invoke :read   nil
2018-05-22 13:52:28,607{GMT}    INFO    [jepsen worker 8] jepsen.util: 798  :info   :read   nil :timeout
2018-05-22 13:52:32,179{GMT}    INFO    [jepsen worker 9] jepsen.util: 839  :info   :read   nil :timeout
2018-05-22 13:52:38,642{GMT}    INFO    [jepsen worker 2] jepsen.util: 892  :invoke :read   nil
2018-05-22 13:52:40,683{GMT}    INFO    [jepsen worker 3] jepsen.util: 793  :invoke :read   nil
2018-05-22 13:52:43,650{GMT}    INFO    [jepsen worker 2] jepsen.util: 892  :info   :read   nil :timeout
2018-05-22 13:52:45,689{GMT}    INFO    [jepsen worker 3] jepsen.util: 793  :info   :read   nil :timeout

About ten seconds after operations start timing out, Zero on n1 logs:

2018/05/22 11:04:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:04:57 tablet.go:184: size_diff 2218
2018/05/22 11:04:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:24:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:24:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2018/05/22 11:24:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:24:57 tablet.go:184: size_diff 2218
2018/05/22 11:24:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:44:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:44:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2018/05/22 11:44:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:44:57 tablet.go:184: size_diff 2218
2018/05/22 11:44:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1

And n4 logs corresponding predicate moves:

2018/05/22 11:04:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:04:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:05:11 draft.go:825: Writing snapshot at index: 13819, applied mark: 26530
2018/05/22 11:24:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:24:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:44:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:44:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0

No other node logs anything after 11:04. Is it possible that an automatic predicate migration started, then got stuck somehow? The 20-minute intervals between migrations suggest that it's retrying the migration, at least, but timing out every request looks... odd.

You can reproduce this with Jepsen b0b458d32e43c072f257b75ea786431ea0d0c7a5 by running:

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing client --upsert-schema --time-limit 300 --concurrency 2n --nemesis-interval 5 --workload uid-set --retry-db-setup --test-count 10

manishrjain commented 5 years ago

The reads no longer time out. But, the test never finishes. It keeps on spitting:

2019-02-01 23:22:16,808{GMT}    INFO    [jepsen worker 1] jepsen.dgraph.set: Forcing conflict by deleting ...

Looks like these conflicts don't happen, and the workers just keep looping over them forever. Might be a logic based on an old behavior of Dgraph which no longer happens.

manishrjain commented 5 years ago

Filed an issue: https://github.com/jepsen-io/jepsen/issues/307

manishrjain commented 5 years ago

Haven't seen this happen for a while. Closing.

dgraph-io / dgraph

All reads time out indefinitely, possibly due to stalled predicate migration? #2405