dgraph-io / dgraph


Request sent to wrong server #2227

Closed: Levatius closed this issue 6 years ago

Levatius commented 6 years ago

As suggested, creating a new issue to track overflow from: https://github.com/dgraph-io/dgraph/issues/2172

Summary:

Levatius commented 6 years ago

@pawanrawal @jimanvlad

pawanrawal commented 6 years ago

Does it enter a crash loop where it keeps on failing with this assert and hence isn't able to become the leader?

pawanrawal commented 6 years ago

Ok, this happens during a query. From the logs, the request was sent to a node serving group 2 and it thought that the predicate _programme_created belongs to group 1, hence the crash. Seems like a race condition, investigating how this could happen.

Do you know if there could have been a DropAttr request with the attribute as _programme_created?

Levatius commented 6 years ago

We do not drop any items from the graph whilst the pipeline is operational, we just drop everything when redeploying to run a fresh test.

In a recent crash around 2018/03/14 20:30:34, we see that the affected server recovers as leader of its group but all further predicate moves are met by "DeadlineExceeded" errors:

(server-0: logs_server-0.txt)

2018/03/14 20:29:17 draft.go:653: Couldn't take snapshot, txn watermark: [2221303], applied watermark: [2484344]
2018/03/14 20:29:47 draft.go:653: Couldn't take snapshot, txn watermark: [2221303], applied watermark: [2488920]
2018/03/14 20:30:17 draft.go:653: Couldn't take snapshot, txn watermark: [2221303], applied watermark: [2494364]
2018/03/14 20:30:33 predicate_move.go:164: Sent 45160 number of keys for predicate _affiliate_xid
2018/03/14 20:30:34 attr: "_affiliate_xid" groupId: 3 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/x/error.go:67
github.com/dgraph-io/dgraph/worker.(*grpcWorker).ServeTask
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/task.go:1261
github.com/dgraph-io/dgraph/protos/intern._Worker_ServeTask_Handler
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2542
google.golang.org/grpc.(*Server).processUnaryRPC
        /home/travis/gopath/src/google.golang.org/grpc/server.go:900
google.golang.org/grpc.(*Server).handleStream
        /home/travis/gopath/src/google.golang.org/grpc/server.go:1122
google.golang.org/grpc.(*Server).serveStreams.func1.1
        /home/travis/gopath/src/google.golang.org/grpc/server.go:617
runtime.goexit
        /home/travis/.gimme/versions/go1.9.4.linux.amd64/src/runtime/asm_amd64.s:2337
2018/03/14 20:30:39 gRPC server started. Listening on port 9080
2018/03/14 20:30:39 HTTP server started. Listening on port 8080
2018/03/14 20:30:39 groups.go:86: Current Raft Id: 1
2018/03/14 20:30:39 worker.go:99: Worker listening at address: [::]:7080
2018/03/14 20:30:39 pool.go:118: == CONNECT ==> Setting dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
2018/03/14 20:30:39 groups.go:109: Connected to group zero. Assigned group: 0
2018/03/14 20:30:39 pool.go:118: == CONNECT ==> Setting dgraph-server-1.dgraph-server.default.svc.cluster.local:7080
2018/03/14 20:30:39 pool.go:118: == CONNECT ==> Setting dgraph-server-2.dgraph-server.default.svc.cluster.local:7080
2018/03/14 20:30:39 draft.go:139: Node ID: 1 with GroupID: 1
2018/03/14 20:30:39 node.go:231: Found Snapshot, Metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2221293 Term:2 XXX_unrecognized:[]}
2018/03/14 20:30:39 node.go:246: Found hardstate: {Term:3 Vote:1 Commit:2496525 XXX_unrecognized:[]}
2018/03/14 20:30:40 node.go:258: Group 1 found 275232 entries
2018/03/14 20:30:40 draft.go:717: Restarting node for group: 1
2018/03/14 20:30:40 raft.go:567: INFO: 1 became follower at term 3
2018/03/14 20:30:40 raft.go:315: INFO: newRaft 1 [peers: [1], term: 3, commit: 2496525, applied: 2221293, lastindex: 2496525, lastterm: 3]
2018/03/14 20:30:48 raft.go:749: INFO: 1 is starting a new election at term 3
2018/03/14 20:30:48 raft.go:580: INFO: 1 became candidate at term 4
2018/03/14 20:30:48 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 4
2018/03/14 20:30:48 raft.go:621: INFO: 1 became leader at term 4
2018/03/14 20:30:48 node.go:301: INFO: raft.node: 1 elected leader 1 at term 4

(zero: logs_zero.txt)

Groups sorted by size: [{gid:3 size:240772844} {gid:2 size:301465102} {gid:1 size:349930572}]
2018/03/14 20:30:30 tablet.go:188: size_diff 109157728
2018/03/14 20:30:30 tablet.go:78: Going to move predicate: [_affiliate_xid], size: [4.0 MB] from group 1 to 3
2018/03/14 20:30:34 tablet.go:113: Predicate move done for: [_affiliate_xid] from group 1 to 3
2018/03/14 20:30:39 zero.go:333: Got connection request: id:1 addr:"dgraph-server-0.dgraph-server.default.svc.cluster.local:7080"
2018/03/14 20:30:39 zero.go:430: Connected
2018/03/14 20:38:30 tablet.go:183:
Groups sorted by size: [{gid:3 size:245704073} {gid:2 size:334222283} {gid:1 size:345961394}]
2018/03/14 20:38:30 tablet.go:188: size_diff 100257321
2018/03/14 20:38:30 tablet.go:78: Going to move predicate: [_checkout_xid], size: [3.5 MB] from group 1 to 3
2018/03/14 20:45:04 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:45:04 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:50:04 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:50:47 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:55:15 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:55:46 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:55:46 raft.go:556: While applying proposal: Tablet is already being served
2018/03/14 20:58:30 tablet.go:69: Error while trying to move predicate _checkout_xid from 1 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/03/14 20:58:30 tablet.go:183:

Other restarts before this seemed to recover fine.

Logs from other servers: logs_server-1.txt logs_server-2.txt

We were also wondering whether the following is an issue:

Couldn't take snapshot, txn watermark: [1324655], applied watermark: [2441032]

pawanrawal commented 6 years ago

The assert here happens after a predicate move in all cases.

2018/03/14 20:30:33 predicate_move.go:164: Sent 45160 number of keys for predicate _affiliate_xid
2018/03/14 20:30:34 attr: "_affiliate_xid" groupId: 3 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/x/error.go:67
github.com/dgraph-io/dgraph/worker.(*grpcWorker).ServeTask
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/task.go:1261
github.com/dgraph-io/dgraph/protos/intern._Worker_ServeTask_Handler
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2542
google.golang.org/grpc.(*Server).processUnaryRPC
        /home/travis/gopath/src/google.golang.org/grpc/server.go:900
google.golang.org/grpc.(*Server).handleStream
        /home/travis/gopath/src/google.golang.org/grpc/server.go:1122
google.golang.org/grpc.(*Server).serveStreams.func1.1
        /home/travis/gopath/src/google.golang.org/grpc/server.go:617
runtime.goexit
        /home/travis/.gimme/versions/go1.9.4.linux.amd64/src/runtime/asm_amd64.s:2337

What could happen is that a predicate move for attr is in progress from server 1 to server 2 when server 3 receives a request. Server 3 would forward the request to server 1, but by the time server 1 receives it, server 2 is already serving the predicate, and server 1 crashes. Since this is a temporary query error (until all servers catch up with the latest membership state from Zero), we can just return an error when this happens and the client can retry.
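
For illustration, here is a minimal, self-contained sketch of that idea (this is not the actual worker/task.go code; membership, serveTask and the predicate map are made-up stand-ins): the fatal assert is replaced with a retryable gRPC error.

package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// membership is a toy stand-in for a node's cached view of which group
// serves which predicate (in Dgraph this view comes from Zero).
type membership struct {
	myGroup     uint32
	predToGroup map[string]uint32
}

// serveTask mimics the shape of the failing check: when a request arrives for
// a predicate this node does not serve, it returns a retryable error instead
// of asserting and crashing the process.
func (m *membership) serveTask(attr string) error {
	gid := m.predToGroup[attr]
	if gid != m.myGroup {
		// Old behaviour: a fatal assert kills the server.
		// Sketched behaviour: surface a transient error the client can retry.
		return status.Errorf(codes.Unavailable,
			"attr: %q groupId: %d request sent to wrong server, please retry", attr, gid)
	}
	// ... process the query for attr ...
	return nil
}

func main() {
	m := &membership{myGroup: 1, predToGroup: map[string]uint32{"_affiliate_xid": 3}}
	if err := m.serveTask("_affiliate_xid"); err != nil {
		fmt.Println("transient error, caller should retry:", err)
	}
}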

pawanrawal commented 6 years ago

Couldn't take snapshot, txn watermark: [1324655], applied watermark: [2441032]

This is not the reason for the crash. This should typically be a temporary issue. I will modify the logs to print this as a warning to make it clear.

In a recent crash around 2018/03/14 20:30:34, we see that the affected server recovers as leader of its group but all further predicate moves are met by "DeadlineExceeded" errors:

Interestingly, not all further predicate moves were met by the error. From logs_server-2.txt, the server died at '18:58:37'

2018/03/14 18:58:38 attr: "_all_" groupId: 1 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef

After that, predicate moves were successful until the one below. Seems like the connection somehow got broken at this point and wasn't re-established. I am checking this.

2018/03/14 20:30:30 tablet.go:78: Going to move predicate: [_affiliate_xid], size: [4.0 MB] from group 1 to 3
2018/03/14 20:30:34 tablet.go:113: Predicate move done for: [_affiliate_xid] from group 1 to 3
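
Purely as an illustration of what "reconnected" would mean here, below is a small watch-and-redial loop over a gRPC connection; this is not Dgraph's pool.go logic, and the worker address is just one of the pod names from the logs.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// watchAndRedial keeps one connection to addr alive: it blocks on state
// changes and re-dials whenever the connection reaches a broken state.
func watchAndRedial(addr string) {
	for {
		conn, err := grpc.Dial(addr, grpc.WithInsecure(), grpc.WithBlock(),
			grpc.WithTimeout(10*time.Second))
		if err != nil {
			log.Printf("dial %s failed: %v, retrying", addr, err)
			time.Sleep(time.Second)
			continue
		}
		for {
			state := conn.GetState()
			if state == connectivity.TransientFailure || state == connectivity.Shutdown {
				log.Printf("connection to %s broken (%s), re-dialing", addr, state)
				conn.Close()
				break // fall back to the outer loop and re-dial
			}
			// Block until the connection state changes again.
			conn.WaitForStateChange(context.Background(), state)
		}
	}
}

func main() {
	watchAndRedial("dgraph-server-1.dgraph-server.default.svc.cluster.local:7080")
}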
pawanrawal commented 6 years ago

I have pushed a docker image at dgraph/dgraph:dev which adds more relevant logs. It doesn't fix all the issues that you are seeing but would help us debug some of them better.

It would be great if you could use it to load some data and share the logs. Otherwise, if you can share some steps to reproduce like some sample data/scripts, then I am happy to bring up a Kubernetes cluster and debug this at my end. Keen to resolve these issues as soon as possible.

Levatius commented 6 years ago

Since this is a temporary query error (until all servers catch up with the latest membership state from Zero), we can just return an error when this happens and the client can retry.

I see, that would be good if we could just retry the query and avoid having the server crash.
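
In the meantime, a client-side retry wrapper along these lines might do; this is a generic sketch (retryTransient is a hypothetical helper, and matching on the error string is an assumption), not the Dgraph client API:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// retryTransient retries fn a few times when the error looks like the
// temporary "Request sent to wrong server" condition, backing off between
// attempts; any other error is returned immediately.
func retryTransient(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "wrong server") {
			return err // not the transient membership error, give up
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // simple linear backoff
	}
	return err
}

func main() {
	calls := 0
	err := retryTransient(3, func() error {
		calls++
		if calls < 2 {
			return errors.New("rpc error: Request sent to wrong server")
		}
		return nil // query succeeded once the servers caught up with Zero
	})
	fmt.Println("result:", err, "after", calls, "calls")
}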

Interestingly, not all further predicate moves were met by the error. From logs_server-2.txt, the server died at '18:58:37'

Ah yes, I was not initially clear: there were multiple server crashes beforehand, but that particular crash at 20:30:34 seemed to be what broke things.

I have pushed a docker image at dgraph/dgraph:dev which adds more relevant logs. It doesn't fix all the issues that you are seeing but would help us debug some of them better.

We have pulled this image for testing, here are some logs surrounding a crash:

2018/03/16 14:10:52 draft.go:696: Writing snapshot at index: 1408440, applied mark: 1903731
2018/03/16 14:11:22 draft.go:696: Writing snapshot at index: 1571049, applied mark: 1910997
2018/03/16 14:11:52 draft.go:678: Couldn't take snapshot, txn watermark: [1571059], applied watermark: [1918832]
2018/03/16 14:12:22 draft.go:696: Writing snapshot at index: 1657197, applied mark: 1928588
2018/03/16 14:12:52 draft.go:696: Writing snapshot at index: 1738019, applied mark: 1942352
2018/03/16 14:13:22 draft.go:678: Couldn't take snapshot, txn watermark: [1738029], applied watermark: [1952159]
2018/03/16 14:13:52 draft.go:696: Writing snapshot at index: 1794546, applied mark: 1966497
2018/03/16 14:14:22 draft.go:696: Writing snapshot at index: 1861982, applied mark: 1978615
2018/03/16 14:14:52 draft.go:678: Couldn't take snapshot, txn watermark: [1861992], applied watermark: [1989882]
2018/03/16 14:15:12 predicate_move.go:281: Move predicate request for pred: [HAS_TRIGGER], src: [2], dst: [1]
2018/03/16 14:15:12 predicate_move.go:164: Sent 19 number of keys for predicate HAS_TRIGGER
2018/03/16 14:15:22 draft.go:696: Writing snapshot at index: 1985970, applied mark: 2000896
2018/03/16 14:15:29 attr: "_all_" groupId: 3 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef
        /home/pawan/go/src/github.com/dgraph-io/dgraph/x/error.go:67
github.com/dgraph-io/dgraph/worker.(*grpcWorker).ServeTask
        /home/pawan/go/src/github.com/dgraph-io/dgraph/worker/task.go:1261
github.com/dgraph-io/dgraph/protos/intern._Worker_ServeTask_Handler
        /home/pawan/go/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2542
google.golang.org/grpc.(*Server).processUnaryRPC
        /home/pawan/go/src/google.golang.org/grpc/server.go:900
google.golang.org/grpc.(*Server).handleStream
        /home/pawan/go/src/google.golang.org/grpc/server.go:1122
google.golang.org/grpc.(*Server).serveStreams.func1.1
        /home/pawan/go/src/google.golang.org/grpc/server.go:617
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:2337
2018/03/16 14:15:34 gRPC server started. Listening on port 9080
2018/03/16 14:15:34 HTTP server started. Listening on port 8080
2018/03/16 14:15:34 groups.go:86: Current Raft Id: 2
2018/03/16 14:15:34 worker.go:99: Worker listening at address: [::]:7080
2018/03/16 14:15:34 pool.go:118: == CONNECT ==> Setting dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
2018/03/16 14:15:34 groups.go:109: Connected to group zero. Assigned group: 0
2018/03/16 14:15:34 pool.go:118: == CONNECT ==> Setting dgraph-server-2.dgraph-server.default.svc.cluster.local:7080
2018/03/16 14:15:34 pool.go:118: == CONNECT ==> Setting dgraph-server-0.dgraph-server.default.svc.cluster.local:7080
2018/03/16 14:15:34 draft.go:139: Node ID: 2 with GroupID: 2
2018/03/16 14:15:34 node.go:231: Found Snapshot, Metadata: {ConfState:{Nodes:[2] XXX_unrecognized:[]} Index:1985970 Term:2 XXX_unrecognized:[]}
2018/03/16 14:15:34 node.go:246: Found hardstate: {Term:2 Vote:2 Commit:2001474 XXX_unrecognized:[]}
2018/03/16 14:15:34 node.go:258: Group 2 found 15505 entries
2018/03/16 14:15:34 draft.go:744: Restarting node for group: 2
2018/03/16 14:15:34 raft.go:567: INFO: 2 became follower at term 2
2018/03/16 14:15:34 raft.go:315: INFO: newRaft 2 [peers: [2], term: 2, commit: 2001474, applied: 1985970, lastindex: 2001474, lastterm: 2]
2018/03/16 14:15:38 raft.go:749: INFO: 2 is starting a new election at term 2
2018/03/16 14:15:38 raft.go:580: INFO: 2 became candidate at term 3
2018/03/16 14:15:38 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 3
2018/03/16 14:15:38 raft.go:621: INFO: 2 became leader at term 3
2018/03/16 14:15:38 node.go:301: INFO: raft.node: 2 elected leader 2 at term 3
2018/03/16 14:15:40 groups.go:336: Serving tablet for: _country_xid
2018/03/16 14:15:40 groups.go:336: Serving tablet for: Promo_Code_Set_Name
2018/03/16 14:15:40 groups.go:336: Serving tablet for: Promo_Code_Set_Id
2018/03/16 14:15:40 groups.go:336: Serving tablet for: Promo_Code

What new log lines should we be looking for? We could not spot anything new, but it could be subtle.

Otherwise, if you can share some steps to reproduce like some sample data/scripts

We are interested in creating a new sample rig; the current issue is that our pipeline now spans three pod types separated by message queues. We will try to come up with a single-pod approximation when possible, so that nothing extra is required to reproduce.

Levatius commented 6 years ago

Zero: dgraph-zero-0_logs.txt

pawanrawal commented 6 years ago

Could you please share full logs from all the servers, like before?

Specifically, I am looking for a log saying this in the server logs.

2018/03/16 14:15:40 groups.go:336: Serving tablet for: _all_

Since your crash is for that predicate, I want to see which server got to serve it.

pawanrawal commented 6 years ago

I have also updated the image (dgraph/dgraph:dev) to print more logs for Dgraph Zero (i.e. it logs all tablet proposals that it gets). If you can force-pull it again before doing another run, that would be helpful. I think I have a good idea of what is happening; the logs will help confirm it.

Levatius commented 6 years ago

Hi,

We ran the updated image today; we still see server crashes, but we no longer see the cause of the crash in the logs.

Here is everything we have:

logs_server-0.txt graph-server-0

logs_server-1.txt graph-server-1

logs_server-2.txt graph-server-2

logs_zero.txt graph-zero-0

The only thing I find odd is that the memory usage of dgraph-server-2 spikes (29 GB out of the node's 32 GB) at around 17:56, and the server crashes soon after.

pawanrawal commented 6 years ago

We ran the updated image today; we still see server crashes, but we no longer see the cause of the crash in the logs.

Since there isn't anything in the logs, I suspect these crashes are because of OOM, though the exact reason can only be found by inspecting the pods.

The crashes that you were seeing earlier were because of a race condition; you probably don't see them now because that condition isn't being triggered yet. If you run it again, I suspect it will trigger. Also, is it possible to take a heap profile when the memory usage is high and share it?
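
For the heap profile, something like the sketch below should work, assuming the server exposes Go's standard net/http/pprof handlers on its HTTP port (8080); the pod address is taken from the logs above and the output file name is arbitrary. The resulting file can then be inspected or shared with go tool pprof.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Fetches a heap profile from a running server and writes it to a local file.
func main() {
	url := "http://dgraph-server-2.dgraph-server.default.svc.cluster.local:8080/debug/pprof/heap"
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching heap profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, "creating output file:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "writing profile:", err)
		os.Exit(1)
	}
	fmt.Println("wrote heap.pb.gz")
}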

Thanks for the logs, I will see what I can find and update here.

pawanrawal commented 6 years ago

A fix for this (Request sent to the wrong server) has been merged to master and is also part of the updated dgraph/dgraph:dev image. Feel free to close this issue and open new ones regarding the issues you are facing.