tomwilkie closed this issue 6 years ago.
Deleting/recreating all the queriers caused it to recover, temporarily.
Adding GRPC_GO_LOG_SEVERITY_LEVEL=INFO to the pod got me:
WARNING: 2018/07/13 19:46:48 grpc: addrConn.createTransport failed to connect to {10.52.10.50:9095 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.52.10.50:9095: connect: connection refused". Reconnecting...
Which corresponds with the ingester that was exiting as part of the rollout:
ingester-f7fcbc9fc-2dk4d 1/1 Terminating 0 4d 10.52.10.50
We should be able to tolerate a single down ingester for queries; it looks like there might be a bug there.
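For reference, this is the behaviour we expect from the querier: fan out to every ingester in the replication set and succeed as long as no more than the tolerated number of them fail. A minimal sketch below, assuming illustrative names (Ingester, queryReplicationSet, maxErrors); it is not the actual Cortex code.

```go
package sketch

import (
	"context"
	"fmt"
)

// Ingester is an illustrative stand-in for the querier's ingester client.
type Ingester interface {
	Query(ctx context.Context, req string) ([]string, error)
}

// queryReplicationSet queries every ingester in the replication set and
// tolerates up to maxErrors failures (e.g. replication factor 3 allows 1)
// before giving up.
func queryReplicationSet(ctx context.Context, ingesters []Ingester, req string, maxErrors int) ([][]string, error) {
	var (
		results [][]string
		errs    []error
	)
	for _, ing := range ingesters {
		res, err := ing.Query(ctx, req)
		if err != nil {
			errs = append(errs, err)
			if len(errs) > maxErrors {
				return nil, fmt.Errorf("%d ingesters failed, more than the %d we can tolerate: %v", len(errs), maxErrors, errs)
			}
			continue
		}
		results = append(results, res)
	}
	return results, nil
}
```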
And this, people, is why you don't deploy on Fridays...
We have tests to show we can tolerate a single dead ingester; I manually tested this by deleting an ingester in dev, and the queries worked fine.
I then rolled out a new ingester and reproduced it quite nicely in dev. All the IPs that addrConn.createTransport reported were for the old ingesters, not the new ones. All the errors were "connection refused", and they appeared in the same order the ingesters were updated, one by one. I also checked the queriers' view of the ring; it was consistent with reality.
So it looks like (a) the ingester shuts down its gRPC server too early, (b) the querier code to tolerate one ingester outage is broken, and (c) the unit tests are wrong.
Starting to get to the bottom of this now
(So the only real bug is (a).)
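A minimal sketch of the kind of shutdown ordering that avoids (a): keep serving gRPC until the ingester has left the ring, and only then stop the server, draining in-flight RPCs rather than refusing connections mid-query. The Ring interface and shutdown function here are illustrative assumptions, not the code that actually landed in weaveworks/common.

```go
package sketch

import "google.golang.org/grpc"

// Ring is an illustrative stand-in for the hash-ring client.
type Ring interface {
	// Leave removes this ingester from the ring (and hands off its data).
	Leave() error
}

// shutdown keeps the gRPC server serving until the ingester has left the
// ring, then drains in-flight RPCs instead of closing the listener early.
func shutdown(server *grpc.Server, ring Ring) error {
	// 1. Leave the ring first, so queriers stop routing new queries here.
	if err := ring.Leave(); err != nil {
		return err
	}
	// 2. Only then stop the gRPC server; GracefulStop waits for in-flight
	//    RPCs to finish rather than dropping them.
	server.GracefulStop()
	return nil
}
```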
Fixes are in weaveworks/common#99, which was included in #870.
level=error ts=2018-07-13T19:09:18.433330722Z caller=engine.go:494 msg="error selecting series set" err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
I suspect gRPC connections from querier -> ingester got upset after an update...
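One way stale connections like that can be handled on the querier side is to prune any cached gRPC client connection whose address has dropped out of the ring, so gRPC stops retrying dead IPs and surfacing "all SubConns are in TransientFailure". The sketch below is illustrative only (connPool and removeStale are assumed names), not the fix described above.

```go
package sketch

import (
	"sync"

	"google.golang.org/grpc"
)

// connPool caches one gRPC client connection per ingester address.
type connPool struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

// removeStale closes and forgets any connection whose address is no longer
// part of the ring's current membership.
func (p *connPool) removeStale(ringAddrs map[string]struct{}) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for addr, conn := range p.conns {
		if _, ok := ringAddrs[addr]; !ok {
			_ = conn.Close()
			delete(p.conns, addr)
		}
	}
}
```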