tomwilkie closed this issue 6 years ago.
Deleting/recreating all the queriers caused it to recover, temporarily.
Adding GRPC_GO_LOG_SEVERITY_LEVEL=INFO to the pod got me:
WARNING: 2018/07/13 19:46:48 grpc: addrConn.createTransport failed to connect to {10.52.10.50:9095 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.52.10.50:9095: connect: connection refused". Reconnecting...
Which corresponds with the ingester that was exiting as part of the rollout:
ingester-f7fcbc9fc-2dk4d 1/1 Terminating 0 4d 10.52.10.50
We should be able to tolerate a single down ingester for queries; it looks like there might be a bug there.
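For reference, this is the behaviour we expect from the querier: fan out to every ingester in the replication set and succeed as long as no more than the tolerated number of them fail. A minimal sketch below, assuming illustrative names (Ingester, queryReplicationSet, maxErrors); it is not the actual Cortex code.

```go
package sketch

import (
	"context"
	"fmt"
)

// Ingester is an illustrative stand-in for the querier's ingester client.
type Ingester interface {
	Query(ctx context.Context, req string) ([]string, error)
}

// queryReplicationSet queries every ingester in the replication set and
// tolerates up to maxErrors failures (e.g. replication factor 3 allows 1)
// before giving up.
func queryReplicationSet(ctx context.Context, ingesters []Ingester, req string, maxErrors int) ([][]string, error) {
	var (
		results [][]string
		errs    []error
	)
	for _, ing := range ingesters {
		res, err := ing.Query(ctx, req)
		if err != nil {
			errs = append(errs, err)
			if len(errs) > maxErrors {
				return nil, fmt.Errorf("%d ingesters failed, more than the %d we can tolerate: %v", len(errs), maxErrors, errs)
			}
			continue
		}
		results = append(results, res)
	}
	return results, nil
}
```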
And this, people, is why you don't deploy on Fridays...
We have tests to show we can tolerate a single dead ingester; I manually tested this by deleting an ingester in dev, and the queries worked fine.
I then rolled out a new ingester and reproduced it quite nicely in dev. All the IPs that addrConn.createTransport reported were for the old ingesters, not the new ones. All the errors were "connection refused", and they appeared in the same order the ingesters were updated, one by one. I also checked the queriers' view of the ring; it was consistent with reality.
So it looks like (a) the ingester shuts down its gRPC server too early, (b) the querier code to tolerate one ingester outage is broken, and (c) the unit tests are wrong.
Starting to get to the bottom of this now
(So the only real bug is (a).)
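A minimal sketch of the kind of shutdown ordering that avoids (a): keep serving gRPC until the ingester has left the ring, and only then stop the server, draining in-flight RPCs rather than refusing connections mid-query. The Ring interface and shutdown function here are illustrative assumptions, not the code that actually landed in weaveworks/common.

```go
package sketch

import "google.golang.org/grpc"

// Ring is an illustrative stand-in for the hash-ring client.
type Ring interface {
	// Leave removes this ingester from the ring (and hands off its data).
	Leave() error
}

// shutdown keeps the gRPC server serving until the ingester has left the
// ring, then drains in-flight RPCs instead of closing the listener early.
func shutdown(server *grpc.Server, ring Ring) error {
	// 1. Leave the ring first, so queriers stop routing new queries here.
	if err := ring.Leave(); err != nil {
		return err
	}
	// 2. Only then stop the gRPC server; GracefulStop waits for in-flight
	//    RPCs to finish rather than dropping them.
	server.GracefulStop()
	return nil
}
```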
Fixes are in weaveworks/common#99, which was included in #870.
level=error ts=2018-07-13T19:09:18.433330722Z caller=engine.go:494 msg="error selecting series set" err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
I suspect gRPC connections from querier -> ingester got upset after an update...
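One way stale connections like that can be handled on the querier side is to prune any cached gRPC client connection whose address has dropped out of the ring, so gRPC stops retrying dead IPs and surfacing "all SubConns are in TransientFailure". The sketch below is illustrative only (connPool and removeStale are assumed names), not the fix described above.

```go
package sketch

import (
	"sync"

	"google.golang.org/grpc"
)

// connPool caches one gRPC client connection per ingester address.
type connPool struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

// removeStale closes and forgets any connection whose address is no longer
// part of the ring's current membership.
func (p *connPool) removeStale(ringAddrs map[string]struct{}) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for addr, conn := range p.conns {
		if _, ok := ringAddrs[addr]; !ok {
			_ = conn.Close()
			delete(p.conns, addr)
		}
	}
}
```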