bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

Investigate shared Loki performance and stability issues #193

Closed (i5okie closed 3 months ago)

i5okie commented 5 months ago

Digital Trust Services Loki is experiencing intermittent stability issues with some queries. A brief look suggests that the query cache and the Loki querier need further tuning.

The loki-querier pods go into CrashLoopBackOff, causing queries to fail or return incomplete results.

Initially, I found that Memcached transactions failed with the error "SERVER_ERROR object too large for cache". I believe I've resolved that issue by adding -I 32m to the Memcached arguments, which raises the maximum item size from the 1 MB default to 32 MB.

However, there are still errors that should be investigated further.

Seeing the following errors from the memcachedchunks pod:

Failed to write, and not due to blocking: Connection reset by peer
Failed to write, and not due to blocking: Connection reset by peer
Failed to write, and not due to blocking: Connection reset by peer
Failed to write, and not due to blocking: Broken pipe

And errors from the loki-querier pod:

level=error ts=2024-06-06T20:41:46.457205752Z caller=frontend_processor.go:151 component=querier msg="error processing requests" err=EOF
level=error ts=2024-06-06T20:41:46.457309899Z caller=frontend_processor.go:74 component=querier msg="error processing requests" address=x.x.247.187:9095 err="rpc error: code = Canceled desc = context canceled"
level=error ts=2024-06-06T20:41:46.457312625Z caller=frontend_processor.go:74 component=querier msg="error processing requests" address=x.x.247.187:9095 err="rpc error: code = Canceled desc = context canceled"
level=error ts=2024-06-06T20:41:46.457323842Z caller=gateway_client.go:515 index-store=tsdb-2024-04-23 msg="client do failed for instance dns:///loki-index-gateway:9095" err="rpc error: code = Canceled desc = context canceled"
level=error ts=2024-06-06T20:41:46.457328729Z caller=frontend_processor.go:74 component=querier msg="error processing requests" address=x.x.247.187:9095 err="rpc error: code = Canceled desc = context canceled"
level=error ts=2024-06-06T20:41:46.457338177Z caller=gateway_client.go:515 index-store=tsdb-2024-04-23 msg="client do failed for instance dns:///loki-index-gateway:9095" err="rpc error: code = Canceled desc = context canceled"
level=error ts=2024-06-06T20:41:46.457345526Z caller=gateway_client.go:515 index-store=tsdb-2024-04-23 msg="client do failed for instance dns:///loki-index-gateway:9095" err="rpc error: code = Canceled desc = context canceled"
<queries>
level=error ts=2024-06-06T20:41:46.457411169Z caller=frontend_processor.go:151 component=querier msg="error processing requests" err=EOF
level=error ts=2024-06-06T20:41:46.457408078Z caller=frontend_processor.go:151 component=querier msg="error processing requests" err=EOF
level=warn ts=2024-06-06T20:41:50.775536323Z caller=pool.go:250 index-store=tsdb-2024-04-23 msg="removing index gateway failing healthcheck" addr=dns:///loki-index-gateway:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2024-06-06T20:42:28.68163703Z caller=resolver.go:87 msg="failed to lookup IP addresses" host=loki-memcachedindexqueries err="lookup loki-memcachedindexqueries on x.x.0.10:53: no such host"
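
To see which errors dominate, the querier log lines above could be tallied by caller and error message. The following is a minimal sketch, not part of the original investigation. It assumes the lines are logfmt-formatted as shown, and `parse_logfmt` and `summarize` are hypothetical helper names introduced here for illustration.

```python
import shlex
from collections import Counter


def parse_logfmt(line):
    """Parse one logfmt line into a dict; return {} if it cannot be parsed."""
    try:
        # shlex keeps quoted values such as msg="error processing requests" together.
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes or similar: treat the line as unparseable.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def summarize(lines):
    """Count error and warn lines, grouped by (caller, err or reason)."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") in ("error", "warn"):
            cause = fields.get("err") or fields.get("reason", "?")
            counts[(fields.get("caller", "?"), cause)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): kubectl logs <querier-pod> | python summarize_logs.py
    for (caller, cause), n in summarize(sys.stdin).most_common():
        print(f"{n:5d}  {caller}  {cause}")
```

Lines that are not logfmt, such as the `<queries>` placeholder, yield no `level` field and are skipped.
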
i5okie commented 3 months ago

Loki has been re-deployed with adjusted resource requests and limits.