How do I figure out what is causing the 123058 occurrences of etcd_http_failed_total on 192.168.225.40? (That etcd node is the leader of the 5 etcd nodes.)
@jolson490 One idea is to use something like wireshark or tcpdump to monitor the traffic and get a better idea of the source payload etc. Being that you have http traffic wireshark would give you a very good idea of what is going on. I hope that helps.
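For example, a capture along these lines (a minimal sketch; the interface name is an assumption, and 2379 is the default etcd client port) could be opened in Wireshark afterwards:
# Capture etcd client traffic to a pcap for later inspection (run as root on the etcd node).
# -i: capture interface (adjust for your host); -s 0: full packets; -w: output file.
sudo tcpdump -i eth0 -s 0 -w etcd-client.pcap 'tcp port 2379'
# In Wireshark, a display filter like http.response.code == 404 narrows to the failures.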
> How do I figure out what is causing the 123058 occurrences of etcd_http_failed_total on 192.168.225.40? (That etcd node is the leader of the 5 etcd nodes.)
I would first check whether any etcd v2 client requests have failed (those metrics come from v2 request failures) -- probably logged somewhere in the Kubernetes API server. Then look at the etcd server logs to see if they print any warnings.
If that metric still grows, I would also collect it over time, try to find the timestamps when error spikes happen, and see if the server prints any errors or warnings around those times.
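For example, graphing something like this in Prometheus (a sketch reusing the job="etcd" label from the queries later in this thread) would surface the spike timestamps per method and status code:
sum(increase(etcd_http_failed_total{job="etcd"}[5m])) BY (method, code)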
@jolson490 Any updates?
Thank you guys very much for the ideas.
Our K8s cluster and etcd cluster overall seem healthy and working (though we've also seen the HighNumberOfFailedProposals etcd-related prom rule fire on a couple of our clusters), so FWIW this isn't our top priority, but once we look into this further we'll definitely share any findings here.
By the way, I found it interesting that it's almost entirely the QGET requests that are failing (the GET requests nearly all succeed) - e.g.:
etcd_http_failed_total{code="404",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="DELETE",namespace="monitoring",service="etcd-k8s"} 19
etcd_http_failed_total{code="404",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="GET",namespace="monitoring",service="etcd-k8s"} 2
etcd_http_failed_total{code="404",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="QGET",namespace="monitoring",service="etcd-k8s"} 161,464
etcd_http_failed_total{code="412",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="PUT",namespace="monitoring",service="etcd-k8s"} 37
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="DELETE",namespace="monitoring",service="etcd-k8s"} 99
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="GET",namespace="monitoring",service="etcd-k8s"} 75,290
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="QGET",namespace="monitoring",service="etcd-k8s"} 493,036
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="PUT",namespace="monitoring",service="etcd-k8s"} 235
So 32.75% of the QGET requests in this cluster failed.
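To check whether the failures are confined to that one instance, a per-instance failure ratio for QGET can be graphed - a sketch using the same label names as the series above:
sum(rate(etcd_http_failed_total{job="etcd",method="QGET"}[5m])) BY (instance)
  / sum(rate(etcd_http_received_total{job="etcd",method="QGET"}[5m])) BY (instance)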
@jolson490 Sure. QGET is a quorum read request, which can fail for various reasons. Server logs from 192.168.225.40:2379 would be helpful. That node might have been isolated.
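For reference, a QGET is a v2 GET with ?quorum=true, so it can be reproduced by hand - a rough sketch (the key path is a placeholder) that also checks whether the member can see a quorum:
# Reproduce a quorum read against the suspect member; a missing key
# returns HTTP 404 with errorCode 100 ("Key not found").
curl -v 'http://192.168.225.40:2379/v2/keys/some-key?quorum=true'
# Check whether the member can reach a quorum of its peers (etcd v2 tooling).
etcdctl --endpoints=http://192.168.225.40:2379 cluster-health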
Please reopen or create a new issue if you still need help.
Thanks.
Follow-up for anyone who finds this later: we used Wireshark to trace the "failures" to Calico polling etcd for network metadata, some elements of which didn't exist (e.g. bgp as_num), which produced a legit 404. So, if you have an app that legitimately generates 404s when looking for data that may or may not exist, you may need to tailor this alert to ignore that normal behavior (one way to do that is sketched after the trace below).
> GET /v2/keys/calico/bgp/v1/host/[redacted]/as_num?quorum=true&recursive=false&sorted=false HTTP/1.1
> Host: 127.0.0.1:2379
> User-Agent: Go-http-client/1.1
> Accept-Encoding: gzip
< HTTP/1.1 404 Not Found
< Content-Type: application/json
< X-Etcd-Cluster-Id: [redacted]
< X-Etcd-Index: 1118
< Date: Thu, 14 Jun 2018 20:19:33 GMT
< Content-Length: 125
<
< {"errorCode":100,"message":"Key not found","cause":"/calico/bgp/v1/host/[redacted]/as_num","index":1118}
@gyuho @jolson490 can you reopen please? Having the same issue with
etcd_version = "3.2.17"
k8s_version = "1.10.2"
This is reproducible across multiple clusters in different regions. (I was also asking on SO.)
I've also seen HighNumberOfFailedProposals.
The high error rate for QGET has been constant for weeks.
@gyuho etcd server logs from the affected node do not show anything except events like:
July 16th 2018, 16:17:28.925 | 2018-07-16 13:17:28.925363 I | mvcc: store.index: compact 11115769
July 16th 2018, 16:17:28.928 | 2018-07-16 13:17:28.928357 I | mvcc: finished scheduled compaction at 11115769 (took 1.749899ms)
or
July 16th 2018, 11:12:26.122 | 2018-07-16 08:12:26.122639 I | wal: segmented wal file /var/etcd/data/member/wal/00000000000001c9-0000000000f3baf3.wal is created
July 16th 2018, 11:12:55.433 | 2018-07-16 08:12:55.433334 I | pkg/fileutil: purged file /var/etcd/data/member/wal/00000000000001c4-0000000000f16593.wal successfully
I checked 24h back - no warnings/errors. Other etcd nodes - also no warnings/errors.
The affected node is StateLeader; the others show a lower error rate for QGET.
Per etcd PR 9706, the etcd_http metrics (and thus the prom rules HighNumberOfFailedHTTPRequests & HTTPRequestsSlow) are deprecated in etcd3.
We're planning on customizing our own copy of HighNumberOfFailedGRPCRequests to change the following line...:
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",job="etcd"}[5m])) BY (grpc_service, grpc_method)
...to this:
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",grpc_code!="Unavailable",job="etcd"}[5m])) BY (grpc_service, grpc_method)
FWIW we've occasionally had HighNumberOfFailedProposals fire in some of our K8s clusters, but it hasn't happened for a couple of months.
So I don't think there's anything here that warrants re-opening this etcd issue. I don't know if you (@max-lobur) are also using Calico, but I haven't looked into this enough to find any reason to suspect these metrics or prom rules are wrong/invalid.
Super helpful, many thanks!! Yes, we're on Calico and etcd3; I'm happy to replace these alerts.
Hello. Can someone please help me figure out how to determine the cause/source of each failure/error for the etcd_http_failed_total metric? (I know this probably isn't a bug in etcd - please let me know if there's somewhere else I should redirect this.)
In AWS I'm running a Kubernetes cluster (for masters & agents) and an etcd cluster. I'm running kube-prometheus, and an alert is firing for the HighNumberOfFailedHTTPRequests rule - reference info:
In alertmanager I get the following (which shows 0.327586206896552 as the value).
In Prometheus I did the following 2 queries:
sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) = 0.07037037037037037 for QGET
sum(rate(etcd_http_received_total{job="etcd"}[5m])) BY (method) = 0.21481481481481482 for QGET
(0.07037037037037037 / 0.21481481481481482 = 0.327586206896552)
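(The two queries can also be combined into one expression that yields the alert value directly - a sketch:)
sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method)
  / sum(rate(etcd_http_received_total{job="etcd"}[5m])) BY (method)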
And when I simply query for etcd_http_failed_total:
How do I figure out what is causing the 123058 occurrences of etcd_http_failed_total on 192.168.225.40? (That etcd node is the leader of the 5 etcd nodes.)
More info:
I use custom Terraform cloud-config scripts to set up the etcd service on each etcd node - here's a snippet:
On the etcd leader node, here's the beginning of the output from running journalctl -u etcd-member:
To give an example from a slightly different angle:
But regarding the diff command above, the challenge I'm trying to solve is that (even with the etcd log level set to debug) I can't find anything telling me which HTTP requests resulted in an error. (I did also try looking in the logs of the rkt container that is started by etcd-wrapper, but I didn't find anything there either.)
Any suggestions would be greatly appreciated - one live-capture idea is sketched below. Thanks!
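One live-capture approach for exactly that problem (a sketch; it assumes plaintext client traffic and that the ngrep tool is installed, as an alternative to a full pcap):
# Print every HTTP request/response on the etcd client port, in order;
# the request lines immediately preceding a "404 Not Found" are the failures.
sudo ngrep -d any -W byline 'HTTP' 'tcp port 2379'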