etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcd_http_failed_total: how to find each error #9596

Closed jolson490 closed 6 years ago

jolson490 commented 6 years ago

Hello. Can someone please help me figure out how to determine the cause/source of each failure/error for the etcd_http_failed_total metric? (I know this probably isn't a bug in etcd - please let me know if there's somewhere else I should redirect this.)

In AWS I'm running a Kubernetes cluster (masters & agents) and an etcd cluster. I'm running kube-prometheus, and an alert is firing for the HighNumberOfFailedHTTPRequests rule - reference info:

In Alertmanager I get the following (which shows 0.327586206896552 as the value): [screenshot]

In Prometheus I ran the following 2 queries: [screenshots]

And when I simply query for etcd_http_failed_total: [screenshot]
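
For reference (a sketch only - the Prometheus service address and the exact expressions are illustrative, not necessarily the ones in my screenshots), queries along these lines can be run against the Prometheus HTTP API:

    # Per-method/per-code failure rate on the suspect node over the last 5 minutes
    curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
      --data-urlencode 'query=sum(rate(etcd_http_failed_total{instance="192.168.225.40:2379"}[5m])) by (method, code)'

    # Raw counter values, equivalent to querying etcd_http_failed_total in the UI
    curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
      --data-urlencode 'query=etcd_http_failed_total{instance="192.168.225.40:2379"}'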

How do I figure out what is causing the 123058 occurrences of etcd_http_failed_total on 192.168.225.40? (That etcd node is the leader of the 5 etcd nodes.)


More info:

I use custom Terraform cloud-config scripts to set up the etcd service on each etcd node - here's a snippet:

            Environment="ETCD_IMAGE_TAG=v3.3"
            ExecStart=/usr/lib/coreos/etcd-wrapper \
              --name ${InstanceName} \
              --listen-client-urls=http://${InstanceIP}:2379 \
              --advertise-client-urls=http://${InstanceIP}:2379 \
              --listen-peer-urls=http://${InstanceIP}:2380 \
              --initial-advertise-peer-urls=http://${InstanceIP}:2380 \
              --initial-cluster=${EtcdInitialCluster} \
              --initial-cluster-state=new \
              --debug

On the etcd leader node, here's the beginning of the output from running journalctl -u etcd-member:

-- Logs begin at Wed 2017-04-05 15:54:37 UTC, end at Fri 2018-04-20 16:22:35 UTC. --
Mar 25 01:44:13 ip-192-168-225-40 systemd[1]: Starting etcd (System Application Container)...
Mar 25 01:44:13 ip-192-168-225-40 etcd-wrapper[937]: ++ id -u etcd
Mar 25 01:44:13 ip-192-168-225-40 etcd-wrapper[937]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --mount volume=coreos-notify,target=/run/systemd/notify --volume coreos-notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --volume coreos-data-dir,kind=host,source=/var/lib/etcd,readOnly=false --volume coreos-etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume coreos-usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume coreos-etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume coreos-etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=coreos-data-dir,target=/var/lib/etcd --mount volume=coreos-etc-ssl-certs,target=/etc/ssl/certs --mount volume=coreos-usr-share-certs,target=/usr/share/ca-certificates --mount volume=coreos-etc-hosts,target=/etc/hosts --mount volume=coreos-etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.3 --user=232 -- --name etcd-1 --listen-client-urls=http://192.168.225.40:2379 --advertise-client-urls=http://192.168.225.40:2379 --listen-peer-urls=http://192.168.225.40:2380 --initial-advertise-peer-urls=http://192.168.225.40:2380 --initial-cluster=etcd-1=http://192.168.225.40:2380,etcd-2=http://192.168.225.41:2380,etcd-3=http://192.168.225.42:2380,etcd-4=http://192.168.225.112:2380,etcd-5=http://192.168.225.113:2380 --initial-cluster-state=new
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.502148 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556851 W | pkg/flags: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556863 W | pkg/flags: unrecognized environment variable ETCD_USER=etcd
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556873 W | pkg/flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.3
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556902 I | etcdmain: etcd Version: 3.3.2
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556910 I | etcdmain: Git SHA: c9d46ab37
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556916 I | etcdmain: Go Version: go1.9.4
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556923 I | etcdmain: Go OS/Arch: linux/amd64
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.556931 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.557002 W | etcdmain: found invalid file/dir data under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.557153 I | embed: listening for peers on http://192.168.225.40:2380
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.557238 I | embed: listening for client requests on 192.168.225.40:2379
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625289 I | etcdserver: name = etcd-1
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625314 I | etcdserver: data dir = /var/lib/etcd
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625328 I | etcdserver: member dir = /var/lib/etcd/member
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625335 I | etcdserver: heartbeat = 100ms
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625341 I | etcdserver: election = 1000ms
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625347 I | etcdserver: snapshot count = 100000
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625360 I | etcdserver: advertise client URLs = http://192.168.225.40:2379
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625368 I | etcdserver: initial advertise peer URLs = http://192.168.225.40:2380
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.625401 I | etcdserver: initial cluster = etcd-1=http://192.168.225.40:2380,etcd-2=http://192.168.225.41:2380,etcd-3=http://192.168.225.42:2380,etcd-4=http://192.168.225.112:2380,etcd-5=http://192.168.225.113:2380
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.643802 I | etcdserver: starting member caee3cd26c1e54ba in cluster 8ec3f139e70bd26a
Mar 25 01:44:21 ip-192-168-225-40 etcd-wrapper[937]: 2018-03-25 01:44:21.643844 I | raft: caee3cd26c1e54ba became follower at term 0

To give an example, from a slightly different angle:

core@ip-192-168-225-40 ~/k $ curl http://ip-192-168-225-40:2379/config/local/log -XPUT -d '{"Level":"DEBUG"}'

core@ip-192-168-225-40 ~/k $ journalctl -u etcd-member > etcd-log-1.txt

core@ip-192-168-225-40 ~/k $ curl --silent -l http://ip-192-168-225-40:2379/metrics | grep 'etcd_http_failed_total{code="404",method="GET"}'
etcd_http_failed_total{code="404",method="GET"} 2

# How to cause an error:
core@ip-192-168-225-40 ~/k $ curl http://ip-192-168-225-40:2379/v2/keys/does_Not_Exist
{"errorCode":100,"message":"Key not found","cause":"/does_Not_Exist","index":3769}

# Verify the previous curl cmd caused the error count to go up by 1:
core@ip-192-168-225-40 ~/k $ curl --silent -l http://ip-192-168-225-40:2379/metrics | grep 'etcd_http_failed_total{code="404",method="GET"}'
etcd_http_failed_total{code="404",method="GET"} 3

core@ip-192-168-225-40 ~/k $ journalctl -u etcd-member > etcd-log-2.txt

core@ip-192-168-225-40 ~/k $ diff etcd-log-1.txt etcd-log-2.txt | egrep "Logs begin at|does_Not_Exist"
< -- Logs begin at Fri 2018-04-13 22:49:48 UTC, end at Fri 2018-04-20 16:44:31 UTC. --
> -- Logs begin at Fri 2018-04-13 22:49:48 UTC, end at Fri 2018-04-20 16:44:58 UTC. --
> Apr 20 16:44:49 ip-192-168-225-40 etcd-wrapper[897]: 2018-04-20 16:44:49.019552 D | etcdserver/api/v2http: [GET] /v2/keys/does_Not_Exist remote:192.168.225.40:52486

But regarding the diff command above, the challenge I'm trying to solve is that, even with the etcd log level set to debug, I can't find anything that tells me which HTTP requests resulted in an error. (I also tried looking in the logs of the rkt container started by etcd-wrapper, but I didn't find anything there either.)

Any suggestions would be greatly appreciated. thanks!

hexfusion commented 6 years ago

How do I figure out what is causing the 123058 occurrences of etcd_http_failed_total on 192.168.225.40? (That etcd node is the leader of the 5 etcd nodes.)

@jolson490 One idea is to use something like Wireshark or tcpdump to monitor the traffic and get a better idea of the source, payload, etc. Since your traffic is plain HTTP, Wireshark would give you a very good idea of what is going on. I hope that helps.
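
For example (a rough sketch - the interface, file path, and Wireshark display filter are illustrative), you could capture the client-port traffic on the affected node and inspect it offline:

    # Capture etcd client traffic (port 2379) on the affected node for offline analysis
    sudo tcpdump -i any -s 0 -w /tmp/etcd-2379.pcap 'tcp port 2379'

    # Then copy /tmp/etcd-2379.pcap to a workstation, open it in Wireshark,
    # and filter for error responses, e.g. with the display filter: http.response.code >= 400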

gyuho commented 6 years ago

How do I figure out what is causing the 123058 occurrences of etcd_http_failed_total on 192.168.225.40? (That etcd node is the leader of the 5 etcd nodes.)

I would first check whether any etcd v2 client requests have failed (those metrics are from v2 request failures) -- that is probably logged somewhere in the Kubernetes API server. Then look at the etcd server logs to see if they print any warnings.
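
For example (a sketch, using the unit name from the output above), warning and error lines can be pulled out of the etcd server logs directly, since etcd's log lines carry a severity letter before the pipe:

    # Show only warning (W) and error (E) lines from the etcd server log
    journalctl -u etcd-member | egrep ' (W|E) \| '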

On the etcd leader node, here's the beginning of the output from running

If that metric still grows, I would also collect it over time, try to find the timestamps when error spikes happen, and see if the server prints any errors or warnings around those times.
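
For instance (just a sketch - the Prometheus address and time range are illustrative), a range query over the counter's hourly growth makes spikes easy to line up against the server logs:

    # Hourly growth of the failure counters over one day, per method and code
    curl -sG 'http://prometheus.monitoring:9090/api/v1/query_range' \
      --data-urlencode 'query=sum(increase(etcd_http_failed_total{instance="192.168.225.40:2379"}[1h])) by (method, code)' \
      --data-urlencode 'start=2018-04-19T00:00:00Z' \
      --data-urlencode 'end=2018-04-20T00:00:00Z' \
      --data-urlencode 'step=1h'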

gyuho commented 6 years ago

@jolson490 Any updates?

jolson490 commented 6 years ago

Thank you guys very much for the ideas. Our K8s cluster and etcd cluster overall seem healthy and working (though we've also seen the HighNumberOfFailedProposals etcd-related prom rule fire on a couple of our clusters), so FWIW this isn't our top priority, but once we look into this further, if we find anything we'll definitely share the findings here.

By the way, I found it interesting that it's only the QGET requests that are failing (all the GET requests work) - e.g.:

etcd_http_failed_total{code="404",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="DELETE",namespace="monitoring",service="etcd-k8s"}   19
etcd_http_failed_total{code="404",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="GET",namespace="monitoring",service="etcd-k8s"}  2
etcd_http_failed_total{code="404",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="QGET",namespace="monitoring",service="etcd-k8s"} 161,464
etcd_http_failed_total{code="412",endpoint="api",instance="192.168.225.40:2379",job="etcd",method="PUT",namespace="monitoring",service="etcd-k8s"}  37

etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="DELETE",namespace="monitoring",service="etcd-k8s"}    99
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="GET",namespace="monitoring",service="etcd-k8s"}   75,290
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="QGET",namespace="monitoring",service="etcd-k8s"}  493,036
etcd_http_received_total{endpoint="api",instance="192.168.225.40:2379",job="etcd",method="PUT",namespace="monitoring",service="etcd-k8s"}   235

So 32.75% of the QGET requests in this cluster failed.
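
(That ratio can also be computed directly in Prometheus - a sketch, with the Prometheus address being illustrative:)

    # Fraction of QGET requests that failed on this instance (~0.3275 for the numbers above)
    curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
      --data-urlencode 'query=sum(etcd_http_failed_total{instance="192.168.225.40:2379",method="QGET"}) / sum(etcd_http_received_total{instance="192.168.225.40:2379",method="QGET"})'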

gyuho commented 6 years ago

@jolson490 Sure.

QGET is a quorum read request, which can fail for various reasons. Server logs from 192.168.225.40:2379 would be helpful. That node might have been isolated.
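
For reference, on the v2 HTTP API a quorum read is simply a GET with ?quorum=true, so one can be reproduced by hand (the key name here is made up):

    # A quorum (linearizable) read; these show up as method="QGET" in the HTTP metrics
    curl 'http://192.168.225.40:2379/v2/keys/some_key?quorum=true'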

gyuho commented 6 years ago

Please reopen or create a new issue if you still need help.

Thanks.

dbanttari commented 6 years ago

Follow-up for anyone who finds this later: we used Wireshark to trace the "failures" to Calico polling etcd for network metadata, some elements of which didn't exist (e.g. bgp as_num), which produced a legitimate 404. So, if you have an app that legitimately creates 404s when looking for data that may or may not exist, you may need to tailor this alert to ignore that normal behavior.

> GET /v2/keys/calico/bgp/v1/host/[redacted]/as_num?quorum=true&recursive=false&sorted=false HTTP/1.1
> Host: 127.0.0.1:2379
> User-Agent: Go-http-client/1.1
> Accept-Encoding: gzip

< HTTP/1.1 404 Not Found
< Content-Type: application/json
< X-Etcd-Cluster-Id: [redacted]
< X-Etcd-Index: 1118
< Date: Thu, 14 Jun 2018 20:19:33 GMT
< Content-Length: 125
<
< {"errorCode":100,"message":"Key not found","cause":"/calico/bgp/v1/host/[redacted]/as_num","index":1118}
max-lobur commented 6 years ago

@gyuho @jolson490 can you reopen please? Having the same issue with

etcd_version = "3.2.17"
k8s_version = "1.10.2"

This is reproducible across multiple clusters in different regions. (I was also asking on SO.)

I've also seen HighNumberOfFailedProposals. The high error rate for QGET has been constant for weeks.

max-lobur commented 6 years ago

@gyuho etcd server logs from the affected node do not show anything except events like:

July 16th 2018, 16:17:28.925 | 2018-07-16 13:17:28.925363 I | mvcc: store.index: compact 11115769
July 16th 2018, 16:17:28.928 | 2018-07-16 13:17:28.928357 I | mvcc: finished scheduled compaction at 11115769 (took 1.749899ms)

or

July 16th 2018, 11:12:26.122 | 2018-07-16 08:12:26.122639 I | wal: segmented wal file /var/etcd/data/member/wal/00000000000001c9-0000000000f3baf3.wal is created
July 16th 2018, 11:12:55.433 | 2018-07-16 08:12:55.433334 I | pkg/fileutil: purged file /var/etcd/data/member/wal/00000000000001c4-0000000000f16593.wal successfully

I checked 24h back - no warnings/errors. The other etcd nodes also show no warnings/errors.

The affected node is the StateLeader; the others show a lower error rate for QGET.


jolson490 commented 6 years ago

Per etcd PR 9706, the etcd_http metrics (and thus prom rules HighNumberOfFailedHTTPRequests & HTTPRequestsSlow) are deprecated in etcd3.

We're planning on customizing our own copy of HighNumberOfFailedGRPCRequests to change the following line:

    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",job="etcd"}[5m])) BY (grpc_service, grpc_method)

to this:

    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",grpc_code!="Unavailable",job="etcd"}[5m])) BY (grpc_service, grpc_method)

FWIW we've occasionally had HighNumberOfFailedProposals fire in some of our K8s clusters, but it hasn't happened for a couple months.

So I don't think there's anything here that warrants this etcd issue being re-opened - I don't know if you (@max-lobur) are also using calico, but I haven't looked into this enough to formulate any reason to suspect these metrics or prom rules are wrong/invalid.

max-lobur commented 6 years ago

Super helpful! Many thanks!! Yes, we're on Calico and etcd3; I'm happy to replace these alerts.