Closed Lokicity closed 6 years ago
sync duration of 23.383339906s, expected less than 1s
Is this using HDD?
etcd is I/O intensive, persisting Raft entries with application data on disk. You need better disk.
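A quick way to check whether the disk is the bottleneck is to benchmark fdatasync latency on the volume that holds the etcd data directory. A minimal sketch, assuming fio is installed and using /var/lib/etcd as a hypothetical data directory (the size/block-size values roughly approximate etcd's WAL write pattern):

```shell
#!/bin/sh
# Sketch: measure fdatasync latency where etcd persists its WAL.
# DIR is a hypothetical data directory; pass your real one as the first argument.
DIR="${1:-/var/lib/etcd}"
if command -v fio >/dev/null 2>&1; then
  # --fdatasync=1 syncs after every write, mimicking how etcd persists Raft
  # entries; check the reported fsync percentiles in the output (the 99th
  # percentile should stay well under ~10ms for etcd to be comfortable).
  fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory="$DIR" --size=22m --bs=2300 --name=etcd-fsync-check
else
  echo "fio not installed; skipping benchmark"
fi
```

On an EBS-only t2/m4 instance you would expect noticeably worse fsync percentiles than on local SSD.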
yeah, something doesn't look right with the sync duration.
Thank you for getting back so promptly. Is disk the only thing that can cause high sync duration and etcd timeouts? Why would etcd time out so early in cluster creation? Is Kubernetes stressing etcd too much? Do we have guidelines on CPU/memory for etcd/k8s-apiserver VMs?
Thanks a lot.
I am using either a t2.medium or an m4.large instance on AWS, and according to the AWS documentation those are EBS-only instances. I think that is slower than local SSD.
We have a general hardware setup guide https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md, but nothing Kubernetes-specific.
The metrics

grpc_server_handled_total{grpc_code="Unavailable",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 64
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 56
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 166
grpc_server_handled_total{grpc_code="Unknown",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 1
indicate that a number of v3 requests timed out, but this workload is not that intensive. etcd with a better disk should be able to handle much heavier workloads.
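For reference, those counters can be totaled straight from a dump of etcd's /metrics endpoint (Prometheus text format). A small sketch, using metrics.txt as a hypothetical file holding the lines quoted above:

```shell
#!/bin/sh
# metrics.txt stands in for a dump of etcd's /metrics endpoint (hypothetical file).
cat > metrics.txt <<'EOF'
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 64
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 56
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 166
grpc_server_handled_total{grpc_code="Unknown",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 1
EOF
# Total requests that failed with gRPC code "Unavailable"; the value is the
# last whitespace-separated field on each metric line.
awk '/grpc_code="Unavailable"/ { sum += $NF } END { print sum }' metrics.txt
# prints 313 (64 + 27 + 56 + 166)
```

Watching that total between scrapes shows whether the failures are a one-time burst at cluster creation or ongoing.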
For an EBS-backed instance, you might consider using Provisioned IOPS (SSD) volumes for better I/O performance. They offer storage with consistent, low-latency performance and are designed for I/O-intensive applications such as large relational or NoSQL databases.
My Kubernetes master and etcd are running on the same VM, and I am running into an intermittent issue with etcd timing out and "apiserver received an error that is not an metav1.Status: etcdserver: request timed out".
Attached are my etcd metrics. Can somebody point out whether there are any obvious problems in these metrics? E.g., do I need to switch etcd to a better disk?