etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcd 3.2.1 Troubling CPU Usage Pattern #8491

Closed iherbmatt closed 7 years ago

iherbmatt commented 7 years ago

Hello,

We recently upgraded to kube-aws 0.9.8 with etcd 3.2.1, and have also tested etcd 3.2.6. Both versions were installed as a 3-node etcd cluster, with each node having 2 cores and 8GB of RAM.

What's troubling is that we are running only a single application on the cluster, yet etcd is using more and more CPU as time goes by. Here is a sample covering the last week, from cluster start-up to now:

[screenshot: CPU usage graph for the past week, rising steadily from cluster start-up]

As you can see, CPU usage does not fluctuate; it has only increased steadily over the last few days. This is troubling because our older clusters running etcd 3.1.3 show the same pattern, only increasing faster. We figured we would test a cluster using etcd 3.2.1 to see whether that would fix the problem, but it doesn't; it just postpones the inevitable: an unstable cluster.

To recover, we need to terminate the nodes and let them rebuild and resync with the other members, or reboot them.

We created the K8s cluster with the following etcd configuration:

- 3 etcd nodes, m4.large (2 cores, 8GB RAM)
- 50GB root volume [general-purpose SSD (gp2)]
- 200GB data volume [general-purpose SSD (gp2)]
- auto-recovery: true
- auto-snapshot: true
- encrypted volumes: true

Could somebody please help us with this?

Thank you,

Matt

redbaron commented 7 years ago

@iherbmatt, there is a report about an odd memory usage pattern in https://github.com/coreos/etcd/issues/8472. Do you see similar behavior on your setup?

iherbmatt commented 7 years ago

Within a few days of startup, most RAM is in use. Some of it is moved to cached memory, while the rest is taken by other processes, mostly Docker and etcd.


heyitsanthony commented 7 years ago

@iherbmatt what is the RPC rate over time for 3.2? What is the memory utilization over time? Are there any errors or warnings in the etcd server logs?

Also, please use >=3.1.5 for the 3.1 series; earlier 3.1 releases have a memory leak on linearizable reads.
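
For reference, both of the figures asked about here are exposed on etcd's Prometheus `/metrics` endpoint, served on the client port. A minimal sketch for sampling them, assuming a member reachable at `127.0.0.1:2379`:

```go
// metrics_probe.go: sample the RPC and memory counters from etcd's
// Prometheus endpoint. The endpoint address is an assumption.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// etcd serves Prometheus-format metrics on its client port.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// grpc_server_handled_total counts completed RPCs per method;
		// process_resident_memory_bytes tracks the server's RSS.
		if strings.HasPrefix(line, "grpc_server_handled_total") ||
			strings.HasPrefix(line, "process_resident_memory_bytes") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Diffing two samples of the `grpc_server_handled_total` counters gives the RPC rate over time; graphing `process_resident_memory_bytes` gives memory utilization over time.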

iherbmatt commented 7 years ago

Hi Anthony,

I'm not extremely familiar with etcd. How can I get this information for you? Also, I'm wondering if the automatic snapshots are causing the issue. I'm testing another cluster with automatic snapshots and automatic recovery disabled. For about 2 hours the CPUs of all 3 etcd nodes have been hovering around 1%; previously they were at about 5% (v3.2.1) and ~20% (v3.1.3).

Thanks,

Matt


heyitsanthony commented 7 years ago

It looks like kube-aws is taking snapshots every minute on every member, according to https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-etcd#L231

This is about 90x more frequent than the etcd-operator default policy and might account for the increased CPU load. It could be triggering #8009, where the etcd backend needs to be defragmented when there are frequent snapshots.
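
For reference, the defragmentation mentioned here can be issued per member through the maintenance API (the same operation `etcdctl defrag` performs). A minimal clientv3 sketch, assuming a single member at `127.0.0.1:2379`:

```go
// defrag.go: issue a defragment pass against one member via clientv3.
// The endpoint below is an assumption, not a kube-aws default.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Defragment runs against a single endpoint at a time, so each
	// member of the cluster must be defragmented individually.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if _, err := cli.Defragment(ctx, "http://127.0.0.1:2379"); err != nil {
		log.Fatal(err)
	}
	log.Println("defragment completed")
}
```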

iherbmatt commented 7 years ago

When I disable automatic snapshots and disaster recovery, the CPU remains around 1-1.5%. There is clearly a bug in that logic somewhere.

Thank you!


flah00 commented 7 years ago

Based on input from @heyitsanthony, I updated userdata/cloud-config-etcd so that backups run every 5m instead of every 1m. CPU usage has stabilized.
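
For illustration, the cadence change amounts to widening the interval of a periodic snapshot loop like the sketch below; the endpoint and output path are assumptions, not the actual kube-aws unit:

```go
// snapshot_loop.go: a sketch of the adjusted backup cadence, taking one
// clientv3 snapshot every 5 minutes instead of every minute.
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(5 * time.Minute) { // was every 1 minute
		if err := save(cli, "/var/lib/etcd/backup.db"); err != nil {
			log.Println("snapshot failed:", err)
		}
	}
}

// save streams a full backend snapshot from the member to a local file.
func save(cli *clientv3.Client, path string) error {
	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, rc)
	return err
}
```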

xiang90 commented 7 years ago

Thanks for the update. Closing this issue.