Boostport / kubernetes-vault

Use Vault to store secrets for Kubernetes!
Apache License 2.0
941 stars 137 forks

kubernetes-vault hogs CPU #27

Closed: travis-bear closed this issue 7 years ago

travis-bear commented 7 years ago

After following the steps in the quickstart using a local minikube for the kubernetes installation, the kubernetes-vault deployment pinned the laptop CPU at 200%. This seems excessive.

F21 commented 7 years ago

Hmm, that's super strange. I did not see this problem while testing with minikube. Can you please list the specs of your machine?

travis-bear commented 7 years ago

minikube running in a VirtualBox VM.

This is running on a modern MacBook Pro with 16 GB RAM and SSD storage.

F21 commented 7 years ago

Can you also show me what your deployment yaml looks like? Also, what version of VirtualBox are you using? And by pinning the laptop CPU, do you mean the physical laptop CPU or the VirtualBox VM's CPU?

F21 commented 7 years ago

I just tried deploying kubernetes-vault into minikube and I am not able to reproduce this problem. In your case, does the deployment run successfully?

I am using Windows 10 64-bit and VirtualBox 5.1.18. Minikube is set to its default 2048 MB RAM and 2 CPU.

Specs of the machine are: Skylake i7 6700, 32GB RAM and SSD.

travis-bear commented 7 years ago

The deployment yaml is just the vanilla kubernetes-vault.yaml that comes in the quickstart guide with the vault token updated for my installation.

I had to go through the quickstart steps a few times before getting the certificates set up properly. (In spite of the good documentation, there were a few places I had to make educated guesses about what to copy/paste). But in the end, it seems to be working properly -- no errors showing in the kubernetes console, and the vault tokens properly distributed to the running sample apps.

When scaling back from 3 replicas to 1, the CPU use drops down to 100%.
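For reference, the scale-down was just the standard kubectl command (deployment name assumed to match the quickstart manifest):

$ kubectl scale deployment kubernetes-vault --replicas=1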

I can experiment with running minikube on Linux to see if the same sort of problem comes up.

F21 commented 7 years ago

What version of minikube are you using (I am currently using 0.18.0, but was testing on previous versions for dev)? Unfortunately, I don't have access to a Mac, so I am unable to try that out.

Can you also open an issue to point out parts of the documentation where there is ambiguity? I'd love to have those bits fixed.

Let me know how you go with trying it on Linux.

travis-bear commented 7 years ago

I finally had a chance to test this on Linux. Each running kubernetes-vault pod consumed 100% of a CPU core. Shutting down the kubernetes-vault pods caused the CPU use to return to normal.

travis@noir ~/git/github/kubernetes-vault $ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.0", GitCommit:"fff5156092b56e6bd60fff75aad4dc9de6b6ef37", GitTreeState:"clean", BuildDate:"2017-03-28T16:36:33Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.0", GitCommit:"fff5156092b56e6bd60fff75aad4dc9de6b6ef37", GitTreeState:"dirty", BuildDate:"2017-04-07T20:43:50Z", GoVersion:"go1.7.1", Compiler:"gc", Platform:"linux/amd64"}

minikube version: v0.18.0

travis@noir ~/git/github/kubernetes-vault $ cat /etc/linuxmint/info 
RELEASE=17.1
CODENAME=rebecca
EDITION="Cinnamon 64-bit"
DESCRIPTION="Linux Mint 17.1 Rebecca"
DESKTOP=Gnome
TOOLKIT=GTK
NEW_FEATURES_URL=http://www.linuxmint.com/rel_rebecca_cinnamon_whatsnew.php
RELEASE_NOTES_URL=http://www.linuxmint.com/rel_rebecca_cinnamon.php
USER_GUIDE_URL=help:linuxmint
GRUB_TITLE=Linux Mint 17.1 Cinnamon 64-bit

F21 commented 7 years ago

Thanks for testing this. I will find an older PC and test it out.

F21 commented 7 years ago

hey @travis-bear

What hardware are you using when testing on Linux? When are you seeing the massive CPU usage? Is it when spinning up the sample app pods, or just kubernetes-vault?

I set up minikube v0.18.0 on a Core i7 920 (ca 2009) with 12GB of RAM on Win 10 64-bit and the CPU never went above 20%.

F21 commented 7 years ago

Can you add heapster and grafana to your minikube instance?

Run:

$ minikube addons enable heapster
$ minikube addons open heapster

Once it's ready, go to the dashboard and drill down into a kubernetes-vault pod.

Here's what mine looks like: (screenshot attached)
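If the dashboard is awkward to get to, heapster's numbers should also be visible from the CLI (kubectl top relies on heapster being up, and the pod name below is a placeholder):

$ kubectl top pods
$ kubectl top pod <kubernetes-vault-pod-name>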

travis-bear commented 7 years ago

I have an 8-core Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz.

The problems start when running kubernetes-vault. It's not necessary to start the sample app to get the high CPU use. Heapster seems to indicate it's vault itself that is taking two full CPU cores. Makes me wonder if kubernetes-vault is slamming vault with large numbers of expensive calls.

(screenshot of per-pod CPU usage attached)

F21 commented 7 years ago

Hmm, that seems really odd. Kubernetes-vault should not talk to Vault, other than when first starting up to verify the token.

The only time it talks to Vault would be when an init container appears and kubernetes-vault requests a secret or when it needs to renew the token and certificates.

Here's my resource usage for the default namespace (vault + kubernetes-vault):

(screenshot attached)

Can you post the logs for Vault and for one of the Kubernetes-Vault pods (make sure you delete and re-apply the deployment first)? The logs are accessible from the Kubernetes dashboard.
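kubectl works too if the dashboard is a pain (pod names are placeholders):

$ kubectl logs <vault-pod-name>
$ kubectl logs <kubernetes-vault-pod-name>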

My Vault log:

2017-04-23T23:21:32.79212042Z token has already been authenticated with the CLI, so you can
2017-04-23T23:21:32.792125165Z immediately begin using the Vault CLI.
2017-04-23T23:21:32.792129816Z 
2017-04-23T23:21:32.792134429Z The only step you need to take is to set the following
2017-04-23T23:21:32.792139063Z environment variables:
2017-04-23T23:21:32.792146527Z 
2017-04-23T23:21:32.792151187Z     export VAULT_ADDR='http://0.0.0.0:8200'
2017-04-23T23:21:32.792155836Z 
2017-04-23T23:21:32.792160304Z The unseal key and root token are reproduced below in case you
2017-04-23T23:21:32.792164975Z want to seal/unseal the Vault or play with authentication.
2017-04-23T23:21:32.792169691Z 
2017-04-23T23:21:32.792174229Z Unseal Key: AS68ER9Nx14rS2LSu4Q0XERtO57voDFP7LOSuPDzJbo=
2017-04-23T23:21:32.792179002Z Root Token: vault-root-token
2017-04-23T23:21:32.792183619Z 
2017-04-23T23:21:32.792188144Z ==> Vault server started! Log data will stream in below:
2017-04-23T23:21:32.792192961Z 
2017-04-23T23:21:32.792211003Z 2017/04/23 23:21:32.741732 [INFO ] core: security barrier initialized: shares=1 threshold=1
2017-04-23T23:21:32.792595164Z 2017/04/23 23:21:32.741931 [INFO ] core: post-unseal setup starting
2017-04-23T23:21:32.792796996Z 2017/04/23 23:21:32.774685 [INFO ] core: loaded wrapping token key
2017-04-23T23:21:32.792995068Z 2017/04/23 23:21:32.778538 [INFO ] core: successfully mounted backend: type=generic path=secret/
2017-04-23T23:21:32.793135085Z 2017/04/23 23:21:32.778579 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017-04-23T23:21:32.793262339Z 2017/04/23 23:21:32.778753 [INFO ] core: successfully mounted backend: type=system path=sys/
2017-04-23T23:21:32.793340089Z 2017/04/23 23:21:32.781623 [INFO ] rollback: starting rollback manager
2017-04-23T23:21:32.793645282Z 2017/04/23 23:21:32.782516 [INFO ] expiration: restoring leases
2017-04-23T23:21:32.79374177Z 2017/04/23 23:21:32.788127 [INFO ] core: post-unseal setup complete
2017-04-23T23:21:32.793994718Z 2017/04/23 23:21:32.788551 [INFO ] core: root token generated
2017-04-23T23:21:32.794171078Z 2017/04/23 23:21:32.788562 [INFO ] core: pre-seal teardown starting
2017-04-23T23:21:32.794251996Z 2017/04/23 23:21:32.788570 [INFO ] core: cluster listeners not running
2017-04-23T23:21:32.794329256Z 2017/04/23 23:21:32.788765 [INFO ] rollback: stopping rollback manager
2017-04-23T23:21:32.794615219Z 2017/04/23 23:21:32.789006 [INFO ] core: pre-seal teardown complete
2017-04-23T23:21:32.794769593Z 2017/04/23 23:21:32.789107 [INFO ] core: vault is unsealed
2017-04-23T23:21:32.794875425Z 2017/04/23 23:21:32.789128 [INFO ] core: post-unseal setup starting
2017-04-23T23:21:32.794893114Z 2017/04/23 23:21:32.789200 [INFO ] core: loaded wrapping token key
2017-04-23T23:21:32.795186687Z 2017/04/23 23:21:32.789451 [INFO ] core: successfully mounted backend: type=generic path=secret/
2017-04-23T23:21:32.795317523Z 2017/04/23 23:21:32.789528 [INFO ] core: successfully mounted backend: type=system path=sys/
2017-04-23T23:21:32.795633113Z 2017/04/23 23:21:32.789539 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017-04-23T23:21:32.795756548Z 2017/04/23 23:21:32.790164 [INFO ] expiration: restoring leases
2017-04-23T23:21:32.796004705Z 2017/04/23 23:21:32.790203 [INFO ] rollback: starting rollback manager
2017-04-23T23:21:32.796025378Z 2017/04/23 23:21:32.790331 [INFO ] core: post-unseal setup complete

For one of my kubernetes-vault pods:

2017-04-24T00:25:44.369255396Z time="2017-04-24T00:25:44Z" level=debug msg="Discovered 3 nodes: [172.17.0.5 172.17.0.6 172.17.0.7]" 
2017-04-24T00:25:44.37910932Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [INFO] serf: EventMemberJoin: 172.17.0.6:45678 172.17.0.6" 
2017-04-24T00:25:44.380013387Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] memberlist: Initiating push/pull sync with: 172.17.0.5:45678" 
2017-04-24T00:25:44.381207478Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [INFO] serf: EventMemberJoin: 172.17.0.5:45678 172.17.0.5" 
2017-04-24T00:25:44.382171165Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] memberlist: Initiating push/pull sync with: 172.17.0.6:45678" 
2017-04-24T00:25:44.382572594Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] memberlist: Stream connection from=172.17.0.6:49702" 
2017-04-24T00:25:44.383317016Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] memberlist: Failed to join 172.17.0.7: dial tcp 172.17.0.7:45678: getsockopt: connection refused" 
2017-04-24T00:25:44.38739871Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [INFO] raft: Node at 172.17.0.6:45679 [Follower] entering Follower state (Leader: "")" 
2017-04-24T00:25:44.507764753Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] serf: messageJoinType: 172.17.0.5:45678" 
2017-04-24T00:25:44.70749638Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] serf: messageJoinType: 172.17.0.6:45678" 
2017-04-24T00:25:44.707543709Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] serf: messageJoinType: 172.17.0.5:45678" 
2017-04-24T00:25:44.907442674Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] serf: messageJoinType: 172.17.0.6:45678" 
2017-04-24T00:25:44.907490195Z time="2017-04-24T00:25:44Z" level=debug msg="2017/04/24 00:25:44 [DEBUG] serf: messageJoinType: 172.17.0.5:45678" 
2017-04-24T00:25:45.115407893Z time="2017-04-24T00:25:45Z" level=debug msg="2017/04/24 00:25:45 [DEBUG] serf: messageJoinType: 172.17.0.6:45678" 
2017-04-24T00:25:45.115458111Z time="2017-04-24T00:25:45Z" level=debug msg="2017/04/24 00:25:45 [DEBUG] serf: messageJoinType: 172.17.0.5:45678" 
2017-04-24T00:25:45.30876808Z time="2017-04-24T00:25:45Z" level=debug msg="2017/04/24 00:25:45 [DEBUG] serf: messageJoinType: 172.17.0.6:45678" 
2017-04-24T00:25:45.393598507Z time="2017-04-24T00:25:45Z" level=debug msg="2017/04/24 00:25:45 [DEBUG] raft-net: 172.17.0.6:45679 accepted connection from: 172.17.0.5:56156" 
2017-04-24T00:25:45.478791861Z time="2017-04-24T00:25:45Z" level=debug msg="2017/04/24 00:25:45 [DEBUG] raft: Node 172.17.0.6:45679 updated peer set (2): [172.17.0.5:45679 172.17.0.6:45679]" 
2017-04-24T00:25:45.589587567Z time="2017-04-24T00:25:45Z" level=debug msg="2017/04/24 00:25:45 [DEBUG] raft-net: 172.17.0.6:45679 accepted connection from: 172.17.0.5:56158" 
2017-04-24T00:25:46.57261593Z time="2017-04-24T00:25:46Z" level=debug msg="2017/04/24 00:25:46 [DEBUG] memberlist: Stream connection from=172.17.0.7:32924" 
2017-04-24T00:25:46.57283431Z time="2017-04-24T00:25:46Z" level=debug msg="2017/04/24 00:25:46 [INFO] serf: EventMemberJoin: 172.17.0.7:45678 172.17.0.7" 
2017-04-24T00:25:46.632195146Z time="2017-04-24T00:25:46Z" level=debug msg="2017/04/24 00:25:46 [DEBUG] raft: Node 172.17.0.6:45679 updated peer set (2): [172.17.0.7:45679 172.17.0.5:45679 172.17.0.6:45679]" 
2017-04-24T00:25:46.769646582Z time="2017-04-24T00:25:46Z" level=debug msg="2017/04/24 00:25:46 [DEBUG] serf: messageJoinType: 172.17.0.7:45678" 
2017-04-24T00:25:46.907037329Z time="2017-04-24T00:25:46Z" level=debug msg="2017/04/24 00:25:46 [DEBUG] serf: messageJoinType: 172.17.0.7:45678" 
2017-04-24T00:25:46.970277247Z time="2017-04-24T00:25:46Z" level=debug msg="2017/04/24 00:25:46 [DEBUG] serf: messageJoinType: 172.17.0.7:45678" 
2017-04-24T00:25:47.108071915Z time="2017-04-24T00:25:47Z" level=debug msg="2017/04/24 00:25:47 [DEBUG] serf: messageJoinType: 172.17.0.7:45678" 
F21 commented 7 years ago

Also, can you start Vault with -log-level=trace?
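If you're running the dev server from the quickstart, that would look something like the following (adjust to however your Vault container is actually started, e.g. by adding the flag to the container args in the deployment yaml):

$ vault server -dev -log-level=trace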

F21 commented 7 years ago

@travis-bear Were you able to get more info/logs for this?

travis-bear commented 7 years ago

I haven't had time to look at it yet. But this reproduces 100% of the time for me and a co-worker. We'll be looking more at this in the coming days. Hope to get some tcpdump information to see what kubernetes-vault is doing that's crushing vault so hard.
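Rough plan, assuming tcpdump is available inside the minikube VM (8200 being the Vault port from the quickstart):

$ minikube ssh
# inside the VM: capture traffic to and from Vault
$ sudo tcpdump -i any -A port 8200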

F21 commented 7 years ago

@travis-bear Thanks! 😄 This is a major defect and we'd like to fix it as soon as possible.

travis-bear commented 7 years ago

I'm not sure if this is a red herring or not. Roughly 3x/second, each instance of kubernetes-vault is hitting vault with a request that looks like this:

PUT /v1/intermediate-ca/issue/kubernetes-vault

The response to each request from vault includes 4 certificates, two of which are unique. If these are cryptographically strong certificates that are being generated in response to each request, it's easy to imagine this consuming mucho CPU.

We'll see about getting the trace level Vault log in the next day or two.

F21 commented 7 years ago

@travis-bear Thanks! That might be a start. Can you please post your exact kube deployment (with any sensitive info removed) as well?

Is it possible that your PKI backends are set to a low TTL? In that case, it is possible that the certs are expiring and kubernetes-vault is renewing.

Please also post the output of vault mounts if possible.
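Something like this should show the mounts and the role settings, including any TTLs (the mount and role names are taken from the request path you posted, so adjust if yours differ):

$ vault mounts
# role settings, including ttl/max_ttl, for the role kubernetes-vault uses
$ vault read intermediate-ca/roles/kubernetes-vault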

F21 commented 7 years ago

The most likely problem is that in Vault 0.7.0, generating a certificate no longer returns the duration of the lease. The duration is therefore implicitly 0, which causes kubernetes-vault to ask Vault to issue a new certificate every second.

See: https://github.com/hashicorp/vault/blob/master/CHANGELOG.md#070-march-21th-2017
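A quick way to confirm this is to hit the issue endpoint directly and look at the lease duration in the response (path and role taken from the requests you captured; the common_name has to be one your role allows, and jq is optional). On 0.7.0 I'd expect lease_duration to come back as 0:

# request a certificate and print only the lease duration
$ curl -s -X PUT \
    -H "X-Vault-Token: $VAULT_TOKEN" \
    -d '{"common_name": "kubernetes-vault"}' \
    "$VAULT_ADDR/v1/intermediate-ca/issue/kubernetes-vault" | jq '.lease_duration'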

Therefore, I suggest implementing the following:

F21 commented 7 years ago

@travis-bear Thanks for your patience and help with debugging this 😄

I just merged #34 which fixes this problem. The changes have been released as v0.4.4.

I was also able to reproduce the problem. Originally my deployment had prometheus.tls.vaultCertBackend and prometheus.tls.vaultCertRole commented out in the config, so it wasn't requesting certificates from Vault; after uncommenting those, I was able to reproduce the CPU pinning.

Please give v0.4.4 a go and see if it fixes the issue.
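If it's easier than editing the yaml by hand, bumping the image in place should also work (the image name, tag and container name below are assumptions; double-check them against the release):

$ kubectl set image deployment/kubernetes-vault \
    kubernetes-vault=boostport/kubernetes-vault:0.4.4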

JulianWei commented 7 years ago

I work with @travis-bear. Your changes in v0.4.4 fixed the bug. Vault CPU stays at 0%, and no API requests are received after the sample-app is deployed.

F21 commented 7 years ago

@JulianWei Thanks for the heads up! I'll close the issue. Please reopen if it happens again.