EngineerBetter / concourse-up

Deprecated - use Control Tower instead
https://github.com/EngineerBetter/control-tower
Apache License 2.0

Credhub request is really slow or timeout [critical] #46

Closed kgrodzicki closed 6 years ago

kgrodzicki commented 6 years ago

Hi,

after updating Concourse to the latest version with concourse-up, we are finding that CredHub is really slow or actually times out. Can you please give us some tips on how to troubleshoot this?

DanielJonesEB commented 6 years ago

Hi @kgrodzicki,

Sorry to hear that. Funnily enough, a few minutes ago I was working on our Concourse-Up-deployed CredHub instance, and it was working just fine.

How familiar are you with BOSH and debugging BOSH-deployed systems?

kgrodzicki commented 6 years ago

Hi @DanielJonesEB, thank you for the fast answer. I haven't used BOSH so far. Where should I start?

DanielJonesEB commented 6 years ago
  1. Run `eval "$(concourse-up info DEPLOYMENT --env)"`, which will populate your session with various environment variables.
  2. Run `bosh ssh web`, which will get you an SSH session on the VM with CredHub running.
  3. Run `sudo su -` to change to root.
  4. Have a look around on the box. Logs are in `/var/vcap/sys/log/`, and monit is running as a process monitor. (The commands are collected in the sketch below.)
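
Collected as a single session for convenience. This is only a sketch: `DEPLOYMENT` stands for your deployment name, and the `monit summary` and log-tail lines are illustrative additions rather than part of the steps above.

```sh
eval "$(concourse-up info DEPLOYMENT --env)"        # populate the session with the BOSH environment variables
bosh ssh web                                        # SSH onto the web VM, where CredHub runs

# On the web VM:
sudo su -                                           # become root
monit summary                                       # assumption: list the jobs monit is supervising and their state
ls /var/vcap/sys/log/                               # per-job log directories live here
tail -n 50 /var/vcap/sys/log/credhub/credhub.log    # recent CredHub log output
```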

Let us know if you find anything interesting.

kgrodzicki commented 6 years ago

```
$ cat credhub.stdout.log
[2018-05-28 06:41:41+0000] Could not reach the UAA server
```

And the logs in `/var/vcap/sys/log/credhub/credhub.log` look interesting. Many authorization errors like:

```
2018-05-29T13:50:23.636Z [https-jsse-nio-8844-exec-173] .... INFO --- CEFAudit: CEF:0|cloud_foundry|credhub|0.0.0|GET /api/v1/data|GET /api/v1/data|0|rt=1527601814412 suser=atc_to_credhub suid=uaa-client:atc_to_credhub cs1Label=userAuthenticationMechanism cs1=uaa request=/api/v1/data?current=true&name=%2Fconcourse%2Fteam%2Fgit_paths requestMethod=GET cs3Label=versionUuid cs3=99843440-ad9c-4385-bf03-edd8f6db190d cs4Label=httpStatusCode cs4=200 src=127.0.0.1 dst=127.0.0.1 cs2Label=resourceName cs2=/concourse/team/git_paths cs5Label=resourceUuid cs5=... deviceAction=GET cs6Label=requestDetails cs6={"name":"/concourse/team/git_paths","versions":null,"current":true}
2018-05-29T13:50:23.718Z [https-jsse-nio-8844-exec-58] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
```

and:

```
2018-05-29T14:29:59.608Z [https-jsse-nio-8844-exec-13] .... INFO --- RetryingEncryptionService: Attempting decrypt
2018-05-29T14:29:59.676Z [https-jsse-nio-8844-exec-42] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
2018-05-29T14:29:59.713Z [https-jsse-nio-8844-exec-85] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
2018-05-29T14:29:59.736Z [https-jsse-nio-8844-exec-74] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
```

It is only Concourse that calls CredHub at the moment, and it looks like it is DoS'ing it, making it unresponsive.

iferunsewe commented 6 years ago

Hi @kgrodzicki,

We're having trouble replicating this. Could you give some more details about your deployment?

What flags did you use when you first deployed with `concourse-up deploy`? Specifically, what sizes are your web, worker, and RDS instances, and how many workers do you have?

Approximately how many pipelines are there on the Concourse?

After running `eval "$(concourse-up info DEPLOYMENT --env)"` you should be able to run `credhub find` (assuming you have the CredHub CLI on your PATH). Is it slow to respond when you do this?
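
For example, a quick check from a terminal (a sketch: the team and credential names are hypothetical):

```sh
eval "$(concourse-up info DEPLOYMENT --env)"          # sets the environment variables the credhub CLI needs
time credhub find                                     # list credentials and see how long the request takes
time credhub get -n /concourse/main/some-credential   # hypothetical path: time a single lookup
```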

Thanks.

crsimmons commented 6 years ago

After taking a look at our logs I don't think the errors listed are necessarily related to your problem. We see similar errors but aren't experiencing credhub timeouts or slowness.

Part of the CredHub integration in Concourse is that credentials can be stored either at the team level (`/concourse/TEAM/credential`) or at the pipeline level (`/concourse/TEAM/PIPELINE/credential`). Concourse always checks the pipeline-level path before the team-level one.
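
Concretely, for a var `((my-secret))` used in pipeline `my-pipeline` of team `main` (all names hypothetical), the lookup order can be reproduced with the CredHub CLI:

```sh
# Concourse tries the pipeline-level path first...
credhub get -n /concourse/main/my-pipeline/my-secret
# ...and falls back to the team-level path if that credential does not exist.
credhub get -n /concourse/main/my-secret
# A miss on the pipeline-level path is what produces the
# "The request could not be completed because the credential does not exist or you do not have sufficient authorization." log line.
```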

In our logs we are seeing quite a few "The request could not be completed because the credential does not exist or you do not have sufficient authorization." errors, but they seem to come from checks for pipeline-level credentials. My hypothesis is that these errors are generated whenever a team-level credential is requested.

Something you could check is the credit balance/usage on your RDS instance. You can see this in the AWS console. I have seen before that the DB got put under extra load after transitioning a number of pipelines over to using CredHub; the credits got exhausted and the CPU started being throttled, which caused a variety of slowness-related issues. Running `concourse-up deploy` with a larger value for `--db-size` fixed that issue (AWS will transfer all DB data over to the new, larger instance).
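
The credit balance can also be checked from the AWS CLI. A sketch, in which the DB instance identifier and time window are placeholders:

```sh
# CPUCreditBalance is only reported for burstable (t2) RDS instance classes.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUCreditBalance \
  --dimensions Name=DBInstanceIdentifier,Value=my-concourse-db \
  --statistics Average \
  --period 300 \
  --start-time 2018-05-29T00:00:00Z \
  --end-time 2018-05-29T12:00:00Z

# If the balance is exhausted, redeploy with a larger DB instance, e.g.:
concourse-up deploy --db-size xlarge DEPLOYMENT
```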

kgrodzicki commented 6 years ago

Hi @iferunsewe @crsimmons, thank you for your answers.

@iferunsewe We have around 70 pipelines so far.

The first installation command was `concourse-up deploy --worker-size large --workers 1 --domain test.com test`, and the last update, to version 0.9.0, was `concourse-up deploy --worker-size medium --workers 8 --web-size large --db-size medium --region eu-west-1 --domain test.com test`.

@crsimmons Looks like the IOPS on RDS have been used up. I am going to update the DB instance type and will keep you updated.

kgrodzicki commented 6 years ago

After updating the DB instance to xlarge, it looks much better now. Thank you guys for your time, it was really helpful!

engrun commented 6 years ago

Unfortunately, the problem does not seem to be fully fixed. This morning concourse was really slow again, and after investigating further we have established the following:

Using secrets and variables from CredHub significantly slows down our pipelines.

Experiment: created a hello-world pipeline using a config (a `resources:` section plus a `jobs:` section) whose params come from CredHub.

Using this config, the hello-world task takes 2 minutes and 10 seconds.

By removing the CredHub params, the build time drops to 6 seconds.

For every CredHub param we add to the config, the build time increases.

NB: the CredHub params we use are all at the team level. Could this impact the resolution time?
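
For reference, a minimal sketch of the kind of hello-world pipeline described above, with a single team-level param; the names, container image, and the credential `greeting` are hypothetical, not the actual config from the experiment:

```sh
# Hypothetical hello-world pipeline with one CredHub-backed param.
# ((greeting)) is resolved via CredHub at build time, under
# /concourse/TEAM/PIPELINE/greeting first, then /concourse/TEAM/greeting.
cat > hello.yml <<'EOF'
jobs:
- name: hello-world
  plan:
  - task: say-hello
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: alpine}
      params:
        GREETING: ((greeting))
      run:
        path: sh
        args: ["-c", "echo \"$GREETING\""]
EOF

fly -t my-target set-pipeline -p hello -c hello.yml   # my-target is a hypothetical fly target
fly -t my-target unpause-pipeline -p hello
fly -t my-target trigger-job -j hello/hello-world -w  # watch how long the task takes to start
```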

Are there any plans to support AWS SSM as an alternative? This would most likely solve our problem and keep us aligned with our policy of using AWS tools.

engrun commented 6 years ago

In addition, after upgrading the RDS instance to extra large, its CPU is constantly hitting the roof at 90-100%.
This is weird, as the read and write IOPS are very low.

engrun commented 6 years ago

The performance has been degrading gradually; it is not only related to the latest upgrade (see the first post). This is probably because the developers are creating more and more pipelines, and thus also generating more and more params/secrets in CredHub.

crsimmons commented 6 years ago

I don't think this issue is the result of a problem with Concourse-up. Concourse-up is just a mechanism for deploying Concourse - I can't think of anything special that we are doing that would result in performance degradation. If anything it sounds like an issue either with Concourse, Credhub, or the integration between the two. You might get some better debugging advice if you raise issues with those projects. You could also start a thread over at the Concourse discuss forum to see if anyone else has encountered this.

It's challenging for me to artificially replicate the load generated by many devs on ~70 pipelines, but at clients I have seen Concourse databases slow down tremendously as the number of CredHub lookups from pipelines increases. As far as I can tell, the only real solution from an ops perspective is to scale the instance.

As for your question about AWS SSM, that isn't on our current roadmap but I have added a note about it to our story planning board.

engrun commented 6 years ago

Yes, we have boosted both web and db to 2xlarge to make it manageable. I guess we have to migrate away from credhub to SSM sooner rather than later. Thanks for the input

archgrove commented 6 years ago

We had this exact problem and traced it to the t2 instance type being used for the web EC2 instance. By default, Concourse-up uses a t2.small. It turns out CredHub is a huge CPU monster, and with 30-ish pipelines it just ate through all our CPU credits in no time. A throttled T2 instance can't keep up.

I strongly recommend not using T2s for the web instance.

crsimmons commented 6 years ago

We could add more web sizes to Concourse-up: either some m4 ones, or maybe the ability to enable T2 Unlimited on the existing instances.

I'd be interested to know (from @archgrove or @engrun or anyone else seeing this issue) if the web VM's CPU is running close to max all the time or if it only spikes up some of the time. You can see a CPU graph on the included Grafana dashboard at <your-concourse>:3000.

engrun commented 6 years ago

We are now running 2xlarge for worker, web, and database. Now it is finally working as expected (although it still takes 10-15 seconds in the pending state before tasks start). As @archgrove says, our finding is also that CredHub is "a huge CPU monster". Our current deployment is estimated at $1,300 a month, which really is a lot.

Which reminds me of a previous issue regarding secrets management, where we advocated a pass-through for credential management configuration, leaving it up to the users of concourse-up to choose their own credential manager. This issue makes that relevant again, as we would like to use AWS SSM instead of CredHub, given the state of CredHub. I think we have two possible solutions in the longer term: 1) wait for SSM support in concourse-up, or 2) host Concourse on EC2 ourselves instead.

archgrove commented 6 years ago

@crsimmons We observed T2 credit depletion due to a persistent > 30% CPU load on our t2.small instance. SSHing into the box indicated that most of this seemed due to the credhub Java processes. As soon as the CPU credits were exhausted, it was pegged to 100% (no surprise on a throttled T2). A t2.large running the same workload is persistently sitting around 30% CPU usage, which is fortunately below the threshold at which the credit balance will decrease. However, we won't need many more pipelines before even that isn't enough (hence my advice to avoid T2 instance types).
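
The credit depletion on the web EC2 instance can be watched via CloudWatch in the same way as for RDS (the instance ID and time window are placeholders):

```sh
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Average \
  --period 300 \
  --start-time 2018-05-29T00:00:00Z \
  --end-time 2018-05-29T12:00:00Z
```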

Interestingly, our RDS is fine as a t2.small. Whatever the workload is, it's not that database heavy.

I'd love to get SSM support into concourse-up, given that it has native support in Concourse. This would fit with our preferred model of keeping everything that AWS offers "as a service" inside the AWS estate. Until someone (maybe us) finds time to PR that, documenting the web instance sizes and offering non-T2s would probably be a win.

Frankly, I'd also like to find out why the ATC + credhub use that much resource. Our workload isn't particularly heavy, and it seems excessive to need large instances to run the CI pipeline management software - not even the pipeline workload.

archgrove commented 6 years ago

@engrun @crsimmons One other thing we're trialing is increasing the check_every interval (https://concourse-ci.org/resources.html#resource-check-every) on resources which change less frequently, i.e. checking them less often. The majority of the CredHub load seems to be induced by our circa 100 resources polling every minute, each of which induces a couple of CredHub lookups. One would think that an average of 2-3 calls a second wouldn't be that much of a drain, but it seems to be (when we ran into this the first time, our fix was "pause every non-critical pipeline until credits rebuilt").

crsimmons commented 6 years ago

From experience, setting check_every to 2-3 minutes greatly reduces the number of checks without negatively impacting pipelines. It means it will take marginally longer for changes to be picked up, but manually triggering a job will pick them up immediately.
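
A sketch of what that looks like on a resource (the resource name and repo URI are hypothetical):

```sh
# Hypothetical git resource with a relaxed polling interval.
cat > pipeline.yml <<'EOF'
resources:
- name: my-repo
  type: git
  source: {uri: https://github.com/example/repo.git}
  check_every: 3m   # default is 1m; checking less often means fewer CredHub lookups
EOF
```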

engrun commented 6 years ago

@archgrove @crsimmons We have solved the check-every problem by using webhooks instead, and builds now trigger "instantly".
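
A sketch of the webhook approach, assuming a git resource `my-repo` in pipeline `hello` under team `main` and an externally reachable Concourse URL (all hypothetical):

```sh
# Give the resource a webhook token and relax polling to a fallback interval.
cat > pipeline.yml <<'EOF'
resources:
- name: my-repo
  type: git
  source: {uri: https://github.com/example/repo.git}
  webhook_token: some-shared-token   # hypothetical shared token
  check_every: 24h                   # webhooks do the real triggering; polling becomes a fallback
EOF

# Configure the repository host (e.g. a GitHub push webhook) to POST to Concourse's webhook check endpoint:
curl -X POST "https://concourse.example.com/api/v1/teams/main/pipelines/hello/resources/my-repo/check/webhook?webhook_token=some-shared-token"
```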

engrun commented 6 years ago

I know SSM support was recently added to your backlog. Any chance this issue could affect its priority?

DanielJonesEB commented 6 years ago

Hi all,

@iferunsewe and @takeyourhatoff did some profiling and found behaviour in the CredHub client library that is causing massive amounts of inefficiency: https://github.com/cloudfoundry-incubator/credhub-cli/pull/45

We've reached out to the CredHub and Concourse teams to get this merged into the right places.

Thank you for your patience!

crsimmons commented 6 years ago

For those interested, you can follow our Concourse issue here: https://github.com/concourse/concourse/issues/2373