Closed by kgrodzicki 6 years ago
Hi @kgrodzicki,
Sorry to hear that. Funnily enough, a few minutes ago I was working on our Concourse-Up-deployed CredHub instance, and it was working just fine.
How familiar are you with BOSH and debugging BOSH-deployed systems?
Hi @DanielJonesEB, thank you for the fast answer. I haven't used BOSH so far. Where should I start?
Run `eval "$(concourse-up info DEPLOYMENT --env)"`, which will populate your session with various environment variables. Then `bosh ssh web` will get you an SSH session on the VM with CredHub running, and `sudo su -` will change you to root. Logs are in `/var/vcap/sys/log/`, and `monit` is running as a process monitor. Let us know if you find anything interesting.
```
$ cat credhub.stdout.log
[2018-05-28 06:41:41+0000] Could not reach the UAA server
```
And the logs in `/var/vcap/sys/log/credhub/credhub.log` look interesting. Many authorization errors like:
```
2018-05-29T13:50:23.636Z [https-jsse-nio-8844-exec-173] .... INFO --- CEFAudit: CEF:0|cloud_foundry|credhub|0.0.0|GET /api/v1/data|GET /api/v1/data|0|rt=1527601814412 suser=atc_to_credhub suid=uaa-client:atc_to_credhub cs1Label=userAuthenticationMechanism cs1=uaa request=/api/v1/data?current=true&name=%2Fconcourse%2Fteam%2Fgit_paths requestMethod=GET cs3Label=versionUuid cs3=99843440-ad9c-4385-bf03-edd8f6db190d cs4Label=httpStatusCode cs4=200 src=127.0.0.1 dst=127.0.0.1 cs2Label=resourceName cs2=/concourse/team/git_paths cs5Label=resourceUuid cs5=... deviceAction=GET cs6Label=requestDetails cs6={"name":"/concourse/team/git_paths","versions":null,"current":true}
2018-05-29T13:50:23.718Z [https-jsse-nio-8844-exec-58] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
```
and
```
2018-05-29T14:29:59.608Z [https-jsse-nio-8844-exec-13] .... INFO --- RetryingEncryptionService: Attempting decrypt
2018-05-29T14:29:59.676Z [https-jsse-nio-8844-exec-42] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
2018-05-29T14:29:59.713Z [https-jsse-nio-8844-exec-85] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
2018-05-29T14:29:59.736Z [https-jsse-nio-8844-exec-74] .... ERROR --- ExceptionHandlers: The request could not be completed because the credential does not exist or you do not have sufficient authorization.
```
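If you want to gauge how much of the CredHub traffic is failing, a quick count of those error lines helps. A minimal sketch (the helper name is made up; the default log path is the one mentioned above):

```shell
# Count "insufficient authorization" errors in a CredHub log.
# Defaults to the log path mentioned above; pass another file to override.
count_credhub_auth_errors() {
  local log=${1:-/var/vcap/sys/log/credhub/credhub.log}
  grep -c 'credential does not exist or you do not have sufficient authorization' "$log"
}
```

Comparing this count against the total number of CEF audit lines gives a rough failure rate.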
Only Concourse is calling CredHub at the moment, and it looks like it is DoS'ing it, making it unresponsive.
Hi @kgrodzicki,
We're having trouble replicating this. Could you give some more details about your deployment?
What flags did you use when you first deployed with `concourse-up deploy`? Specifically, what size are your web, workers, and RDS instances, and how many workers do you have?
Approximately how many pipelines are there on the Concourse?
After running `eval "$(concourse-up info DEPLOYMENT --env)"` you should be able to run `credhub find` (assuming you have the CredHub CLI on your PATH). Is it slow to respond when you do this?
Thanks.
After taking a look at our logs I don't think the errors listed are necessarily related to your problem. We see similar errors but aren't experiencing credhub timeouts or slowness.
Part of the CredHub implementation in Concourse is that credentials can be stored either at team level (`/concourse/TEAM/credential`) or at pipeline level (`/concourse/TEAM/PIPELINE/credential`). Concourse will always check the latter before the former.
In our logs we are seeing quite a few `The request could not be completed because the credential does not exist or you do not have sufficient authorization.` errors, but they seem to come from checks for pipeline-level credentials. My hypothesis is that these errors are generated whenever a team-level credential is requested.
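That two-level lookup order can be sketched as a shell fallback. This is illustrative only, not Concourse source; `CREDHUB_GET` is a hypothetical hook so the real `credhub get -n` call can be swapped out:

```shell
# Try the pipeline-scoped path first, then fall back to the team-scoped path.
# CREDHUB_GET is a hypothetical override hook; defaults to the real CLI call.
lookup_credential() {
  local team=$1 pipeline=$2 name=$3
  ${CREDHUB_GET:-credhub get -n} "/concourse/${team}/${pipeline}/${name}" 2>/dev/null \
    || ${CREDHUB_GET:-credhub get -n} "/concourse/${team}/${name}"
}
```

Under this model, every team-level credential costs a failed pipeline-level lookup first, which would explain the ERROR lines above.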
Something you could check is your credit balance/usage on your RDS instance; you can see this in the AWS console. I have seen a DB get put under extra load after a number of pipelines transitioned to using CredHub: the credits got exhausted and the CPU started being throttled, which caused a variety of slowness-related issues. Running `concourse-up deploy` with a larger value for `--db-size` fixed that issue (AWS will transfer all DB data over to the new, larger instance).
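You can also check the burst-credit balance without the console by querying CloudWatch. A sketch that prints the AWS CLI command to run (the instance identifier `concourse-up-db` is a placeholder; assumes GNU `date` and the AWS CLI on your PATH):

```shell
# Print an AWS CLI command that fetches the RDS CPUCreditBalance metric
# for the last hour; a balance near zero means the instance is being throttled.
rds_credit_query() {
  local db=$1
  printf '%s\n' "aws cloudwatch get-metric-statistics \
--namespace AWS/RDS --metric-name CPUCreditBalance \
--dimensions Name=DBInstanceIdentifier,Value=${db} \
--start-time $(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 --statistics Average"
}
rds_credit_query concourse-up-db   # "concourse-up-db" is a placeholder identifier
```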
Hi @iferunsewe @crsimmons, thank you for your answers.
@iferunsewe We have around 70 pipelines so far.
First installation command was:
```
concourse-up deploy --worker-size large --workers 1 --domain test.com test
```
and last update to 0.9.0 version:
```
concourse-up deploy --worker-size medium --workers 8 --web-size large --db-size medium --region eu-west-1 --domain test.com test
```
@crsimmons Looks like the IOPS have been used up on RDS. I am going to update the instance type for the DB and will keep you updated.
After updating the DB instance to xlarge, it looks much better now. Thank you guys for your time, it was really helpful!
Unfortunately, the problem does not seem to be fully fixed. This morning concourse was really slow again, and after investigating further we have established the following:
Using secrets and variables from CredHub significantly slows down our pipelines.
Experiment: created a hello-world pipeline using the following config.
```
resources:
jobs:
```
Using this config, the hello-world task takes 2 minutes and 10 seconds.
By removing the CredHub params, the build time drops to 6 seconds. The config:

```
resources:
jobs:
```
For every credhub param we add to the config, the build-time increases.
NB! The CredHub params we use are all at the team level. Could this impact the resolution time?
Are there any plans to support AWS SSM as an alternative? This would most likely solve our problem and keep us aligned with our policy of using AWS tools.
In addition, after upgrading the RDS instance to extra large, the CPU is constantly hitting the roof at 90% to 100%.
This is weird, as the read and write IOPS are very low.
The performance has been degrading gradually; it is not only related to the latest upgrade (see the first post). This is probably because the developers are creating more and more pipelines, thus also generating more and more params/secrets in CredHub.
I don't think this issue is the result of a problem with Concourse-up. Concourse-up is just a mechanism for deploying Concourse - I can't think of anything special that we are doing that would result in performance degradation. If anything it sounds like an issue either with Concourse, Credhub, or the integration between the two. You might get some better debugging advice if you raise issues with those projects. You could also start a thread over at the Concourse discuss forum to see if anyone else has encountered this.
It's challenging for me to artificially replicate the load generated by many devs on ~70 pipelines, but at clients I have seen Concourse databases slow down tremendously as the number of CredHub lookups from pipelines increases. As far as I can tell, the only real solution from an ops perspective is to scale the instance.
As for your question about AWS SSM, that isn't on our current roadmap but I have added a note about it to our story planning board.
Yes, we have boosted both web and db to 2xlarge to make it manageable. I guess we have to migrate away from credhub to SSM sooner rather than later. Thanks for the input
We had this exact problem, and diagnosed it to the t2 instance type being used for the `web` EC2 instance. By default, Concourse-up uses a `t2.small`. It turns out CredHub is a huge CPU monster, and with 30-ish pipelines it just ate through all our CPU credits in no time. A throttled T2 instance can't keep up.
I strongly recommend not using T2s for the web instance.
We could add more web sizes to Concourse-up. Either some m4 ones or maybe the ability to enable t2 unlimited on the existing instances.
I'd be interested to know (from @archgrove or @engrun or anyone else seeing this issue) if the web VM CPU is running close to max all the time or if it only spikes up some of the time. You can see a CPU graph on the included Grafana dashboard at `<your-concourse>:3000`.
We are now running 2xlarge for worker, web, and database. Now it is finally working as expected (although it still takes 10-15 seconds in the pending state before tasks start). As @archgrove says, our finding is also that CredHub is "a huge CPU monster". Our current deployment is estimated at $1300 a month, which really is a lot.
Which reminds me of a previous issue regarding secrets management, where we advocated a pass-through for credential management configuration, leaving it up to the users of concourse-up to choose their own credential manager. This issue makes that relevant again, as we would like to use AWS SSM instead of CredHub given the state of CredHub. I think we have two possible solutions in the longer term: 1) wait for SSM support in concourse-up, or 2) host Concourse on EC2 instead.
@crsimmons We observed T2 credit depletion due to a persistent > 30% CPU load on our t2.small instance. SSHing into the box indicated that most of this seemed due to the credhub Java processes. As soon as the CPU credits were exhausted, it was pegged to 100% (no surprise on a throttled T2). A t2.large running the same workload is persistently sitting around 30% CPU usage, which is fortunately below the threshold at which the credit balance will decrease. However, we won't need many more pipelines before even that isn't enough (hence my advice to avoid T2 instance types).
Interestingly, our RDS is fine as a t2.small. Whatever the workload is, it's not that database heavy.
I'd love to get SSM support into concourse-up, given that it has native support in Concourse. This would fit with our preferred model of keeping everything that AWS offers "as a service" inside the AWS estate. Until someone (maybe us) finds time to PR that, documenting the web instance sizes, and offering non-t2s would probably be a win.
Frankly, I'd also like to find out why the ATC + credhub use that much resource. Our workload isn't particularly heavy, and it seems excessive to need large instances to run the CI pipeline management software - not even the pipeline workload.
@engrun @crsimmons One other thing we're trialing is increasing the `check-every` interval (https://concourse-ci.org/resources.html#resource-check-every) on resources which change less frequently. The majority of the CredHub load seems to be induced by our circa 100 resources polling every minute, each of which induces a couple of CredHub lookups. One would think that an average of 2-3 calls a second wouldn't be that much of a drain, but it seems to be (when we ran into this the first time, our fix was "pause every non-critical pipeline until credits rebuilt").
From experience, setting `check-every` to 2-3 min greatly reduces the number of checks without negatively impacting pipelines. It means it will take marginally longer for changes to be picked up, but manual job triggering will pick them up immediately.
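For reference, `check_every` is set per resource in the pipeline config. A minimal sketch (the resource name and URI are made up):

```yaml
resources:
- name: my-repo                 # hypothetical resource
  type: git
  source:
    uri: https://example.com/my-repo.git
  check_every: 3m               # default is 1m; a longer interval means fewer checks and CredHub lookups
```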
@archgrove @crsimmons We have solved `check-every` by using webhooks instead, and builds trigger "instantly".
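The webhook approach replaces polling with a push: Concourse exposes a per-resource check endpoint guarded by `webhook_token`, so the polling interval can be relaxed. A sketch (the resource and token are made up):

```yaml
resources:
- name: my-repo                      # hypothetical resource
  type: git
  source:
    uri: https://example.com/my-repo.git
  webhook_token: some-secret-token   # the Git host POSTs to Concourse's check endpoint with this token
  check_every: 24h                   # keep a slow poll as a safety net
```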
I know SSM support was recently added to your backlog. Any chance this issue could affect its priority?
Hi all,
@iferunsewe and @takeyourhatoff did some profiling and found behaviour in the CredHub client library that is causing massive amounts of inefficiency: https://github.com/cloudfoundry-incubator/credhub-cli/pull/45
We've reached out to the CredHub and Concourse teams to get this merged into the right places.
Thank you for your patience!
For those interested, you can follow our Concourse issue here https://github.com/concourse/concourse/issues/2373
Hi,
after updating Concourse to the latest version with concourse-up, we find that CredHub is really slow or actually times out. Can you please give us some tips on how to troubleshoot this?