Closed lisamburns closed 6 years ago
@lisamcho Thanks for your comments. The first suggestion from the CAPI team was to remove this IRD=2 call; now you say that the summary is still not good enough. Calling the apps endpoint, then org, then space could lead to 10K+ calls to the CC (even more). What do you suggest?
Please feel free to contribute if you want.
We are seeing very high performance hits to the CC API, seemingly due to this scenario (tons of apps/orgs/spaces in this foundation). Are there any plans to relieve this issue?
@stwomack No real magic here; I have already reduced the kinds of calls made to the CC. There isn't much more to do unless we create a common cache for all f2s instances.
Not sure what you mean by tons of apps/orgs/spaces, but you can still set the CC pull time to 0s (or 0), and the big load should then happen only when you start the f2s instances. This stops the periodic CC pulling, but with the side effect that you will no longer be able to use F2S_DISABLE_LOGGING.
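The behavior described above (pull interval of 0 disables the periodic Cloud Controller refresh, leaving only the initial bulk load at startup) could be sketched like this. This is a minimal illustration, not the project's actual code; `startCachePolling` is a made-up name.

```go
package main

import (
	"fmt"
	"time"
)

// startCachePolling runs refresh() every interval. An interval of zero
// (or less) means "load once at startup, never poll again".
func startCachePolling(interval time.Duration, refresh func()) (stop func()) {
	refresh() // the initial bulk load always happens
	if interval <= 0 {
		return func() {} // periodic CC pulling disabled
	}
	ticker := time.NewTicker(interval)
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-ticker.C:
				refresh()
			case <-done:
				ticker.Stop()
				return
			}
		}
	}()
	return func() { close(done) }
}

func main() {
	calls := 0
	stop := startCachePolling(0, func() { calls++ })
	defer stop()
	time.Sleep(50 * time.Millisecond)
	fmt.Println(calls) // only the initial load ran
}
```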
Oddly, setting `cc-pull-time = 0` had no effect. The CC slowness and failures continued.
How many apps/spaces/orgs do you have? How many instances of f2s do you have? Can you count the number of f2s calls to your CC? (Check the HTTP agent.)
I'm experiencing this issue as well. I have 26 orgs, each with ~5 spaces (a couple have more than 10). I have 4 different f2s deployments running with 2 instances each, because we have to send different events to different endpoints. According to the toplogs-cloudcontroller.sh script, f2s made 1,021,945 requests in a 15-hour period, while the next top requester (monit) made only 23,264 requests.
@jgeorge300 Did you try with `cc-pull-time = 0`?
Hi all, we've also been observing high load on our API servers. Here are our findings from investigating over the past week or so:

- `cloud_controller_ng` job on our 2 API instances: while an improvement (average CPU dropped from around 90% to around 50%), it still isn't great. *(CPU graph omitted)*
- `cc-pull-time` (with +/- 25% randomness to spread out requests). After making code changes to this effect and deploying, we saw: *(graph omitted)*
- App renames during blue-green deploys (e.g. `foo` is renamed to `foo-venerable`): since the log messages are quite likely to be associated with the wrong name for the app anyway, it is more useful to us to keep them under the original name (no suffix), so we added a flag to allow specifying a list of suffixes to strip from application names. Final result: *(graph omitted)*

We'll keep an eye on it, but it seems to be working OK for our needs so far. Our largest deployment only has around 250 applications, but the vast majority of these are accessed by developers only, so our actual traffic numbers are low. It's not clear to me whether lazy loading (in the same thread that processes the nozzle) could cause issues with slowing the nozzle down too much in larger deployments.
Before I get overly enthusiastic and submit a serious PR for it, I wanted to check in on a few things...
Is there a particular reason this data is persisted to a BoltDB database, which by default is written to a temporary location and thus usually won't persist any longer than the job itself?
Has lazy loading been suggested and rejected in the past? I figured that in theory it should be quicker, as generally loading individual entities from CC seems much quicker than enumerating them.
Why are the TTLs so low by default for names that surely are very unlikely to change? My guess is that this is to pick up new applications? Can lazy loading prevent that from being an issue?
I've opened a WIP PR (https://github.com/cloudfoundry-community/firehose-to-syslog/pull/192) to share our work to date. What we've done so far has been mostly to experiment - I'd be happy to polish it some more if the overall approach is supported.
Let me know your thoughts.
@aeijdenberg Thanks for the PR and the info. My customer has more than x,000 apps, and in some environments people push a lot, so there are constantly new app IDs and so on, spread across hundreds and hundreds of orgs/spaces.
Traffic is more than 10K (tps or qps), so having to lazy load a new app could quickly fill up the buffer.
Anyway, I recognize the work and the improvement you did, which is great! What I would like is to be a little more flexible, meaning choosing between a "lazy" mode and the "old way" of loading from the CC.
I guess it's fair enough to let people try lazy and quickly go back to the old loading mode.
It should not be difficult to add this option: just add a new Go file to implement the lazy mode. (You can even remove BoltDB in the lazy implementation if you want.)
For your questions: 1 - Not everyone runs the nozzle as a CF app. People using it write to disk, which persists across VM replacement and so on. 2 - Yes, if I recall, but at the time we found that loading apps in the background every X time should have been OK. 3 - Not sure why you implement TTL ...
@shinji62 - I wonder if the best of both worlds might be to initially do a big load before starting consumption of the firehose, and then combine lazy loading for new apps with a large, slightly randomised TTL (to spread out requests) for refreshing app/space/org names?
That way the DB would be initially populated, and only messages for new apps might trigger a momentary pause.
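The hybrid described above (bulk load first, then lazy fill on cache misses) could be sketched roughly as follows. The types and names here are illustrative stand-ins, not the project's real ones.

```go
package main

import (
	"fmt"
	"sync"
)

// AppInfo is a cut-down stand-in for the app metadata the nozzle caches.
type AppInfo struct{ Name, Space, Org string }

type appCache struct {
	mu    sync.Mutex
	apps  map[string]AppInfo
	fetch func(guid string) (AppInfo, error) // lazy single-app lookup against CC
}

// Get returns the cached entry, falling back to a one-off CC fetch for
// apps that appeared after the initial bulk load, and caching the result.
func (c *appCache) Get(guid string) (AppInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if app, ok := c.apps[guid]; ok {
		return app, nil
	}
	app, err := c.fetch(guid)
	if err != nil {
		return AppInfo{}, err
	}
	c.apps[guid] = app
	return app, nil
}

func main() {
	// Bulk load first (hard-coded here), then lazily fill on misses.
	cache := &appCache{
		apps: map[string]AppInfo{"guid-1": {"billing", "dev", "acme"}},
		fetch: func(guid string) (AppInfo, error) {
			return AppInfo{Name: "new-app", Space: "dev", Org: "acme"}, nil
		},
	}
	a, _ := cache.Get("guid-2") // miss: triggers the lazy fetch
	fmt.Println(a.Name)
}
```

A real implementation would also expire entries (e.g. with a randomised TTL) so renamed apps eventually refresh.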
@aeijdenberg If you set the CC pull time to 0, that's mostly what happens: we fully load the DB, then fill the DB on unknown apps.
It could just be that calling the space summary is still expensive. I still think some calls are expensive.
@shinji62 - thanks - I think I misread that code on first investigation. I suspect that setting that may well solve most of the issues we were looking to solve. I'll take a deeper look and maybe suggest a few small tidy up PRs.
Yes, enumerating apps seems extraordinarily slow from the Cloud Controller, which seems like a problem worth solving on its own, though my Ruby knowledge isn't extensive enough for me to tackle it.
@aeijdenberg any update on your side?
@shinji62 - not yet - other than we've been running our patched version without issue for about a week now and are seeing a big reduction in load across all our envs. I'll try to get back onto this shortly.
@shinji62 , I put an update on our progress in the PR: https://github.com/cloudfoundry-community/firehose-to-syslog/pull/192
I'll paste here too:
We've been running this for a few weeks now without issue and have seen a big drop in CPU usage on our API servers.
I think that's mostly it. We could also consider bumping the default TTL for the cached data from 60 seconds to something much larger, as the only effect of this caching is stale data when an app/space/org is renamed, which I imagine is a fairly rare event.
@aeijdenberg Thanks, will take a look next week.
@shinji62 We would like to pull in this version, as it seems like a promising solution: our Cloud Controllers are seeing 7,000,000 404s a day because of #4 in the list (blue-green deploys), and the caching helps in general. A longer TTL is a great idea too. Do you have any idea when this will get out of the develop branch and be rolled into master?
It's probably waiting on me to update the tests. I'll submit a PR today to remove the failing tests - I realise more is probably wanted, but it's at least better than tests that don't compile.
@Scoobed As @aeijdenberg said, I am waiting for their PR with the tests, so I guess a new release will occur this week.
@shinji62 - have added some basic tests here: https://github.com/cloudfoundry-community/firehose-to-syslog/pull/196
@Scoobed @aeijdenberg release 5.0.0 is now out. Please use it, and reopen this issue if needed.
Frequently polling the Cloud Controller on the `v2/spaces/:guid/summary` endpoint can cause performance issues with the Cloud Controller API for Cloud Foundry deployments with a lot of apps/orgs/spaces. This is because this endpoint is similar to the `inline-relations-depth=2` query parameter in that it fetches levels of nested objects from the database, including all the routes for each app as well as all the services in the space. It looks like the only fields we are using for each app are space_guid, organization_guid, space_name, and organization_name. It could be more efficient to fetch this metadata via the apps list endpoint, and then get the space and org names from the space list and organization list endpoints.