cloudfoundry-community / firehose-to-syslog

Send firehose events from Cloud Foundry to syslog.
MIT License

Reduce load on Cloud Controller API #181

Closed lisamburns closed 6 years ago

lisamburns commented 6 years ago

Frequently polling the cloud_controller on the v2/spaces/:guid/summary endpoint can cause performance issues with the Cloud Controller API for Cloud Foundry deployments with a lot of apps/orgs/spaces. This endpoint is similar to the inline-relations-depth=2 query parameter in that it fetches several levels of nested objects from the database, including all the routes for each app as well as all the services in the space. It looks like, here, the only fields we are using for each app are space_guid, organization_guid, space_name, and organization_name.

It could be more efficient to fetch this metadata via the apps list endpoint, and then get the space and org names from the spaces list and organizations list endpoints.
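
For illustration, here is a minimal sketch of that approach, assuming go-cfclient's v2 ListApps/ListSpaces/ListOrgs helpers; the type, field, and function names below are illustrative, not the project's actual code:

```go
// Sketch only: enumerate apps, spaces and orgs once each and join them
// locally, instead of hitting /v2/spaces/:guid/summary per space.
package metadata

import cfclient "github.com/cloudfoundry-community/go-cfclient"

// AppMeta holds the only fields the nozzle actually needs per app.
type AppMeta struct {
	SpaceGUID, SpaceName, OrgGUID, OrgName string
}

func BuildIndex(c *cfclient.Client) (map[string]AppMeta, error) {
	apps, err := c.ListApps()
	if err != nil {
		return nil, err
	}
	spaces, err := c.ListSpaces()
	if err != nil {
		return nil, err
	}
	orgs, err := c.ListOrgs()
	if err != nil {
		return nil, err
	}

	orgNames := make(map[string]string, len(orgs))
	for _, o := range orgs {
		orgNames[o.Guid] = o.Name
	}
	spaceByGUID := make(map[string]cfclient.Space, len(spaces))
	for _, s := range spaces {
		spaceByGUID[s.Guid] = s
	}

	index := make(map[string]AppMeta, len(apps))
	for _, a := range apps {
		s := spaceByGUID[a.SpaceGuid]
		index[a.Guid] = AppMeta{
			SpaceGUID: a.SpaceGuid,
			SpaceName: s.Name,
			OrgGUID:   s.OrganizationGuid,
			OrgName:   orgNames[s.OrganizationGuid],
		}
	}
	return index, nil
}
```

This amounts to three paginated list calls in total, rather than one summary call per space, at the cost of joining the results in memory.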

shinji62 commented 6 years ago

@lisamcho Thanks for your comments. The first suggestion from the CAPI team was to remove the IRD=2 call; now you are saying that the summary endpoint is still not good enough. Calling the apps endpoint, then orgs, then spaces could lead to 10K+ calls to the CC (even more). What do you suggest?

Please feel free to contribute if you want.

stwomack commented 6 years ago

We are seeing very high performance hits on the CC API, seemingly due to this scenario (tons of apps/orgs/spaces in this foundation). Are there any plans to address this issue?

shinji62 commented 6 years ago

@stwomack No real magic here; I have already reduced the number of calls made to the CC, short of creating a common cache shared by all f2s instances.

Not sure what you mean by tons of apps/orgs/spaces, but you can still set the CC pull time to 0s (or 0) so that the big load happens only when the f2s instances start. This stops the periodic CC polling, but has the side effect that you will no longer be able to use F2S_DISABLE_LOGGING.
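
For context, a hedged sketch of the behaviour described above (not the project's actual code; the names are assumptions): with a non-zero pull interval the cache is refreshed periodically, while 0 leaves only the initial enumeration at startup.

```go
// Illustrative only.
package poller

import "time"

// startRefresher does one full Cloud Controller load at startup and, only if
// ccPullTime is non-zero, keeps refreshing on that interval afterwards.
func startRefresher(ccPullTime time.Duration, refresh func()) {
	refresh() // initial full load in either case
	if ccPullTime <= 0 {
		return // cc-pull-time = 0: no periodic polling of the CC
	}
	ticker := time.NewTicker(ccPullTime)
	go func() {
		for range ticker.C {
			refresh()
		}
	}()
}
```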

stwomack commented 6 years ago

Oddly, setting cc-pull-time = 0 had no effect. The CC slowness and failures continued.

shinji62 commented 6 years ago

How many apps / spaces / orgs do you have? How many instances of f2s do you have? Can you count the number of f2s calls to your CC? (Check the HTTP agent.)

jgeorge300 commented 6 years ago

I'm experiencing this issue as well. I have 26 orgs, and each org has ~5 spaces (a couple have more than 10). I have 4 different f2s deployments running with 2 instances each, because we have to send different events to different endpoints. According to the toplogs-cloudcontroller.sh script, f2s made 1,021,945 requests in a 15-hour period, while the next top requester (monit) made only 23,264 requests.

shinji62 commented 6 years ago

@jgeorge300 Did you try with cc-pull-time = 0?

aeijdenberg commented 6 years ago

Hi all, we've also been observing high load on our API servers. Here are our findings from investigation over the past week or so:

  1. Baseline: For bad reasons, we were running an older build of this component only (roughly from November last year), prior to the "IRD" changes. While the default configuration was 60 seconds between polls, we noted that each poll took about 90 seconds, so the next one would start again immediately. Here are our Grafana graphs for this period, showing the cloud_controller_ng job on our 2 API instances:

[Grafana screenshot]

  2. No IRD: Earlier this week we upgraded to the latest version, and while our average CPU % usage dropped a lot, we were still getting alerts due to high load factors (which we weren't getting before!). It seemed to go in waves; here are the same graphs for a similar period:

[Grafana screenshot]

While an improvement (the average dropped from around 90% CPU to around 50%), it still isn't great.

  3. Lazy loading: Digging into the code, I noted that it is very keen to poll for all apps, spaces and orgs once a minute. I decided to see what would happen if I instead changed it to lazily load app, space and org information and cache each bit of data for approximately the requested cc-pull-time (with +/- 25% randomness to spread out requests); a sketch of this approach follows this list. After making code changes to this effect and deploying, we saw:

[Grafana screenshot]

  4. Longer TTL, and stripping of application name suffixes: Although the perf gains we saw with lazy loading are likely good enough for our needs, we noted that since lazy loading picks up new applications straight away, the only benefit of a short poll time between API calls is picking up renames of applications, spaces or orgs. At least in our deployments we never rename spaces or orgs (I'm not even sure it's possible), and the only context in which we see applications renamed is during a zero-downtime push (foo is renamed to foo-venerable). Since the log messages are quite likely to be associated with the wrong name for the app anyway, it would be more useful to us to keep them under the original name (no suffix), so we added a flag to allow specifying a list of suffixes to strip from application names. Final result:

[Grafana screenshot]
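
As a reference for point 3 above, here is a minimal sketch of lazy loading with a jittered TTL, assuming each cache miss triggers a single per-GUID fetch; the types and names are illustrative, not the code in the PR:

```go
// Sketch of a lazily populated name cache with a randomised TTL.
package cache

import (
	"math/rand"
	"sync"
	"time"
)

type entry struct {
	name      string
	expiresAt time.Time
}

// LazyCache maps a GUID to a name, fetching from the Cloud Controller only on
// a miss or after the entry's TTL has expired.
type LazyCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
	fetch func(guid string) (string, error) // e.g. a single GET per GUID
}

func NewLazyCache(ttl time.Duration, fetch func(string) (string, error)) *LazyCache {
	return &LazyCache{ttl: ttl, items: make(map[string]entry), fetch: fetch}
}

// jitter returns ttl +/- 25% so refreshes spread out instead of stampeding the CC.
func (c *LazyCache) jitter() time.Duration {
	return time.Duration(float64(c.ttl) * (0.75 + rand.Float64()*0.5))
}

// Name returns the cached name for guid, refreshing it lazily when needed.
func (c *LazyCache) Name(guid string) (string, error) {
	c.mu.Lock()
	e, ok := c.items[guid]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.name, nil
	}
	name, err := c.fetch(guid)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.items[guid] = entry{name: name, expiresAt: time.Now().Add(c.jitter())}
	c.mu.Unlock()
	return name, nil
}
```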

We'll keep an eye on it, but it seems to be working OK for our needs so far. Our largest deployment only has around 250 applications, but the vast majority of these are accessed by developers only, so our actual traffic numbers are low. It's not clear to me whether lazy loading (in the same thread that processes the nozzle) could slow the nozzle down too much in larger deployments.

Before I get overly enthusiastic and submit a serious PR for it, I wanted to check in on a few things...

  1. Is there a particular reason this data is persisted to a BoltDB database, which by default is written to a temporary location and thus is unlikely to usually persist any longer than the job itself?

  2. Has lazy loading been suggested and rejected in the past? I figured that in theory it should be quicker, as generally loading individual entities from CC seems much quicker than enumerating them.

  3. Why are the TTLs so low by default for names that surely are very unlikely to change? My guess is this is to pick up new applications? Can lazy loading prevent that from being an issue?

I've opened a WIP PR (https://github.com/cloudfoundry-community/firehose-to-syslog/pull/192) to share our work to date. What we've done so far has been mostly to experiment - I'd be happy to polish it some more if the overall approach is supported.

Let me know your thoughts.

shinji62 commented 6 years ago

@aeijdenberg Thanks for the PR and the info. My customers have more than x,000 apps, and in some environments people push a lot, so new app IDs keep appearing, and of course across hundreds and hundreds of orgs / spaces.

As for traffic, the rate (TPS or QPS) is more than 10K, so having to lazily load a new app could quickly fill up the buffer.

Anyway, I recognize the work and the improvement you made, which is great! What I would like is to be a little more flexible, meaning choosing between a "lazy" mode and the "old way" of loading from the CC.

I think it is fair enough to let people try lazy mode and quickly go back to the old loading mode if needed.

It should not be difficult to add this option: just add a new Go file implementing the lazy mode. (You can even drop BoltDB in the lazy implementation if you want.)

For your questions:

  1. Not everyone runs the nozzle as a CF app. Some people write to a disk that persists across VM replacement and so on.
  2. Yes, if I recall, but at the time we found that loading apps in the background on a fixed interval should have been OK.
  3. Not sure why you implemented a TTL ...

aeijdenberg commented 6 years ago

@shinji62 - I wonder if maybe the best of both worlds might be to initially do a big load before starting consumption of the firehose, and then combine lazy loading for new apps and a big slightly randomised TTL (to spread out requests) for refreshing app / space / org names?

In that manner the DB should be initially populated, and only new messages for new apps may trigger a momentary pause?
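
A rough sketch of that "warm start, then lazy" idea, with illustrative names (nameCache and listAllAppGUIDs are assumptions, not the project's API):

```go
// Sketch only: pre-populate the cache before consuming the firehose, so only
// apps created after startup trigger a lazy fetch mid-stream.
package warmup

type nameCache interface {
	Name(guid string) (string, error)
}

func WarmThenConsume(cache nameCache, listAllAppGUIDs func() ([]string, error), consume func()) error {
	guids, err := listAllAppGUIDs()
	if err != nil {
		return err
	}
	for _, g := range guids {
		if _, err := cache.Name(g); err != nil {
			return err
		}
	}
	consume() // only now start reading firehose events
	return nil
}
```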

shinji62 commented 6 years ago

@aeijdenberg If you set the CC pull time to 0, that is mostly what happens: we fully load the DB, and then fill it in when unknown apps show up.

It could just be that calling the space summary endpoint is still expensive; I still think some calls are expensive.

aeijdenberg commented 6 years ago

@shinji62 - thanks - I think I misread that code on first investigation. I suspect that setting that may well solve most of the issues we were looking to solve. I'll take a deeper look and maybe suggest a few small tidy up PRs.

Yes, enumerating apps seems extraordinarily slow from cloud controller, which seems like a problem worth solving on its own - though my Ruby knowledge isn't as extensive.

shinji62 commented 6 years ago

@aeijdenberg any update on your side?

aeijdenberg commented 6 years ago

@shinji62 - not yet - other than we've been running our patched version without issue for about a week now and are seeing a big reduction in load across all our envs. I'll try to get back onto this shortly.

aeijdenberg commented 6 years ago

@shinji62 , I put an update on our progress in the PR: https://github.com/cloudfoundry-community/firehose-to-syslog/pull/192

I'll paste here too:

We've been running this for a few weeks now without issue and have seen a big drop in CPU usage on our API servers.

Summary of changes:

  1. Load all apps before starting to read from the nozzle. In this manner we won't fill our buffer by falling behind before we have apps populated.
  2. Instead of fetching all apps on a schedule, lazily re-fetch each app, space, or org based on a randomized TTL.
  3. Allow an in-memory cache instead of Bolt when not persisting the database (which the default BOSH release does not, as the default path is not in a persistent location).
  4. Add options to strip app name suffixes (see the sketch after this list). This is useful for blue/green and zero-downtime pushes, where apps typically have temporary suffixes during renames. Since this data is often incorrect anyway, we may as well strip it.
  5. Stop using cf-client for important API calls, as it often wants to get smart and make successive calls for extra info that we don't need, contributing to latency and CPU load on API servers.
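
A small illustrative sketch of the suffix stripping in change 4; the actual flag name and matching rules in the PR may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// stripAppSuffixes removes the first matching suffix from an app name,
// e.g. "myapp-venerable" -> "myapp".
func stripAppSuffixes(name string, suffixes []string) string {
	for _, s := range suffixes {
		if s != "" && strings.HasSuffix(name, s) {
			return strings.TrimSuffix(name, s)
		}
	}
	return name
}

func main() {
	fmt.Println(stripAppSuffixes("myapp-venerable", []string{"-venerable"})) // myapp
}
```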

I think that's mostly it. We could also consider bumping the default TTL for the cached data from 60 seconds to something much larger, since the only downside of caching is a delay in picking up an app/space/org rename, which I imagine is a fairly rare event.

shinji62 commented 6 years ago

@aeijdenberg Thanks, will take a look next week.

Scoobed commented 6 years ago

@shinji62 We would like to pull this version, as it seems like a promising solution: our Cloud Controllers are seeing 7,000,000 404s a day because of #4 in the list (blue-green deploys), and caching in general would help. A longer TTL is a great idea too. Do you have any idea when this will get out of the develop branch and be rolled into master?

aeijdenberg commented 6 years ago

It's probably waiting on me to update the tests. I'll submit a PR today to remove the failing tests - I realise there's probably more wanted, but it's at least better than tests that don't compile.

shinji62 commented 6 years ago

@Scoobed As @aeijdenberg said, I am waiting for his PR for the tests. So I guess a new release will occur this week.

aeijdenberg commented 6 years ago

@shinji62 - have added some basic tests here: https://github.com/cloudfoundry-community/firehose-to-syslog/pull/196

shinji62 commented 6 years ago

@Scoobed @aeijdenberg release 5.0.0 is now out. Please use it, and reopen this issue if needed.