cloudfoundry-community / firehose-to-syslog

Send firehose events from Cloud Foundry to syslog.
MIT License

Options to reduce impact on cloud controller performance #177

Closed als1860 closed 6 years ago

als1860 commented 6 years ago
I have multiple questions that I am hoping someone can help with.

First, I have a Cloud Foundry foundation which I think has a fairly high firehose message rate, ~50,000 messages per second. We found that a significant number of messages were missing at our syslog endpoint. We tried various things. The solution we settled on was increasing the number of dopplers and firehose-to-syslog instances, ending up at 15 instances. We then found that calls interacting with the cloud controllers began to perform terribly: our cloud controllers were being overrun by requests from firehose-to-syslog. We have scaled the number of firehose-to-syslog instances back again, but are losing messages.

I am looking to reduce, or at least slow down, the number of requests to the cloud controllers. Looking in client.go, there are a couple of parameters we are hoping to use.

My questions.

  1. requestLimit (cc-rps): "CloudController Polling request by second"
     a. What does this actually regulate?
     b. Does it truly default to 50 requests per second?
     c. What is the valid value range for this parameter?
  2. tickerTime (cc-pull-time): "CloudController Polling time in sec"
     a. What does this actually regulate? Is it the time between polling sessions, the limit for how long a polling session can run, etc.?
     b. What is the valid value range for this parameter?
  3. Do you have any other/better suggestions or options to try?
  4. What is a good rule of thumb for scaling firehose-to-syslog instances?

Thanks. Alan.

shinji62 commented 6 years ago

RequestLimit (cc-rps): "CloudController Polling request by second"
tickerTime (cc-pull-time): "CloudController Polling time in sec"

The two are related. f2s has a caching system: to get the app name, org name, and space name, we need to pull them from the Cloud Controller.

Currently we do a full pull of the CC the first time f2s starts, unless the cache db file is already there. After that, we do a full pull every cc-pull-time.

If we get an application ID that is not in the cache db (basically, if an app was pushed between the last and the next pull), we request that app's information from the CC.

Previously, to pull application information, we used an expensive Cloud Controller call known as IRD2. This put load on the CC because there was no limit on the number of objects fetched by the CC.

If cc-pull-time is 0 then we do not fully pull the CC and we request app information for every new app.
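
The caching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual f2s code: the `AppInfo`, `Fetcher`, `Cache`, and `fakeCC` names are hypothetical, and an in-memory map stands in for the cache db file.

```go
package main

import (
	"fmt"
	"sync"
)

// AppInfo holds the metadata attached to events (illustrative fields).
type AppInfo struct {
	Name, Space, Org string
}

// Fetcher is a hypothetical stand-in for the Cloud Controller client.
type Fetcher interface {
	FetchAll() (map[string]AppInfo, error)
	FetchOne(guid string) (AppInfo, error)
}

// Cache is an in-memory stand-in for the cache db.
type Cache struct {
	mu   sync.RWMutex
	apps map[string]AppInfo
	cc   Fetcher
}

func NewCache(cc Fetcher) *Cache {
	return &Cache{apps: make(map[string]AppInfo), cc: cc}
}

// FullPull merges every app known to the CC into the cache.
func (c *Cache) FullPull() error {
	all, err := c.cc.FetchAll()
	if err != nil {
		return err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for guid, info := range all {
		c.apps[guid] = info
	}
	return nil
}

// Get returns cached info and asks the CC only on a cache miss.
func (c *Cache) Get(guid string) (AppInfo, error) {
	c.mu.RLock()
	info, ok := c.apps[guid]
	c.mu.RUnlock()
	if ok {
		return info, nil
	}
	info, err := c.cc.FetchOne(guid)
	if err != nil {
		return AppInfo{}, err
	}
	c.mu.Lock()
	c.apps[guid] = info
	c.mu.Unlock()
	return info, nil
}

// fakeCC counts requests so the effect of the cache is visible.
type fakeCC struct {
	apps  map[string]AppInfo
	calls int
}

func (f *fakeCC) FetchAll() (map[string]AppInfo, error) {
	f.calls++
	return f.apps, nil
}

func (f *fakeCC) FetchOne(guid string) (AppInfo, error) {
	f.calls++
	info, ok := f.apps[guid]
	if !ok {
		return AppInfo{}, fmt.Errorf("app %s not found", guid)
	}
	return info, nil
}

func main() {
	cc := &fakeCC{apps: map[string]AppInfo{"guid-1": {Name: "web", Space: "dev", Org: "acme"}}}
	cache := NewCache(cc)
	if err := cache.FullPull(); err != nil {
		panic(err)
	}
	// An app pushed after the full pull is unknown to the cache.
	cc.apps["guid-2"] = AppInfo{Name: "api", Space: "dev", Org: "acme"}
	a, _ := cache.Get("guid-1") // cache hit, no CC request
	b, _ := cache.Get("guid-2") // cache miss, one CC request
	_, _ = cache.Get("guid-2")  // now a hit
	fmt.Println(a.Name, b.Name, "CC requests:", cc.calls)
}
```

In this sketch only the full pull and first-time lookups reach the CC; repeated events for a known app are served from the cache.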

The cc-rps setting applies only to the full pull, when we reload information every cc-pull-time. The default really is 50 RPS.

"What is a good rule of thumb for scaling firehose-to-syslog instances?" One per Traffic Controller should be enough. But if your syslog endpoint is slow to ingest logs, f2s is going to throw away your log messages.
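
As a rough back-of-the-envelope aid for the trade-off in this thread, the sketch below uses the numbers mentioned here (about 50,000 messages per second, cc-rps of 50, up to 15 instances). It assumes messages are split evenly across instances and that every instance could run its full pull at the same moment; both are simplifying assumptions, and the function names are hypothetical.

```go
package main

import "fmt"

// perInstanceRate returns the messages per second each instance must forward,
// assuming the firehose is split evenly across instances.
func perInstanceRate(totalRate float64, instances int) float64 {
	if instances <= 0 {
		return 0
	}
	return totalRate / float64(instances)
}

// peakCCRequests returns the worst-case Cloud Controller request rate
// if every instance runs its full pull at the same time.
func peakCCRequests(instances, ccRPS int) int {
	return instances * ccRPS
}

func main() {
	for _, n := range []int{1, 5, 15} {
		fmt.Printf("instances=%d msgs/s each=%.0f peak CC req/s=%d\n",
			n, perInstanceRate(50000, n), peakCCRequests(n, 50))
	}
}
```

Adding instances lowers the forwarding load on each one but raises the worst-case load on the CC, which matches the behavior reported at 15 instances.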

No Magic here

als1860 commented 6 years ago

Awesome. Thanks for the response. I think this will be very helpful. Just to make sure I understand the behaviors.

Is application data ever purged from the cache? If so, how is it determined when to purge it?

Thanks. Alan.

shinji62 commented 6 years ago

With cc-pull-time = 0, F2S will do a full pull only at startup.

We don't purge the data for now. If someone runs into an issue with the size of the dataset, we can add this option.