cloudfoundry / diego-release

BOSH Release for Diego

[FEATURE REQUEST]: As a CF Operator, I expect that when I observe the "AppInstanceExceededLogRateLimitCount" metric, I can see the app instance details that caused this value to be incremented so I can take action with the app owner #457

Closed. sunjayBhatia closed this issue 4 years ago.

sunjayBhatia commented 4 years ago

Is your feature request related to a problem?

Right now, AppInstanceExceededLogRateLimitCount is emitted (here) as a per-cell tagged counter metric. The loggregator agent adds only the cell-specific tags to the metric, nothing more, resulting in metrics that look like this:

deployment:"cf" job:"diego-cell" index:"0e98fd00-47b2-4589-94f0-385f78b3a04d" ip:"10.0.1.12" tags:<key:"instance_id" value:"0e98fd00-47b2-4589-94f0-385f78b3a04d" > tags:<key:"source_id" value:"rep" > counterEvent:<name:"AppInstanceExceededLogRateLimitCount" delta:1 total:206 >

As a result, operators must do additional work to filter the app instances on the cell this metric comes from in order to identify the actual culprit of the chatty logging. This is not straightforward, and we can do better for operators so that it is much easier to identify problematic apps (and even individual app instances).

Describe the solution you'd like

In the log rate limit reporter, we have access to the metric tags set on the app's desired LRP and use them to push our own log line into the app log stream. We should be able to do the same thing with our AppInstanceExceededLogRateLimitCount metric, tagging it accordingly so that the emitted value is a per-app-instance metric rather than a per-cell metric.

We can potentially just make a new version of our IncrementCounter method to add tags to the envelope we want to send using this option.
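Something along these lines could work. The sketch below is illustrative only: the IncrementCounterWithTags method and the tag keys are assumptions, not the existing diego-logging-client API.

```go
package executor

import "fmt"

// IngressClient mirrors the shape of the metron client interface used by the
// executor, reduced to what matters here. IncrementCounterWithTags is the
// hypothetical addition; internally it would attach per-envelope tags to the
// counter before emitting it.
type IngressClient interface {
	IncrementCounter(name string) error
	IncrementCounterWithTags(name string, tags map[string]string) error
}

const exceededLogRateCounter = "AppInstanceExceededLogRateLimitCount"

// reportRateLimitExceeded shows how the log rate limit reporter could reuse
// the tags it already derives from the desired LRP for the log line it pushes
// into the app log stream. The tag keys here are illustrative.
func reportRateLimitExceeded(client IngressClient, lrpTags map[string]string) error {
	tags := map[string]string{
		"app_id":      lrpTags["app_id"],
		"app_name":    lrpTags["app_name"],
		"instance_id": lrpTags["instance_id"],
	}
	if err := client.IncrementCounterWithTags(exceededLogRateCounter, tags); err != nil {
		return fmt.Errorf("failed to emit %s: %w", exceededLogRateCounter, err)
	}
	return nil
}
```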

Diego repo: executor

Describe alternatives you've considered

mdelillo commented 4 years ago

In lieu of emitting app instance details in the log message, we have come up with some cf commands that will show you that data using existing plugins.

First, install the firehose and log-cache plugins:

cf install-plugin 'Firehose Plugin'
cf install-plugin 'log-cache'

This command will show when any app has exceeded the log rate: cf tail -f rep | grep AppInstanceExceededLogRateLimitCount

And this command will show the specific app which has exceeded the rate: cf nozzle -f LogMessage | grep 'app instance exceeded log rate limit'

sunjayBhatia commented 4 years ago

In lieu of emitting app instance details in the log message

I am not suggesting emitting app instance details in the log message, but rather tagging the counter metric (here) we emit with the same tags as the log message we put in the app log stream, so that someone aggregating metrics can drill down into the metric values for individual apps. This would change the source/instance id of the metric to come from the app instance instead of the cell.

I'm also curious how this cf nozzle approach will work on a larger foundation; inspecting each log line from every app to find the chatty ones seems expensive.

sunjayBhatia commented 4 years ago

reopening so we can continue to chat about this

robertjsullivan commented 4 years ago

If we add a tag that adds application_instance_id and application_guid, the AppInstanceExceededLogRateLimitCount metric will have an enormous cardinality. This will create a large storage burden on downstream monitoring systems. Given that we expect operators will want to rectify the bad actors quickly, we would be storing a lot of metrics that are not being used.

I think that we should have the count be by diego cell only. Logging the offending app names to Diego's logs on a set cadence seems like a reasonable solution for identifying the bad actors.

jvshahid commented 4 years ago

As a CF Operator, I expect that when I observe the "AppInstanceExceededLogRateLimitCount" metric,

@sunjayBhatia I don't understand why the operator is the persona of interest in this case. I thought the whole purpose of this story was to avoid having applications negatively affect their neighbors. I think the app developer will be more interested in knowing that their application logs are getting dropped.

If we add a tag that adds application_instance_id and application_guid, the AppInstanceExceededLogRateLimitCount metric will have an enormous cardinality.

@robertjsullivan How is this different from the CPU and Memory metrics that we are currently emitting?

aminjam commented 4 years ago

@robertjsullivan How is this different from the CPU and Memory metrics that we are currently emitting?

for CPU and memory, we end up emitting the application_id as the source_id (https://github.com/cloudfoundry/diego-logging-client/blob/88015a7/client.go#L217), as opposed to keeping the source_id as "rep" and emitting the application-level information in tags under that metric
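To make the contrast concrete, here is a rough sketch of how the two kinds of metrics are attributed today (simplified stand-in types, not the actual loggregator envelope definitions):

```go
package example

// envelope is a simplified stand-in for the loggregator v2 envelope, just to
// show where the attribution differs between the two metric kinds.
type envelope struct {
	SourceID   string
	InstanceID string
	Name       string
}

func attributionExamples(cellID, appGUID, instanceIndex string) (envelope, envelope) {
	// Today: the counter is attributed to the cell's rep job.
	perCellCounter := envelope{
		SourceID:   "rep",
		InstanceID: cellID, // Diego cell GUID
		Name:       "AppInstanceExceededLogRateLimitCount",
	}

	// Container CPU/memory gauges: attributed to the app itself, so they can
	// be queried per app instance downstream.
	containerGauge := envelope{
		SourceID:   appGUID,       // application_id becomes the source_id
		InstanceID: instanceIndex, // app instance index
		Name:       "cpu",
	}

	return perCellCounter, containerGauge
}
```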

sunjayBhatia commented 4 years ago

I would think we would also want to change the source id to be the application id, to make the metric useful to app developers, if we were going to tag it as suggested.

@jvshahid yeah, that is a good point. This could be reframed as an app-developer-facing feature (so they can keep track of their log usage, sort of another resource usage metric to key off of), or maybe we can conclude that app developers have enough info for now with the log we push into the log stream telling them their app has been rate limited.

robertjsullivan commented 4 years ago

The cardinality isn't different from the CPU and memory metrics, but I think those are more valuable since that data is constantly relevant. In this case, it's a counter that we'd be emitting for every instance but that is only relevant to a small few.

If we wanted to emit a logging metric for applications, I think we should emit the total count of log lines from a container. An operator could use the rate() function to determine whether it's violating the set threshold. App devs would find that useful as well.
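For illustration, such a counter could look roughly like the sketch below; the LogLinesEmitted metric name and the interface are assumptions, not existing Diego code.

```go
package example

// counterClient is the minimal shape needed here; the real executor already
// has a metron client with a delta-based counter method.
type counterClient interface {
	IncrementCounterWithDelta(name string, delta uint64) error
}

// totalLogLinesMetric is a hypothetical per-app-instance counter of log lines
// forwarded from a container. If it were emitted with the app GUID as the
// source_id, a downstream system could evaluate something like
// rate(LogLinesEmitted[1m]) > <configured log rate limit>.
const totalLogLinesMetric = "LogLinesEmitted"

// countForwardedLines would be called from the container log streaming path
// with the number of lines forwarded since the last report.
func countForwardedLines(c counterClient, lines uint64) error {
	return c.IncrementCounterWithDelta(totalLogLinesMetric, lines)
}
```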

aminjam commented 4 years ago

I'm also curious how this cf nozzle approach will work on a larger foundation, inspecting each log line from every app to find the chatty ones seems expensive

@sunjayBhatia We tested this in a large foundation and were able to detect the chatty apps while logged in as an admin.

Closing this issue since we have enough information for app developers to know their app is being rate limited.

Please feel free to re-open this if you'd like to see a different outcome.

jkbschmid commented 4 years ago

Hi everyone, I’d like to start the discussion again, since we are having problems with using the Firehose Plugin.

Logging the offending app names to Diego's logs on a set cadence seems like a reasonable solution for identifying the bad actors.

For our purpose, this would be an ideal solution. The code change would hopefully be very small, since a log message is already emitted, just into the application logs and not into the executor log. Even if the same cadence (1s) were used, the additional log load should be manageable, even for large foundations.
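A minimal sketch of what I have in mind (the function and field names are made up; the actual reporter already knows which containers hit the limit):

```go
package executor

import (
	"time"

	"code.cloudfoundry.org/lager" // import path may differ by lager version
)

// logOffendersOnCadence periodically writes the GUIDs of app instances that
// hit the log rate limit during the last interval into the rep's own log
// stream, which is shipped via rsyslog rather than through loggregator.
func logOffendersOnCadence(logger lager.Logger, interval time.Duration, fetchOffenders func() []string, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// Container GUIDs that exceeded the limit since the last tick.
			guids := fetchOffenders()
			if len(guids) > 0 {
				logger.Info("app-instances-exceeded-log-rate-limit", lager.Data{
					"container-guids": guids,
				})
			}
		case <-stop:
			return
		}
	}
}
```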

What do you think?

aminjam commented 4 years ago

@jkbschmid Thanks for restarting the conversation. I have a few questions.

On our large foundations, where metron agents can send a total of 100k msg/s this doesn’t work properly. Instead, we saw that Dopplers drop lots of messages.

Can you please clarify:

  1. At the time of seeing this issue, was this feature turned on or off?
  2. If turned on, what was the value?
  3. Are there screenshots of KPIs for Dopplers?
  4. Have you tried using the noisy neighbor plugin for detection of the apps?

I remember seeing another issue in this area that might be related to what you are experiencing.

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

jkbschmid commented 4 years ago

Hi @aminjam, thanks for the quick response!

1/2. The log rate limit was enabled with a (relatively high) limit of 5,000 logs/s, and we could already see a positive effect (load on the Dopplers decreased). We intentionally set it this high in order to not impact well-behaving applications. We'd like to set a lower limit, but that's exactly why we need the visibility.

  3. With 35 Dopplers in our foundation we are already at the scaling limit. The two spikes in "Doppler dropped messages" at 13:52 and 13:53 are from starting the nozzle twice. Please let me know if I forgot an important metric: [screenshot]

  4. Yes, we are also making use of the noisy neighbor nozzle. We have two problems with it, however. The NNN only reports based on logs per minute; if apps only have short "log spikes", their logs will be dropped, but they won't appear in the NNN ranking (we observed such behaviour). We also observed that the NNN's reports differed by a large amount (multiple 100k logs/min) from the actual throughput at the RLP.

I also had a look at the issue you linked but, since we are already at Doppler's scalability limit, this is unfortunately not an option for us.

aminjam commented 4 years ago

This is quite interesting @jkbschmid! I am very interested in figuring out the spike for Doppler Dropped Per Second.

We intentionally set it this high in order to not impact well-behaving applications.

In our experience, 100 (default) log-lines/second seemed to have allowed the well-behaved applications to operate normally. We were under the impression that if an app emits 5,000 log-lines/second, it fits in the noisy-neighbor category. We'd like to learn more about the rates selected by users; can you share some insights about why such a high rate was chosen?

With 35 Dopplers in our foundation we are already at the scaling limit.

It looks like the Dopplers are receiving around 150K messages/second. I have heard of users being able to manage up to 15-20 million/minute (250K/second) without an issue. I wonder if there is some tuning we could do to figure out what's causing the drops.

Looking at the ingress rate, I don't see a very big jump during the same timeframe. A few follow-up questions:

The NNN only reports based on logs per minute. If apps only have short “log spikes”, then their logs will be dropped, but they won’t appear in the NNN ranking

This is an interesting finding. I don't have a lot of background on the plugin, but I am wondering if we can fix it to report the noisy neighbor properly.

@pianohacker I am curious to get your thoughts on the loggregator side of things.

pianohacker commented 4 years ago

Wanted to add a few useful bits of info:

  1. You can likely scale to 40 Doppler VMs, and try vertically scaling those as well. 40 is about where we've seen horizontal scaling stop producing better results.

  2. There's a confounding factor that may be present in your dashboards; Doppler (and many other Loggregator components) have two different metrics with the name dropped, distinguished by the direction tag:

    doppler{direction="ingress"} - Increments when the component is overloaded and can't read incoming messages fast enough. Usually points to a noisy emitter somewhere upstream, or a VM with underscaled CPU.

    doppler{direction="egress"} - Increments when the component can't get information out the door fast enough. Usually points to something downstream being unable to keep up, or network/load balancer congestion.

    Your dashboard is likely mixing both metrics together, making it difficult to tell what kind of dropping Doppler is having to do. Based on the fact that your system is seeing fairly low dropped messages most of the time, I'm betting that Doppler is dropping messages on egress because your noisy neighbor nozzle is underscaled and can't keep up.

  3. It's possible to scale the noisy neighbor nozzle such that it can keep up with application traffic, but there's an easier way. We just released version 3.1.0 of the Log Cache CLI, which has a new and improved version of the cf log-meta --noise command. Log Cache is already running on your foundation at the necessary scale to support log access for the CF CLI, so it's a fairly trustworthy source of information about app log throughput.

Caveat: all of the above happen downstream from where Diego rate-limits logs, so they can't tell you anything about logs that Diego has already rate-limited.

Let me know if that helps!

jkbschmid commented 4 years ago

Thanks a lot for your detailed answers, @aminjam and @pianohacker, this was very helpful!

First of all, sorry for the misunderstanding: There are basically two large foundations that I am working on, one with 35 Dopplers, the other with 40, with Dopplers ingressing around 250k msgs/s (per foundation).

I think the foundations you mentioned with regards to messages/second and Doppler scalability are exactly these two (most likely you have already had some discussions about them with Plamen Doychev or Jochen Ehret on Slack or via mail).


ingress rate for only diego-cells

I can’t provide the ingress rate filtered for diego-cells only, but we should see any spikes in the Dashboard for “Metron Agents Overall Sent Messages” under "Doppler overall received" as well, right?

Firehose performance scaling KPIs

Log Transport Loss Rate = loggregator.doppler.dropped{direction=ingress} / X -> 0
Doppler Message Rate Capacity = ~150k msgs/s / 35 Dopplers = ~4,300 msgs/(s*Doppler)
Reverse Log Proxy Loss Rate = dropped / X -> 0

can you share some insights about high rate selection?


We chose 5,000 lines/s as the initial value in order to limit as few applications as possible. Since there are 40k app instances running on our foundation, the usage and log patterns are quite diverse. The sheer number of apps makes it hard for us to check individually whether a certain log usage pattern is actually required by the stakeholder or whether it should be limited. We'd like to identify and continuously monitor the apps whose logs are dropped so that we can safely lower the limit.

jkbschmid commented 4 years ago

...continued, in response to @pianohacker:

  1. As mentioned above, we are at 40 Dopplers on the other landscape.

  2. Thanks for the clarification, that was super helpful! Our dashboard is indeed displaying both directions together, which was a confounding factor. As you correctly predicted, the spike was for doppler{direction="egress"}. It seems the Firehose nozzle is underscaled and cannot keep up with the amount of logs.

  3. Thanks for the pointer to the Log Cache CLI! Great to have a solution which doesn’t require additional deployments. We already tried it out and it looks like a great way of getting more insight into the logging system without any additional computational requirements.

With the plugins that have been mentioned (and maybe some scaling/fixes), it seems possible to identify the apps being hit by the rate limit. Still, there are three advantages I see to enriching the infrastructure log over using plugins.

Thanks again for your help so far!

sunjayBhatia commented 4 years ago

@jkbschmid sounds like you would like to see the rep logs print out some details when an app exceeds the log rate limit. I personally think that is reasonable and agree with the advantages you mentioned above.

aminjam commented 4 years ago

I am glad that the Log Cache CLI was able to solve the problem after scaling the nozzle properly. If there were a log in the rep, reported on a set cadence, that identified a container-guid reaching the log rate limit, it would tell us that this instance of the app reached the limit sometime during the interval.

Apps with irregular log spikes are registered (and I guess it is natural for an app to have peak loads and therefore peak log loads)

We wouldn't be able to tell the exact time when it reached the limit and would still have to look at the Firehose nozzle for that information.

The information does not pass the loggregator itself (where it may be dropped if the loggregator is overloaded), but is routed via rsyslog

This is true, and it would ensure the logs are delivered via rsyslog.

If a container-guid is logged (a single instance of an app), would it be useful and easy to map that information back to the app?

@jkbschmid If you still feel that using this extra piece of information in the log would help with diagnosing the issue faster, I think it would make sense to log the offending containers on a set cadence. Would you be open to a PR with the above change?

Given that you are running a large foundation, I think having a blog post would also be extremely valuable when it comes to using this feature with the Log Cache CLI and these potential improvements. Would you be open to creating a post describing the problems you've had and how you were able to figure out and mitigate the issue in your foundation?

jkbschmid commented 4 years ago

Hi @aminjam, I've had a more detailed look at the code and, if my understanding is correct, there is even an app GUID available in the tags (see here). The problem for us as operators is that we'd need to enrich this information with at least organization info (which is not available) in order to make it concretely usable.

The main problems we need to solve are:

  1. Understand which apps are limited by the log rate limit
  2. Identify stakeholders that cause majority of the load on the logging system
  3. Identify and alert stakeholders of apps that are rate limited

For 2) we would need at least the organizations; for 3) also the org or space users of the identified apps. So even if we had a log with an app GUID, we wouldn't be able to diagnose the issue much faster than with the Log Cache CLI plugin. I could still provide a PR, though, if you deem this feature important. Otherwise I'd be okay with closing this issue and using the plugin.

Of course, the blog post idea sounds great! I'm not sure if this is still blog post material without any change in Diego, but if you're still interested you can just shoot me a message (via mail or Slack) with some details :)

Thanks again for your help!

aminjam commented 4 years ago

@jkbschmid Adding organization and space names to the rep logs is not a pattern we follow in other places or components. Even if we passed in a logger object, that type of information can get outdated when a user runs cf rename on an org or space. Since we can identify the app using the Log Cache CLI, I am leaning toward leaving the capabilities as is.