cloudfoundry / loggregator-release

Cloud Native Logging
Apache License 2.0
215 stars 149 forks source link

Feature Proposal: Forward container metrics to user-specified syslog drain #150

Closed CAFxX closed 6 years ago

CAFxX commented 8 years ago

Some of our users are requesting to receive the metrics of their application in the stream to their syslog endpoint for monitoring and alerting purposes. There are more or less kludgy workarounds that the users themselves can enact to have the metrics in their logs (emit them from their application, have something poll the CC API and emit them as logs). Operators could probably come up with something as well, but as this sounds like a reasonable use case we were wondering if there was any interest in having it upstream.

cf-gitbot commented 8 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/130420679

The labels on this github issue will be updated when the story is started.

AllenDuet commented 8 years ago

Are you referring to application metrics generated by the app itself, or container metrics where an app instance is hosted, e.g. CPU, Memory, disk?

The apps themselves can bind to a service endpoint and emit any custom metric data (e.g. transactions processed, exceptions caught, etc.), in whatever format you'd like to see. This is usually a more straightforward approach to capturing app metrics than sending them through Loggregator. I don't want to confuse this approach with Loggregator's syslog drain config through CUPS -l; I'm suggesting you create an entirely different service to send the app metrics data to.
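As an aside, the direct approach described here might be sketched roughly as follows. This is not a Loggregator API; the service name, credentials shape, and metric payload are all hypothetical, and the only Cloud Foundry-specific assumption is that bound service credentials appear in the VCAP_SERVICES environment variable:

```python
import json
import os

def metrics_endpoint_from_vcap(service_name):
    """Find the credentials of a bound (e.g. user-provided) service in
    the VCAP_SERVICES environment variable that Cloud Foundry injects
    into every app container. Returns None if the service is not bound."""
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == service_name:
                return instance["credentials"]
    return None

def format_metric(name, value, unit):
    """Serialize one custom metric as a JSON line, ready to POST to the
    endpoint above (the field names and transport are illustrative)."""
    return json.dumps({"name": name, "value": value, "unit": unit})
```

The app would then ship each `format_metric(...)` line to the endpoint found in the credentials, entirely outside of Loggregator.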

As you mention, Loggregator only supports app logs as an output into the firehose. I'm a little confused by the mention of a syslog endpoint, as you'd normally expect logs to be transmitted over that protocol, but I think you are asking whether some sort of metric format can be sent to an endpoint via that protocol. Can you confirm or clarify my understanding here?

CAFxX commented 8 years ago

I was thinking about container metrics (sorry for the confusion). The use case would be having the user-bound syslog drain (and the log/metrics processing systems behind it) be able to receive the container metrics for each application instance (for user-defined monitoring/alerting purposes).

Edit: issue ticket updated, hopefully it's clearer now

msaliminia commented 8 years ago

+1

msaliminia commented 8 years ago

I'm working on this issue. Does it make sense to make this feature optional, giving CF operators the ability to enable or disable forwarding ContainerMetrics to the syslog drain?

CAFxX commented 8 years ago

I think so: although in our case we would like to have it enabled, there may be operators who prefer to keep it disabled.

hev commented 8 years ago

@msaliminia we are exploring some other ideas around segregating Container Metrics that would conflict with this PR. I'll keep you posted as this feature matures.

msaliminia commented 8 years ago

@ahevenor ContainerMetrics are already segregated. We want to ship them via syslog sink too.

hev commented 7 years ago

@mariash and @CAFxX we are researching ways to support this in https://www.pivotaltracker.com/story/show/135231265 for this and PR https://github.com/cloudfoundry/loggregator/pull/166

CAFxX commented 7 years ago

@ahevenor great. Just one question: I opened https://www.pivotaltracker.com/n/projects/993188/stories/135231265 but it looks like there are no details and no activity. Are they protected somehow?

hev commented 7 years ago

@CAFxX - no - this is just a research spike for the team to see if we have a way to support this currently. I just dragged it to the top of our backlog and should hopefully have some more details tomorrow.

hev commented 7 years ago

@CAFxX it seems @msaliminia pointed this out earlier. We have undocumented support for this through the app stream. We have updated our README to reflect how to configure this; see https://github.com/cloudfoundry/loggregator/blob/develop/src/trafficcontroller/README.md#endpoints

That said we have some architectural changes that will allow more configurability of syslog and support for this request. We are trying to avoid putting additional strain on dopplers. See our Epic on this - https://www.pivotaltracker.com/epic/show/3119911

hev commented 7 years ago

@CAFxX and @msaliminia we have some changes that will scale out syslog separately, and strategically we'd like to separate metrics and logs rather than combine them. Logs and metrics have different delivery needs, hence our thinking here.

CAFxX commented 7 years ago

we'd like to separate metrics and logs rather than combine them. Logs and metrics have different delivery needs

@ahevenor while I agree that they have different delivery needs, most of our users like to have them in a single place for consumption (normally behind some syslog endpoint, e.g. Splunk). So while I understand the idea of keeping things separated for better scale-out, please consider that having logs and metrics delivered to the same place is a fairly common request we receive from our users.

hev commented 7 years ago

@CAFxX - some further questions for you to better understand the request.

Is there a specific format in which Splunk expects metrics to be rolled into syslog? Ideally that conversion would be accomplished in a nozzle.

That said, a nozzle doesn't provide metrics to users without firehose scope (e.g. in a multi-tenant environment). Giving developers access to metrics through a separate (new) CUPS service is something we have discussed several times, and it will become more feasible with some new architecture changes rolling out.

MatthiasWinzeler commented 7 years ago

@ahevenor I wondered what the state is here; we recently got these requests from our customers too: they like the way they can use a syslog drain to send their logs to the ELK stack, but would like to have the metrics sent there as well. They currently need to implement a workaround that polls the API and sends the metrics somewhere.

Regarding the format for Splunk: I can't speak for Splunk (though I highly doubt that their syslog endpoint can handle metrics), but in our case, the Logstash instance fronting the ELK stack is highly configurable and could handle almost any kind of metrics format within syslog.

What's the state of the efforts regarding the syslog separation? Thank you!

hev commented 7 years ago

@MatthiasWinzeler it's been a while since I have reviewed this idea. I definitely see the value here and I could see us adding this to the scalable syslog version of syslog drains. How did you format the metrics you sent to ELK? json? How often did you emit them?

MatthiasWinzeler commented 7 years ago

@ahevenor Thanks for coming back to this. To clarify: Our customer would like to get the container metrics, e.g. CPU, memory, disk etc. They have already solved the issue for custom app metrics by sending metrics directly to the desired endpoint (without using syslog drains etc.).

Regarding format: I guess you would need to define some format to send the metrics over syslog, e.g. something like Etsy's MetricLogster.

But for our use case, the exact format doesn't matter: anything parseable with a regex is fine, since Logstash is quite powerful at transforming inputs (e.g. some metric in a syslog string) into desired outputs (e.g. some Graphite endpoint). Example: https://www.elastic.co/guide/en/logstash/current/config-examples.html#_processing_syslog_messages
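As a rough Python stand-in for what such a Logstash grok-plus-output pipeline would do (the `metric=... value=...` line shape is invented for illustration, not a Loggregator format; the output follows the Graphite plaintext protocol):

```python
import re

# Hypothetical emitter format, e.g. "... metric=cpu value=0.23 ..."
# embedded somewhere in a syslog message.
METRIC_RE = re.compile(r"metric=(?P<name>\S+)\s+value=(?P<value>[0-9.]+)")

def to_graphite(app, syslog_line, timestamp):
    """Extract one metric from a syslog line and render it in the
    Graphite plaintext protocol: '<path> <value> <unix-timestamp>'.
    Returns None when the line carries no metric."""
    m = METRIC_RE.search(syslog_line)
    if m is None:
        return None
    return f"{app}.{m.group('name')} {m.group('value')} {timestamp}"
```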

Side note: in the real implementation, Logstash would probably not send the metrics to Elasticsearch/Kibana but to a different backend (e.g. InfluxDB) for Grafana, since it offers superior support for metrics.

Edit: regarding frequency: I assume you've already set a frequency at which the container metrics are emitted over the firehose. I think this interval should be fine for syslog too.

CAFxX commented 7 years ago

We also had to solve this by wrapping the cf cli in a script (deployed with the binary buildpack) that emits JSON objects to stdout (one per instance per measurement interval). The logs of this "monitoring application" are then forwarded to the same syslog drain as the logs of the applications being monitored.

Initially we wrapped the cf cli in a loop that would poll the stats API. We then replaced it with the (app-)firehose plugin so that we don't have to put undue strain on the CCs.
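A minimal sketch of that polling workaround might look like the following. The JSON field names are an arbitrary choice for illustration; the assumption is only that `cf curl /v2/apps/:guid/stats` returns the CC v2 per-instance stats document with a `stats.usage` block:

```python
import json
import subprocess

def parse_stats(stats_json):
    """Flatten a CC v2 /v2/apps/:guid/stats response into one JSON
    line per app instance, suitable for printing to stdout where
    Loggregator picks it up as an ordinary app log line."""
    lines = []
    for index, entry in json.loads(stats_json).items():
        usage = entry.get("stats", {}).get("usage", {})
        lines.append(json.dumps({
            "instance": index,
            "cpu": usage.get("cpu"),
            "memory": usage.get("mem"),
            "disk": usage.get("disk"),
        }))
    return lines

def poll_once(app_guid):
    """One polling iteration: ask the cf cli for stats and emit each
    instance's metrics to stdout for the syslog drain to forward."""
    out = subprocess.run(["cf", "curl", f"/v2/apps/{app_guid}/stats"],
                         capture_output=True, text=True, check=True).stdout
    for line in parse_stats(out):
        print(line)
```

Running `poll_once` in a loop is exactly the kind of CC strain mentioned above, which is why switching to a firehose-based plugin was the better option.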

hev commented 7 years ago

@MatthiasWinzeler the independently scalable syslog release work is done. We are working through some bosh work now to get it packaged into scalable syslog. I expect this will be complete in one of the next few cf-release cycles.

Another question: how often are you interested in seeing container metrics in drains? Would opting into this during a deployment be an acceptable way to access this feature? It could clog up your drain with lots of extra content.

MatthiasWinzeler commented 7 years ago

@ahevenor Thanks to the team! Exciting to see our use case made possible.

How often are they sent in the current implementation? I guess every 60 seconds should be completely sufficient.

Regarding opting in: if possible, we'd prefer it to be the default, since we're working on a monitoring service in the marketplace that relies on the metrics being in the drain. If the user needs to enable some flag or redeploy their apps for the service to work, it worsens the user experience.

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/150639135

The labels on this github issue will be updated when the story is started.

MatthiasWinzeler commented 7 years ago

@ahevenor We'd love to try out the new feature and start building our tools on top of this. Is this already in a deployable state?

In the scalable-syslog release README there is a standalone BOSH Lite manifest, but no hints on how it can be integrated with CF.

Are there any docs/example manifest out there that show how to integrate it with an existing CF deployment?

hev commented 7 years ago

@MatthiasWinzeler You can use scalable syslog by deploying it standalone and then deploying cf with the following ops files: -o operations/experimental/disable-etcd.yml -o operations/experimental/enable-rlp.yml

That said, forwarding container metrics is not yet built. We're still gathering evidence on the use case and UX. Just to confirm, you are using this with ELK. What kind of monitoring or tooling do you plan to build on top of it?

MatthiasWinzeler commented 7 years ago

@ahevenor Thanks for the explanation.

In our case, the metrics will be sent over syslog to Logstash. Logstash will then parse the metrics from the log line and send them to the backend system in the required format. It's not quite clear yet which backend stack will be used (and it will probably differ between departments, teams, etc.), but there are several possible candidates:

Those would primarily visualize the metrics for analysis etc.

CAFxX commented 7 years ago

@ahevenor in our case logs are sent via syslog either to Splunk or ELK, depending on the user.

jasonkeene commented 7 years ago

This is how we are going to encode container metrics:

117 <30>1 2017-10-04T13:00:52.662629-06:00 myhostname someapp - - [gauge@47450 name="cpu" value="0.23" unit="percentage"]
115 <30>1 2017-10-04T13:00:52.662723-06:00 myhostname someapp - - [gauge@47450 name="memory" value="2324" unit="bytes"]
113 <30>1 2017-10-04T13:00:52.662746-06:00 myhostname someapp - - [gauge@47450 name="disk" value="4324" unit="bytes"]
121 <30>1 2017-10-04T13:00:52.662765-06:00 myhostname someapp - - [gauge@47450 name="memory_quota" value="4324" unit="bytes"]
119 <30>1 2017-10-04T13:00:52.662785-06:00 myhostname someapp - - [gauge@47450 name="disk_quota" value="1234" unit="bytes"]
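On the consuming side, a drain could pull these gauges back out of the RFC 5424 structured-data element with a small parser. A sketch, assuming exactly the `gauge@47450` element and quoting shown above:

```python
import re

# Matches the structured-data element used for container metrics, e.g.
# [gauge@47450 name="cpu" value="0.23" unit="percentage"]
GAUGE_RE = re.compile(
    r'\[gauge@47450 name="(?P<name>[^"]+)" '
    r'value="(?P<value>[^"]+)" unit="(?P<unit>[^"]+)"\]'
)

def parse_gauge(syslog_line):
    """Extract the gauge name/value/unit from one syslog line in the
    format above; returns None when no gauge element is present."""
    m = GAUGE_RE.search(syslog_line)
    if m is None:
        return None
    return {"name": m.group("name"),
            "value": float(m.group("value")),
            "unit": m.group("unit")}
```

A full implementation would use a proper RFC 5424 parser (structured data allows escaped quotes and multiple elements), but the regex captures the shape of these particular messages.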

bradylove commented 6 years ago

@ahevenor since this is in cf-syslog-drain-release v3.1+ do we want to close this issue?

hev commented 6 years ago

This is now available in cf-deployment 1.9.0 using the new cf-drain-cli.