Hi!
Thanks a lot for your feedback! Unfortunately I was really busy the last few days (and had a mail filter rule dumping my GitHub messages), so I did not see your message. I'll go through it as soon as I can and will give you the reply you deserve (hopefully).
Cheers, Thomas
You really knocked me off balance with your pull request. This is great. And WOW.
Monitoring not fixed thresholds but variable thresholds, and especially the change of values over time, is a very hot topic in the monitoring community. I like to use a combination where possible. At my customers I mostly use fixed thresholds because it's an easy and established way to do it. But I'm still searching for a way to monitor changes in growth rates over time, and we have some ideas how to realize that with tools from the Icinga/Nagios family of monitoring tools. Usually we put the performance data output into a graphing tool like Graphite or PNP (some wrappers and a web frontend for rrdtool). I heard of some monitoring plugins that monitor growth rates of data via the Graphite API. A coworker of mine is even testing how to create a similar tool for data in Elasticsearch. This check_logstash plugin might just be the last bit of motivation for me to look into it and create an example setup. If so, this will definitely go into the Readme of this plugin. (And the next release of the Icinga 2 book, which by now is sadly only available in German.)
About things the API is missing:
I'd really love to see some information, maybe just boolean values, that shows things like:

- Is Logstash able to send data to all outputs?
- Is at least one of the outputs blocked?
- Is the connection of one of the active inputs (e.g. Redis) not available?
- Is Logstash trying to reload the configuration but the configuration is invalid?
- Is Logstash configured to reload the configuration?
The first two examples would be most interesting. I just happened to have a customer (still using Logstash 1.5...) where I wished I had an easy way to find the bottleneck in their Elastic stack. They reconfigure their setup often and from time to time they run new load tests. And often it's not easy to tell whether just Elasticsearch is slow at indexing or whether they introduced a new filter (e.g. dns) that slows down the pipeline. So measuring the in-flight events is very important, but maybe there could be a better way that just tells me: your output can't take any more.
As a monitoring consultant I always struggle with monitoring Logstash. I don't want to read the Logstash logs with Logstash, because I once managed to build a loop in a testing environment, and that is a sort of destructive load test I don't want to see in production. But when a customer has the Elastic Stack in place I don't want to use other log management tools / plugins like check_logfiles to monitor Logstash. So all the vital information I stated above, taken from the Logstash logfile, would be a great addition to the API. These are just some quick thoughts. If I come up with more I'll let you know via the Logstash issue queue.
> Is Logstash able to send data to all outputs?
> Is at least one of the outputs blocked?

How is 'blocked' computed? An excess amount of time?
> Is the connection of one of the active inputs (e.g. Redis) not available?

I think we can track errors per plugin, though doing this will be per-plugin specific and require changes to each input. Totally doable, and something we can do gradually.
> Is Logstash trying to reload the configuration but the configuration is invalid?

+1. We should track failed reload attempts.
> Is Logstash configured to reload the configuration?

+1. I think we can expose this under /_node/, which shows the settings.
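As an aside, /_node can already be queried the same way as the stats endpoints. A tiny sketch for eyeballing whatever settings it exposes (assuming the default API port 9600, and keeping in mind that the exact fields vary by Logstash version) could look like this:

```python
# Tiny sketch: dump whatever settings /_node exposes, e.g. to eyeball
# whether config reloading is enabled. Assumes the default API port 9600;
# the exact fields returned vary between Logstash versions.
import json
from urllib.request import urlopen

with urlopen("http://localhost:9600/_node") as resp:
    node_info = json.load(resp)

# Print the pipeline-related settings block if present (field layout is
# version dependent, so fall back to the whole document).
print(json.dumps(node_info.get("pipeline", node_info), indent=2))
```
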
> Is Logstash able to send data to all outputs?
> Is at least one of the outputs blocked?
> How is 'blocked' computed? An excess amount of time?

Amount of time sounds reasonable. But it may be sufficient to flag outputs that we expect to have an active connection but that are unreachable.
> Is the connection of one of the active inputs (e.g. Redis) not available?
> I think we can track errors per plugin, though doing this will be per-plugin specific and require changes to each input. Totally doable, and something we can do gradually.

I really think this could be beneficial for monitoring, especially for those plugins used as caches like Redis or Kafka.
For the other things you gave a +1: do you want me to create issues in the Logstash queue or will you keep them in mind?
I'll file issues about them, no worries ;P
You saw that I merged your pull request, which actually replaced all my code. I'm still stunned by your willingness to help.
I had to change the way you read thresholds from the command line to make the plugin conform to the Monitoring Plugins Development Guidelines. Now it's a bit harder to use on the command line but easier to integrate into tools like Icinga 2 or Nagios.
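For readers unfamiliar with those guidelines: they define the common warning/critical range syntax such as 10, 10:, ~:10, 10:20 and @10:20. A minimal sketch of how such a range specification is evaluated (independent of check_logstash's actual option names) might look like this:

```python
# Minimal sketch of the Nagios/Monitoring Plugins range syntax, e.g.:
#   "10"      -> alert if value < 0 or value > 10
#   "10:"     -> alert if value < 10
#   "~:10"    -> alert if value > 10
#   "10:20"   -> alert if value < 10 or value > 20
#   "@10:20"  -> alert if 10 <= value <= 20 (inverted range)
def range_alert(spec: str, value: float) -> bool:
    invert = spec.startswith("@")
    spec = spec.lstrip("@")
    if ":" in spec:
        lo, hi = spec.split(":", 1)
        low = float("-inf") if lo == "~" else float(lo or 0)
        high = float("inf") if hi == "" else float(hi)
    else:
        low, high = 0.0, float(spec)
    outside = value < low or value > high
    return not outside if invert else outside

# Example: warn once a value leaves the 0..70 range
print(range_alert("70", 85.0))  # True -> raise WARNING
```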
I also changed some of the performance data output to have a bit more data available. Having open file descriptors as performance data gives the user the chance to use separate checks like check_graphite to watch for spikes or an abnormal rise in open file descriptors and alert when something strange happens there.
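For context, performance data in this ecosystem is appended after a pipe character in the 'label'=value[UOM];[warn];[crit];[min];[max] format, which is what tools like check_graphite or PNP consume. A small illustrative sketch (the labels are hypothetical, not necessarily check_logstash's exact output) could be:

```python
# Illustrative only: builds a perfdata string in the standard
# 'label'=value[UOM];[warn];[crit];[min];[max] format.
# The labels below are hypothetical, not check_logstash's exact output.
def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    return f"'{label}'={value}{uom};{warn};{crit};{minimum};{maximum}"

metrics = {"open_file_descriptors": 123, "heap_used_percent": 42}
line = "LOGSTASH OK | " + " ".join(perfdata(k, v) for k, v in metrics.items())
print(line)
# LOGSTASH OK | 'open_file_descriptors'=123;;;; 'heap_used_percent'=42;;;;
```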
The plugin is still not perfect and misses some extra thresholds, but with your help I was able to build something which might make life a lot easier for people using Logstash together with Icinga 2, Nagios, Naemon, Shinken, check_mk or whatever tool comes up in this monitoring family in the future.
I think we can close this -- feedback was the main purpose. I like the work you've done here :)
Hello!
I saw this mentioned on Twitter and had a look at the code. I am super excited to see what folks build with the new Logstash stats APIs.
I have some feedback, if you want it. I'd like to make sure we (logstash) ship useful data so our users can get the most actionable data from monitoring it.
Monitoring jvm.mem.heap_used_percent can be useful.

Things I learned from reading your Nagios check about what we can improve in Logstash: Logstash may still be missing important signals about which bad behaviors need action. I don't want to force users to monitor symptoms (CPU utilization) and then need human intervention and research to find the actual bad behavior.
I think how I would monitor Logstash would be to capture these stats and diff them over time. Many of the behavior-oriented stats are counters. For example, you could take two /_node/stats results read 5 seconds apart, subtract the older value from the newer value, and get roughly the rate (derivative) of that metric. I would use this for observing event processing rate and latency (filters and outputs have a duration_in_millis counter for the wall-clock time spent in each of them), and for alerting on these behavioral properties (data rate, time spent doing work, etc.). For pipeline/plugin stats, check out the /_node/stats/pipeline endpoint. Thoughts?
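A minimal sketch of that diff-over-time idea, assuming the monitoring API on the default port 9600 and Logstash 5.x-style field names under pipeline.events (exact JSON paths may differ between versions):

```python
# Minimal sketch: poll /_node/stats twice and turn the counters into rates.
# Assumes the Logstash monitoring API on localhost:9600 (the default) and
# Logstash 5.x-style field names (pipeline.events.*); adjust paths as needed.
import json
import time
from urllib.request import urlopen

def node_stats(host="localhost", port=9600):
    with urlopen(f"http://{host}:{port}/_node/stats") as resp:
        return json.load(resp)

def counter(stats, *path):
    value = stats
    for key in path:
        value = value[key]
    return value

interval = 5  # seconds between the two samples
older = node_stats()
time.sleep(interval)
newer = node_stats()

# Counter delta divided by the interval gives an approximate rate.
events_out_rate = (
    counter(newer, "pipeline", "events", "out")
    - counter(older, "pipeline", "events", "out")
) / interval

# duration_in_millis is a counter of wall-clock time spent in filters/outputs.
busy_millis = (
    counter(newer, "pipeline", "events", "duration_in_millis")
    - counter(older, "pipeline", "events", "duration_in_millis")
)

print(f"events out: {events_out_rate:.1f}/s")
print(f"time spent filtering/outputting: {busy_millis} ms over {interval} s")
```
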
One final question, is there anything we're obviously missing in the stats outputs?