Hi!
Thanks a lot for your feedback! Unfortunately I was really busy the last few days (and had a mail filter rule dumping my GitHub messages), so I did not see your message. I'll go through it as soon as I can and will give you the reply you deserve (hopefully).
Cheers, Thomas
You really knocked me off balance with your pull request. This is great. And WOW.
Monitoring not fixed thresholds but variable thresholds, and especially the change of values over time, is a very hot topic in the monitoring community. I like to use a combination where possible. At my customers I mostly use fixed thresholds because it's an easy and established way to do it. But I'm still searching for a way to monitor changes in growth rates over time, and we have some ideas how to realize that with tools from the Icinga/Nagios family of monitoring tools. Usually we put the performance data output into a graphing tool like Graphite or PNP (some wrappers and a web frontend for rrdtool). I heard of some monitoring plugins that monitor growth rates of data via the Graphite API. A coworker of mine is even testing how to create a similar tool for data in Elasticsearch. This check_logstash plugin might just be the last bit of motivation for me to look into it and create an example setup. If so, this will definitely go into the Readme of this plugin. (And the next release of the Icinga 2 book, which by now is sadly only available in German.)
About things the API is missing:
I'd really love to see some information, maybe just boolean values, that shows things like:

- Is Logstash able to send data to all outputs?
- Is at least one of the outputs blocked?
- Is the connection of one of the active inputs (e.g. Redis) not available?
- Is Logstash trying to reload the configuration but the configuration is invalid?
- Is Logstash configured to reload the configuration?
The first two examples would be most interesting. I just happened to have a customer (still using Logstash 1.5...) where I wished I had an easy way to find the bottleneck in their Elastic stack. They reconfigure their setup often and from time to time they run new load tests. And often it's not easy to tell whether just Elasticsearch is slow at indexing or whether they introduced a new filter (e.g. dns) that slows down the pipeline. So measuring the in-flight events is very important, but maybe there could be a better way that just tells me: your output can't take any more.
As a monitoring consultant I always struggle with monitoring Logstash. I don't want to read the Logstash logs with Logstash, because I once managed to build a loop in a testing environment, and that is a sort of destructive load test I don't want to see in production. But when a customer has the Elastic Stack in place I don't want to use other log management tools / plugins like check_logfiles to monitor Logstash. So all the vital information I stated above, taken from the Logstash logfile, would be a great addition to the API. These are just some quick thoughts. If I come up with more I'll let you know via the Logstash issue queue.
> Is Logstash able to send data to all outputs?
> Is at least one of the outputs blocked?

How is 'blocked' computed? An excess amount of time?
> Is the connection of one of the active inputs (e.g. Redis) not available?

I think we can track errors per plugin, though doing this will be per-plugin specific and require changes to each input. Totally doable, and something we can do gradually.
> Is Logstash trying to reload the configuration but the configuration is invalid?

+1. We should track failed reload attempts.
> Is Logstash configured to reload the configuration?

+1. I think we can expose this under /_node/, which shows the settings.
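As an aside, /_node can already be queried the same way as the stats endpoints. A tiny sketch for eyeballing whatever settings it exposes (assuming the default API port 9600, and keeping in mind that the exact fields vary by Logstash version) could look like this:

```python
# Tiny sketch: dump whatever settings /_node exposes, e.g. to eyeball
# whether config reloading is enabled. Assumes the default API port 9600;
# the exact fields returned vary between Logstash versions.
import json
from urllib.request import urlopen

with urlopen("http://localhost:9600/_node") as resp:
    node_info = json.load(resp)

# Print the pipeline-related settings block if present (field layout is
# version dependent, so fall back to the whole document).
print(json.dumps(node_info.get("pipeline", node_info), indent=2))
```
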
> Is Logstash able to send data to all outputs?
> Is at least one of the outputs blocked?
> How is 'blocked' computed? An excess amount of time?

Amount of time sounds reasonable. But it may be sufficient to flag outputs that we expect to have an active connection but that are unreachable.
> Is the connection of one of the active inputs (e.g. Redis) not available?
> I think we can track errors per plugin, though doing this will be per-plugin specific and require changes to each input. Totally doable, and something we can do gradually.

I really think this could be beneficial for monitoring, especially for those plugins used as caches like Redis or Kafka.
For the other things you gave a +1: do you want me to create issues in the Logstash queue or will you keep them in mind?
I'll file issues about them, no worries ;P
You saw that I merged your pull request, which actually replaced all my code. I'm still stunned by your willingness to help.
I had to change the way you read thresholds from the command line to make the plugin conform to the Monitoring Plugins Development Guidelines. Now it's a bit harder to use on the command line but easier to integrate into tools like Icinga 2 or Nagios.
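For readers unfamiliar with those guidelines: they define the common warning/critical range syntax such as 10, 10:, ~:10, 10:20 and @10:20. A minimal sketch of how such a range specification is evaluated (independent of check_logstash's actual option names) might look like this:

```python
# Minimal sketch of the Nagios/Monitoring Plugins range syntax, e.g.:
#   "10"      -> alert if value < 0 or value > 10
#   "10:"     -> alert if value < 10
#   "~:10"    -> alert if value > 10
#   "10:20"   -> alert if value < 10 or value > 20
#   "@10:20"  -> alert if 10 <= value <= 20 (inverted range)
def range_alert(spec: str, value: float) -> bool:
    invert = spec.startswith("@")
    spec = spec.lstrip("@")
    if ":" in spec:
        lo, hi = spec.split(":", 1)
        low = float("-inf") if lo == "~" else float(lo or 0)
        high = float("inf") if hi == "" else float(hi)
    else:
        low, high = 0.0, float(spec)
    outside = value < low or value > high
    return not outside if invert else outside

# Example: warn once a value leaves the 0..70 range
print(range_alert("70", 85.0))  # True -> raise WARNING
```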
I also changed some of the performance data output to have a bit more data available. Having open file descriptors as performance data gives the user the chance to use separate checks like check_graphite to watch for spikes or an abnormal rise in open file descriptors and alert when something strange happens there.
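For context, performance data in this ecosystem is appended after a pipe character in the 'label'=value[UOM];[warn];[crit];[min];[max] format, which is what tools like check_graphite or PNP consume. A small illustrative sketch (the labels are hypothetical, not necessarily check_logstash's exact output) could be:

```python
# Illustrative only: builds a perfdata string in the standard
# 'label'=value[UOM];[warn];[crit];[min];[max] format.
# The labels below are hypothetical, not check_logstash's exact output.
def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    return f"'{label}'={value}{uom};{warn};{crit};{minimum};{maximum}"

metrics = {"open_file_descriptors": 123, "heap_used_percent": 42}
line = "LOGSTASH OK | " + " ".join(perfdata(k, v) for k, v in metrics.items())
print(line)
# LOGSTASH OK | 'open_file_descriptors'=123;;;; 'heap_used_percent'=42;;;;
```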
The plugin is still not perfect and misses some extra thresholds, but with your help I was able to build something which might make life a lot easier for people using Logstash together with Icinga 2, Nagios, Naemon, Shinken, check_mk or whatever tool comes up in this monitoring family in the future.
I think we can close this -- feedback was the main purpose. I like the work you've done here :)
Hello!
I saw this mentioned on Twitter and had a look at the code. I am super excited to see what folks build with the new Logstash stats APIs.
I have some feedback, if you want it. I'd like to make sure we (logstash) ship useful data so our users can get the most actionable data from monitoring it.
Monitoring jvm.mem.heap_used_percent can be useful.

Things I learned from reading your Nagios check about what we can improve in Logstash: Logstash may still be missing important signals about which bad behaviors need action. I don't want to force users to monitor symptoms (CPU utilization) and then need human intervention and research to find the actual bad behavior.
I think how I would monitor Logstash would be to capture these stats and diff them over time. Many of the behavior-oriented stats are counters. For example, you could take two /_node/stats results read 5 seconds apart, subtract the older value from the newer value, and get roughly the rate (derivative) of that metric. I would use this for observing event processing rate and latency (filters and outputs have a duration_in_millis counter for the wall-clock time spent in each of them), and for alerting on these behavioral properties (data rate, time spent doing work, etc.). For pipeline/plugin stats, check out the /_node/stats/pipeline endpoint. Thoughts?
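A minimal sketch of that diff-over-time idea, assuming the monitoring API on the default port 9600 and Logstash 5.x-style field names under pipeline.events (exact JSON paths may differ between versions):

```python
# Minimal sketch: poll /_node/stats twice and turn the counters into rates.
# Assumes the Logstash monitoring API on localhost:9600 (the default) and
# Logstash 5.x-style field names (pipeline.events.*); adjust paths as needed.
import json
import time
from urllib.request import urlopen

def node_stats(host="localhost", port=9600):
    with urlopen(f"http://{host}:{port}/_node/stats") as resp:
        return json.load(resp)

def counter(stats, *path):
    value = stats
    for key in path:
        value = value[key]
    return value

interval = 5  # seconds between the two samples
older = node_stats()
time.sleep(interval)
newer = node_stats()

# Counter delta divided by the interval gives an approximate rate.
events_out_rate = (
    counter(newer, "pipeline", "events", "out")
    - counter(older, "pipeline", "events", "out")
) / interval

# duration_in_millis is a counter of wall-clock time spent in filters/outputs.
busy_millis = (
    counter(newer, "pipeline", "events", "duration_in_millis")
    - counter(older, "pipeline", "events", "duration_in_millis")
)

print(f"events out: {events_out_rate:.1f}/s")
print(f"time spent filtering/outputting: {busy_millis} ms over {interval} s")
```
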
One final question, is there anything we're obviously missing in the stats outputs?