elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Metricbeat - non incremental network usage metrics #2783

Closed mhainfarecom closed 5 years ago

mhainfarecom commented 8 years ago

It would be very useful if Metricbeat also shipped network usage as non-incremental values, e.g. the delta between consecutive executions. We should have these additional counters (or replace the existing ones) to be able to:

  1. Track CPU and network usage on the same graph, to look for correlation between those metrics.
  2. Create the same kind of rules to trigger actions via Watcher and in Kibana dashboards. This makes development and debugging much easier.
  3. Reuse the same template for all metrics for scripting purposes.
  4. Avoid Timelion, which is not an option for us because in production environments we are not going to see Kibana 5 sooner than the end of 2017.

tsg commented 8 years ago

Marked as an enhancement request, but to be honest we're unlikely to implement this, because the general strategy is to do derivatives at query time via pipeline aggregations. So if we were to add derivatives in Metricbeat, they would only be a temporary solution.

It's worth mentioning that Timelion is available in Kibana 4.2 as a plugin.

mhainfarecom commented 8 years ago

I don't understand how you can use incremental metrics. Please show me some real examples of how to use them, because I don't know how I can write triggers based on them, nor how I can present them on a single graph with other metrics like CPU usage. Please point me to some docs or Kibana queries.

tsg commented 8 years ago

Here are the relevant docs for pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-derivative-aggregation.html
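For example, per-minute deltas of network-in computed at query time would look roughly like this (a sketch; field name from the system module, interval to taste):

POST metricbeat-*/_search
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" },
      "aggs": {
        "net_in": { "max": { "field": "system.network.in.bytes" } },
        "net_in_delta": { "derivative": { "buckets_path": "net_in" } }
      }
    }
  }
}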

I think Watcher supports pipeline aggregations, so you should be able to use that in watches.

Unfortunately Kibana doesn't support that yet, so currently the only way of getting derivative graphs in Kibana is via Timelion. We're working on improving on that.
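For reference, a minimal Timelion expression of that kind would be something like:

.es(index=metricbeat-*, metric=max:system.network.in.bytes).derivative()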

We realize this is quite inconvenient at the moment (we have the same issue in our sample dashboards), but we're hesitant to add a temporary solution when a much better and more complete one is on the horizon.

mhainfarecom commented 7 years ago

It will always be inconvenient, because Elasticsearch is not always used with Kibana. We are using it with Spark (DataFrames), and querying incremental data is also very painful in SQL. Would it be possible to also emit deltas next to the incremental values?

kaem2111 commented 7 years ago

+1 I need this feature too, but for diskio (see https://discuss.elastic.co/t/diskio-growth-additionally-to-total-values/66531). Keeping the previous value in memory inside a running Metricbeat is low-hanging fruit compared to finding the correct predecessor document later in ES or Kibana. I had to skip showing Metricbeat diskio usage in our POC because my performance experts told me they need diskio deltas; totals they can compute on their own using the sum function in Kibana.

trevorndodds commented 7 years ago

I use Grafana (derivative function) to handle these incrementing counters.

kaem2111 commented 7 years ago

Your Metricbeat video from 7 Dec demonstrates again that we need deltas; otherwise you can only use the (limited) Timelion derivative, as shown in the video.

ruflin commented 7 years ago

@kaem2111 Thanks for joining the webinar. Can you elaborate on the "limited" part of the derivatives in Timelion?

heilaaks commented 7 years ago

+1 There are two major issues that cause pain for me with Elastic Metricbeat. These issues make it hard to use and to scale in terms of servers and other users. Yes, both can be classified as laziness and missing competence on my part.

  1. The Timelion plugin is too difficult to use and to teach to others. It is hard to find documentation, examples and use cases, and the query syntax differs from the other Elasticsearch queries. I want to automate collection of the statistics over REST with the Elasticsearch Python client and store them, for example, in CSV files and Plotly graphs, without extra query logic for some metrics.

  2. I simply cannot get the filters to work to reduce the number of metric events. Too many metrics come in by default with process metrics to scale this up. I would like to have all the system metrics plus a very limited set of process-specific metrics, so that the filters would remove 90% of the unnecessary data and allow more efficient usage of Elasticsearch.

I want my metrics simple with filters and deltas :)

ruflin commented 7 years ago

@heilaaks Thanks for the inputs.

ruflin commented 7 years ago

An additional note from my side on this thread about why I prefer incremental metrics over precalculated deltas in most cases. I'm aware that this is not an answer to some of the above problems, but I thought it's worth sharing these thoughts.

If a delta data point is lost for whatever reason, the calculated total becomes incorrect. From totals, in contrast, the correct derivatives can always be calculated, and data points can even be removed over time for compression while the correct values can still be derived. A very simple example:

Assume we have one data point per second for our network data, and the following 4 totals, one per second: 1, 3, 6, 9. The same data as deltas would be: 1, 2, 3, 3. One can be converted into the other, and sum(delta) = 9 can be calculated.

Assume we now lose the second data point for whatever reason. The two data sets would then look as follows: totals 1, 6, 9 and deltas 1, 3, 3. Now sum(delta) = 7, which is not correct. From the totals, however, the lost value can still be estimated by interpolation: halfway between 1 and 6 gives 3.5.
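Laid out side by side:

             t1  t2  t3  t4
    totals:   1   3   6   9    (deltas recoverable at query time: 1, 2, 3, 3)
    deltas:   1   2   3   3    (sum(delta) = 9)

After losing the t2 data point:

    totals:   1   -   6   9    (t2 can still be interpolated: (1 + 6) / 2 = 3.5)
    deltas:   1   -   3   3    (sum(delta) = 7; the true total of 9 is gone for good)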

mhainfarecom commented 7 years ago

Hi

Regarding exports: we are pulling this data into Spark and doing some data mining and predictions. Most advanced machine learning tools are Python or Spark based, and there's no way to do this in Kibana. Using delta values allows us to preprocess and filter the data more easily. With incremental data we always have to pull everything, which is far too much in terms of MB. Right now we are reprocessing the incremental metrics in our own custom code to produce deltas.

Best

Mikolaj

heilaaks commented 7 years ago

Hello,

@ruflin

For the metrics, I see two different use cases from our point of view: 1) in-house development and 2) customer site troubleshooting. Due to various excuses, it is difficult to have a dedicated Elasticsearch cluster to store the metrics for case 1). Because the software changes every hour, I would need to store something for reference. For case 1), I also want to collect the data from large environments for offline analysis. The large environments are rare and expensive, so I do not want to hold on to them unnecessarily.

For case 2), the system is too complex for most end users to troubleshoot, and we would need to see the metrics in detail for at least 3-4 days. I would also like to align them visually to search for anomalies, for example between networking and disk IO. This means the data has to be moved from the customer to development at a sane size, through a specific tool chain. Due to various excuses, we are not able to dump the Elasticsearch data from customers and import it back for Elastic analytics as of now.

The Kibana PDF export is a nice addition, but it is a bit limited for these use cases. The flexibility of, for example, a Plotly graph is very nice for offline analysis. I attached one example at the end.

Perhaps these points highlight limitations in our usage and competence more than limitations in the Elastic software. If we had everything nicely in place, running machine-learning-based analytics to observe anomalies in metrics with ELK, combined with e.g. Spark for more sophisticated actions, would be a better approach.

We are trying to achieve more sophisticated solutions and a more suitable architecture, but we are not there yet. What I have been looking for in metrics is:

  1. Very simple and lightweight to install on top of Linux on multiple hosts, with a GUI to analyze and search data manually and a high-level language library (Python) to export the data, e.g. to Plotly graphs and CSV raw data.
  2. A single-service host monitor that is supervised in order to survive restarts. The host monitor must deliver the basic metrics without sweating: host disk (throughput and IOPS) on volume level, networking (throughput and packets) on interface level, host CPU, host CPU stats, host file counts, host memory, and process-level metrics with filters for CPU and memory.
  3. Single-node, decent-size Docker container(s) for Elasticsearch and Kibana with a default dashboard, able to handle metrics from 5-30 hosts, so that browsing the graphs and data through Kibana is fast and people will love to use it. Speed is also needed so that, if I decide to run e.g. export queries, they do not delay the metric inserts and cause slopes in the measured metrics.
  4. The possibility to write monitoring plugins for specific services. I like the selection of plugins so far, since they suit us nicely. I quickly tried the Kafka plugin, but it was missing most of the Kafka monitoring metrics needed to analyze the performance of Kafka itself.

@ruflin Your comment about incremental metrics and estimates sounds good for bullet 3 above. This might be good for cases with fast monitoring intervals (e.g. once per 1s) and automated anomaly analysis. But our case is simple, and I am again too lazy to write, test and use differently behaving metrics.

As for the Metricbeat filters, I just cannot get them to work. I would need the filters to select only a few services to send CPU and memory metrics for. This would improve the performance of the single-node Elasticsearch and Kibana that collects metrics from multiple hosts.

I think the problem with the filters is that the process filter by default matches on the service name, which for e.g. Java and Python processes is just 'java' or 'python'. We could do the separation based on the username mapped to the service, or on the service command line. The Metricbeat processor actions do not include a 'match_event'-style filter that would let me simply select 'kafka,zookeeper,spark'. For drop_event I need to write a negated regexp over multiple fields (username and cmdline), which complicates the syntax.

For example, I tried to create a filter that should drop all events, but they still keep coming, so obviously I have misunderstood something :) I could use a full metricbeat.yml example for dummies with complex service filtering.

metricbeat.modules:

- module: system
  metricsets:
    - cpu
    - diskio
    - filesystem
    - process
  enabled: true
  period: 10s
  processors:
    - drop_event:
        when:
          regexp:
            username: '.*'

Plotly example: monitor_cluster_type_x_5m_intervals.zip

andrewkroh commented 7 years ago

There's a filtering example here. Note that you need to s/processors/filters/.
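Applied to the snippet above, that would look roughly like this (a sketch; depending on the version the field may also need its full path, e.g. system.process.username):

- module: system
  metricsets:
    - cpu
    - diskio
    - filesystem
    - process
  enabled: true
  period: 10s
  filters:
    - drop_event:
        when:
          regexp:
            username: '.*'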

ruflin commented 7 years ago

@mhainfarecom @heilaaks Thanks for sharing the insights from your side.

@heilaaks Can you share some details on what you missed on the Kafka side so we can potentially add it? Consumer groups will be part of the next release if that is what you are missing ;-)

heilaaks commented 7 years ago

@ruflin

With my limited competence, and without having verified the statements below: Kafka monitoring can generate a lot of data and can get complicated. Which metrics to collect is also a matter of opinion and use case.

I think the first problem with Kafka is to understand what is happening inside the Kafka streams and to get a view of how much Kafka is consuming, for example from the networking point of view. Kafka itself is very fast, and I would not worry about its performance (latencies or data rates) in the normal case. For these reasons, I would like to see in Kibana:

  1. MessagesInPerSec, BytesInPerSec and BytesOutPerSec per Kafka topic as OneMinuteRate, to understand what is going on and how fast in each Kafka topic. This helps with Kafka configuration and performance analysis, and is a basic metric for analyzing Kafka.

  2. MessagesInPerSec, BytesInPerSec and BytesOutPerSec for the whole cluster as OneMinuteRate, to see the totals. This makes it easier to compare the networking metrics of the whole cluster against the switching capacity.

  3. BytesRejectedPerSec, FailedProduceRequestsPerSec, FailedFetchRequestsPerSec, IsrShrinksPerSec, IsrExpandsPerSec, LeaderCount, PartitionCount and OfflinePartitionsCount as OneMinuteRate, to follow (possible) failures.

  4. NetworkProcessorAvgIdlePercent and RequestHandlerAvgIdlePercent as OneMinuteRate, which are nice for seeing how much Kafka idles during tests.

A few things off the top of my head, without checking the manuals, that may be helpful:

  1. Check that the data is not duplicated across hosts. For example, if the Kafka cluster contains n hosts, the Kafka data is likely reported from the cluster's point of view, so polling a single node is probably enough.

  2. Getting the consumer lag of all Kafka topics easily would be a killer feature, but this is difficult since the lag cannot be fetched from the brokers. Also, not all clients support consumer groups. I like that Logstash uses a consumer group, which makes it possible to get the lag from the command line (not good). Integrating the Kafka topic lag into the Logstash module would be useful.

  3. Probably after the basic metrics, I would like to see Produce (RequestsPerSec), FetchConsumer (RequestsPerSec), FetchFollower (RequestsPerSec), Produce (TotalTimeMs), FetchConsumer (TotalTimeMs) and FetchFollower (TotalTimeMs) to do actual performance analysis.

  4. From the end-user point of view (as a lazy one), I would like to automate the installation of Metricbeat with pretty much the same configuration on all hosts. If Metricbeat collects the same cluster-level Kafka metrics from every host, I have to maintain a different configuration per host.

  5. When the Kafka metrics start to stack up, users will probably want more complicated filters to select what they want. For example, the JMX objects tend to have an attribute set like "Count", "50thPercentile", "75thPercentile", "98thPercentile", "99thPercentile", "999thPercentile", "Min", "Max", "Mean", "StdDev", "MeanRate", "OneMinuteRate", "FiveMinuteRate" and "FifteenMinuteRate", and when everything is on partition level, things may get complicated.

  6. From my point of view, for example, I want a single-node Elasticsearch to store all the metrics and to scale up to 5-30 hosts sending data.

  7. Analyzing Zookeeper performance together with Kafka is important.

  8. If I have n partitions per topic, I tend to want to see the data at topic level. That is, the queries have to make it easy to sum partition-level data up to topic level smoothly in Kibana (with examples).

Because of these possible complications, I would first implement a basic set that concentrates on the topic- and cluster-level view rather than providing all the details. Make the module simple to use, cover the basic analytics needs, and then improve.

@andrewkroh

Thank you. I managed to get the filters to work, and wrote an example with tips and tricks in a Metricbeat forum posting.

urso commented 7 years ago

@heilaaks That's quite an essay. I don't want to hijack this issue, which is about network metrics in the system module, with Kafka monitoring. It's still a worthwhile discussion though, as we just started with Kafka monitoring support. Unfortunately, Kafka monitoring is quite a beast compared to other systems, and it will take some time to improve it further. Can you please open another GitHub issue or Discuss topic for follow-up discussions?

falken commented 7 years ago

+1 for this. Having to calculate the deltas ourselves precluded the use of Metricbeat. Our ES documents have to fit a specific pattern in order for our dashboard to display them correctly.

I assume this is an issue because the underlying Go library gives you a cumulative value, which would require maintaining the previous value somewhere.

If someone were to put together a PR, would you prefer to always include the delta values, or to have some sort of configuration option?

ruflin commented 7 years ago

@falken Having https://github.com/elastic/kibana/pull/9725 in Kibana 5.4 should make it much easier to deal with derivatives. See also https://github.com/elastic/beats/issues/2783#issuecomment-270066202 for some more reasoning (and rest of thread). Can you share more details on your specific pattern and the use case?

Nodens2k commented 7 years ago

When you try to monitor a rather busy web server, the system.network.in/out.bytes metric values periodically overflow as they reach MAX_LONG. The consequence is jagged charts capped at MAX_LONG if you try to visualize the raw data, and charts with negative values if you use derivatives.

I find negative values in network usage particularly annoying. They are not only unaesthetic; they also make legend values like avg, min or current totally useless.

All this could be avoided by getting delta values directly from metricbeat.

pkese commented 7 years ago

Because that's the one true way...

Everybody should use Emacs! Everybody should program in C! Everybody should use Apple computers! Everybody should support anti-abortion! Everybody should pray to the same god! Everybody should speak French! 64KB is enough for everyone! Everybody should use incremental metrics!

Can we now please close this ticket...

falken commented 7 years ago

@ruflin We ended up just accumulating this in Logstash before sending to Elasticsearch. It's not very elegant, but it works. We don't actually use Kibana for the affected portion of the app; we query Elasticsearch directly to determine if something is out of whack before alerting.
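Roughly along these lines — a simplified sketch rather than our exact config; it assumes a single pipeline worker (-w 1) so the shared state stays consistent, and Metricbeat 5.x field names:

filter {
  ruby {
    # class-level map of the last seen totals, keyed by host + interface
    init => "@@last = {}"
    code => "
      key  = [event.get('[beat][hostname]'), event.get('[system][network][name]')].join('|')
      cur  = event.get('[system][network][in][bytes]')
      prev = @@last[key]
      # a counter reset (reboot/overflow) shows up as cur < prev; skip the delta then
      event.set('[system][network][in][bytes_delta]', cur - prev) if prev && cur && cur >= prev
      @@last[key] = cur
    "
  }
}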

hilt86 commented 7 years ago

@ruflin do you still hold that incremental is superior, considering that the interface counters wrap around, and that to get a usable graph of the traffic flowing through my servers right now I need this beast?

.es(index=metricbeat*, timefield=@timestamp, metric=max:system.network.in.bytes).derivative().divide(1048576).lines(fill=2, width=1).color(green).label("Inbound traffic").title("Network traffic (MB/s)"), .es(index=metricbeat*, timefield=@timestamp, metric=max:system.network.out.bytes).derivative().multiply(-1).divide(1048576).lines(fill=2, width=1).color(blue).label("Outbound traffic").legend(columns=2, position=nw)

As discussed, that only works on some timeframes - change the period and we have to adjust the query. I'd much rather be investigating interesting patterns than figuring out how to display a simple traffic graph. If Elastic wants Metricbeat to be the de facto choice for collecting metrics, then this needs to be way simpler.

pkese commented 7 years ago

@Nodens2k you poor man.

Next time you buy a new machine, just get one with wider registers;
because of course elastic folks won't change this metric.

As explained above, this is the 'one true way'. And besides, even if it wasn't, they are just not that elastic about this issue.

Now regarding your new machine:
according to Poe's law, a 58 bit architecture should suffice for this network metric; there's really no need to get more than that unless you have more than a 10Gb network adapter.

However, even with a better machine, you might wish to make sure you're NOT running any of this thing called JavaScript inside your Kibana [1], because JavaScript can only correctly handle subtraction of 53-bit integers. Any more than that and it is going to lose precision.

Now, if you really need to use Kibana on a high-traffic server while analyzing your numbers with JavaScript, just make sure to restart the server about once a month and you will probably be fine.

Oh, and by the way... don't use any of those Sum aggregations on such metrics either. You need to first subtract those numbers and then sum them together, rather than the other way around.

So now you know it.
Can we now really close this ticket and move on.

[1] Some browsers have an option to disable JavaScript altogether, so that should be safe as well.

pkese commented 7 years ago

Yes, I know. I was being sarcastic.

I would like to apologize to the whole community, and especially to the people from Elastic. They are providing us with a great service and a wonderful product and, more than that, they are even giving it away for free. They certainly did not deserve disrespect.

It is my personal opinion that in this case, however, they seem to have failed to hear the feedback, so I spiced up the discussion a little bit -- apparently in the wrong way.

No matter what, dealing with network metrics is a major pain, and I don't think that
.es(index=metricbeat, timefield=@timestamp, metric=max:system.network.in.bytes).derivative().divide(1048576).lines(fill=2, width=1).color(green).label("Inbound traffic").title("Network traffic (MB/s)"), .es(index=metricbeat, timefield=@timestamp, metric=max:system.network.out.bytes).derivative().multiply(-1).divide(1048576).lines(fill=2, width=1).color(blue).label("Outbound traffic").legend(columns=2, position=nw)
is a proper solution for it.

Our Metricbeat indices have thousands of fields, and I can't see how 2 more would cause major damage in this department.

Sorry again and thanks to @falken for pointing out his disagreement with my ways.

ruflin commented 7 years ago

@hilt86 @pkese Thanks for bringing up this issue again. Not directly implementing the change does not mean we are not listening. It's good to see people being passionate about the product.

A few thoughts from my side since my last comment in March:

So far the main solution discussed is to add the non-incremental values to Metricbeat, which would be pretty easy to do from an engineering perspective. I don't worry about adding 2 fields; I worry more about how many other fields the same logic would apply to. The network metrics mentioned here are definitely the two low-hanging fruits. Another solution is what @falken did: do the conversion in LS or ES. We could even provide ingest pipelines for that in Metricbeat. Any other implementation options?

simianhacker commented 7 years ago

If we went with incremental (delta) counters, then in Kibana's TSVB you would need to run a cumulative sum and then a derivative (with the unit set to 1s and a positive-only agg) to maintain the ability to zoom in and out using the auto interval. (Not sure what I was thinking... you can zoom in and out, it would just be averaged together.) You could also never scale to per minute; you would always be stuck at per-second sampling. Leaving the counters as they are, you only need a derivative (plus positive-only), and the derivative agg lets you set the unit (per minute, per second, etc.). The issue with 53-bit integers in JavaScript is a non-issue for TSVB, since all the processing happens in Elasticsearch. Here is what system.network.in.bytes looks like in TSVB:

[screenshot: TSVB graph of the system.network.in.bytes rate]

@pkese The "sum" of network traffic is handled by TSVB's special "Series Agg" aggregation. You have to split the series by the hosts you're trying to aggregate, do all the calculations (derivative) on each individual host metric, then aggregate those values back together (for each bucket). Having incremental counters doesn't fix this, because you would still need the cumulative sum/derivative trick to maintain zooming.

@Nodens2k We added an aggregation to TSVB called "positive only" that you should use with derivatives; it drops all those negative dips when the counters reset.

Here is a video on using TSVB to visualize rates: https://youtu.be/CNR-4kZ6v_E

As far as moving to incremental counters goes, I'm :-1:. From a visualization perspective, I don't think it provides any advantages over the counters we have today.

ghost commented 6 years ago

I would love incremental counters for the Metricbeat network and disk based values, as we are stuck on Kibana 4.5 :(

petrkalina commented 6 years ago

I switched from collectd and dockerbeats to Metricbeat in order to simplify and unify our collection of runtime/Docker stats. The motivation was to get in line with current development and to simplify the deployment. We wanted to use the data to display basic graphs, i.e. CPU, memory and disk IO usage. The graphs are not rendered in Kibana, but within the web UI of the software we develop.

Currently we face the problem of how to use the incremental disk IO metrics provided by the Metricbeat docker module to render graphs that outline the demands on the system as it evolves over time, i.e. read/write bytes per second or similar information. I guess the same is true for other metrics as well.

For this use case, I believe it would be an advantage to also have a non-incremental / delta representation of the metrics available. The argument for its inclusion would be to enable easy adoption of Metricbeat over collectd and dockerbeats for non-expert users.

jsoriano commented 5 years ago

I am going to close this issue, as it is unlikely that this will be implemented in Metricbeat. I am not going to continue the discussion, because there are already enough arguments here for all kinds of opinions :slightly_smiling_face: Also, since this issue was opened, TSVB has been released, which provides a general solution for visualizing derived values of the collected data. Please use https://discuss.elastic.co for specific questions about the usage of TSVB or the data collected by Metricbeat.