Process Metrics (Top 10)

hisa-tanaka commented 2 years ago

Is your feature request related to a problem? Please describe.

With the current process log-based metrics in Fluent Bit, we always have to know in advance which processes to monitor. However, one common use case in production operations is identifying which process(es) are consuming the most resources. Today we have to name a specific process to monitor. In other words, when there is an issue we don't necessarily know which process we want to monitor, yet log-based metrics require this to be known in advance.

Describe the solution you'd like

It would be great if Fluent Bit could collect metrics for the top processes. For example, if CPU usage is > 90% on a server, this would let people identify which process(es) are consuming the most CPU and reduce MTTI (mean time to identify).

Additional Notes

As a reference, this is similar to the process collector in the Prometheus node_exporter: https://github.com/prometheus/node_exporter/blob/master/collector/processes_linux.go

patrick-stephens commented 1 year ago

Typically this is something I would use process exporter for rather than node exporter: https://github.com/ncabatoff/process-exporter There is also a dashboard for it: https://grafana.com/grafana/dashboards/249-named-processes/

It may be a good idea to do something similar with a new exporter just for processes - I can see there being users who want node or processes (some wanting both too) so otherwise we need a way to configure node exporter to include only what we want. Plus there are the concerns around permissions and container usage too.

The exporter should give you metrics for all processes, I think, and then you can query however you want in Grafana, etc. Some people will want top 10, some top 25, and so on, so I don't think that should be part of the exporter. I did something along these lines for the benchmarking work here:

https://github.com/calyptia/benchmarking/blob/c0afe28ced2c6f6bfa9c9e8bea4b3cd5ff6bcf47/aggregator/scripts/provision.sh#L53-L55

I can query in Grafana like so:

sum by(groupname) (topk(20, namedprocess_namegroup_cpu_seconds_total{source="node"}))

[Screenshot from 2022-09-01 11-21-41]
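To make the shape of that query concrete, here is a minimal Python sketch of what it computes: keep only series with source="node", take the k largest individual series, then sum the survivors per groupname. The sample data, the `mode` label values, and the function name `top_k_by_group` are illustrative assumptions, not output from a real exporter, and real PromQL evaluation has more rules than this.

```python
import heapq
from collections import defaultdict
from typing import Dict, List


def top_k_by_group(samples: List[dict], k: int, source: str = "node") -> Dict[str, float]:
    """Rough analogue of: sum by(groupname) (topk(k, metric{source="node"}))."""
    # Label matcher: keep only series whose "source" label matches.
    selected = [s for s in samples if s["labels"].get("source") == source]
    # topk: the k individual series with the largest values.
    top = heapq.nlargest(k, selected, key=lambda s: s["value"])
    # sum by(groupname): add up the surviving series per process group.
    totals: Dict[str, float] = defaultdict(float)
    for s in top:
        totals[s["labels"]["groupname"]] += s["value"]
    return dict(totals)


# Hypothetical samples (values are cumulative CPU seconds).
samples = [
    {"labels": {"groupname": "fluent-bit", "mode": "user", "source": "node"}, "value": 10.0},
    {"labels": {"groupname": "fluent-bit", "mode": "system", "source": "node"}, "value": 5.0},
    {"labels": {"groupname": "prometheus", "mode": "user", "source": "node"}, "value": 7.0},
    {"labels": {"groupname": "sshd", "mode": "user", "source": "node"}, "value": 1.0},
    {"labels": {"groupname": "other", "mode": "user", "source": "aggregator"}, "value": 99.0},
]

result = top_k_by_group(samples, k=3)
```

Because the metric is a counter, this ranks processes by cumulative CPU time; ranking by current usage would likely need a rate over a time window instead.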

hisa-tanaka commented 1 year ago

@patrick-stephens Thank you

Process exporter could work, assuming it lists all processes, but is there an option to filter/match by process name and also to choose which metrics (counters/gauges) to collect? Otherwise the concern is that this could be too much information to send from every endpoint. I understand that top 10/25 could be done with queries on the collected data, but configuring metrics and frequency at the source would be beneficial.

Would this be possible with a modified process exporter for Fluent Bit?

patrick-stephens commented 1 year ago

Prometheus already provides a few options for controlling what is scraped, including dropping metrics you do not want: https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus/ Scraping frequency per target is similarly controllable, e.g. some docs from Grafana:

The important question is probably what to do for remote_write, so this may need functionality equivalent to relabelling to drop metrics in that output plugin: https://grafana.com/docs/grafana-cloud/billing-and-usage/control-prometheus-metrics-usage/usage-reduction/#controlling-remote-write-behavior-using-write_relabel_configs This could also be done through a local/in-cluster Prometheus instance that proxies metrics to a remote Prometheus, which is a good idea anyway so that you have a local store (and alerting, etc.) in case of networking or remote failure. The local Prometheus can remote write to the remote Prometheus while dropping the metrics you do not want; it's a fairly common pattern.
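As a rough illustration of the drop-before-write idea, the following Python sketch mimics the `keep` and `drop` actions of Prometheus-style relabel rules: the source label values are joined with ";" and the regex must match the whole string. It is a simplified model under those assumptions, not the Prometheus implementation and not an existing Fluent Bit feature; the function name `apply_relabel_rules` and the sample series are hypothetical.

```python
import re
from typing import Dict, List

LabelSet = Dict[str, str]  # one series, including its "__name__" label


def apply_relabel_rules(series: List[LabelSet], rules: List[dict]) -> List[LabelSet]:
    """Apply simplified keep/drop relabel rules and return the surviving series."""
    kept = []
    for labels in series:
        survives = True
        for rule in rules:
            # Concatenate the source label values, as Prometheus does by default.
            value = ";".join(labels.get(name, "") for name in rule["source_labels"])
            # Relabel regexes are anchored, so use a full match.
            matched = re.fullmatch(rule["regex"], value) is not None
            if rule["action"] == "drop" and matched:
                survives = False
            elif rule["action"] == "keep" and not matched:
                survives = False
            if not survives:
                break
        if survives:
            kept.append(labels)
    return kept


# Illustrative series; the groupname values are made up.
series = [
    {"__name__": "namedprocess_namegroup_cpu_seconds_total", "groupname": "fluent-bit"},
    {"__name__": "namedprocess_namegroup_memory_bytes", "groupname": "fluent-bit"},
    {"__name__": "namedprocess_namegroup_states", "groupname": "fluent-bit"},
    {"__name__": "namedprocess_namegroup_cpu_seconds_total", "groupname": "sshd"},
]

# Keep only CPU and memory metrics, then drop everything for the sshd group.
rules = [
    {"source_labels": ["__name__"], "regex": r"namedprocess_namegroup_(cpu_seconds_total|memory_bytes)", "action": "keep"},
    {"source_labels": ["groupname"], "regex": r"sshd", "action": "drop"},
]

remaining = apply_relabel_rules(series, rules)
```

With rules like these applied on the sending side, only the CPU and memory series for the remaining process groups would be forwarded to the remote store.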

Personally, my view is that we (or any exporter) should provide a superset of the metrics people may be interested in, and each individual use case can then configure for its specific needs. Additional complexity on the provider side (both in the implementation and in the user having to manage that configuration) seems unnecessary here.

With regards to process-exporter, I do not want to go into the details of how to configure it here (this is a feature request for similar functionality on Fluent Bit), but there are options to configure what processes to monitor: https://github.com/ncabatoff/process-exporter#using-a-config-file-process-selectors

On the topic of this feature request, I agree it would be good to provide one or both of the following:

Node exporter equivalent for summary process information
Process exporter equivalent for more detailed information

Similarly, an update to the remote_write output plugin to natively support dropping metrics via an equivalent of the relabel syntax would be good, I think. So please submit PRs, and if you need anything feel free to ping me.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

hisa-tanaka commented 1 year ago

Please keep this issue open.

patrick-stephens commented 8 months ago

@cosmo0920 is this in place now?

cosmo0920 commented 8 months ago

Not sure, but process_exporter_metrics can handle per-process metrics for CPU, memory, threads, and others.

fluent / fluent-bit

Process Metrics (Top 10) #5958