fluent / fluent-plugin-prometheus

A fluent plugin that collects metrics and exposes them for Prometheus.
Apache License 2.0

Metric comments generated even when no metrics #207

Open patrick-stephens opened 2 years ago

patrick-stephens commented 2 years ago

Following the guidance here: https://docs.fluentd.org/monitoring-fluentd/monitoring-prometheus#step-1-counting-incoming-records-by-prometheus-filter-plugin

Setting up the output plugin metrics also generates HELP and TYPE comments for other metrics that could exist but have no samples.

Whilst the spec technically allows this (it only says each of these comments may appear at most once per metric name), it can confuse scraping tools and does not seem right: we should only generate those special comments when the metric actually has a value.
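A minimal Python sketch of the proposed behaviour, assuming a hypothetical MetricFamily structure (this is not the plugin's actual Ruby code): the TYPE/HELP pair is written only for families that have at least one sample, and empty families are skipped entirely.

```python
from dataclasses import dataclass, field


@dataclass
class MetricFamily:
    """Hypothetical in-memory metric family (not the plugin's real data model)."""
    name: str
    type: str
    help: str
    # Each sample is a (labels dict, value) pair.
    samples: list = field(default_factory=list)


def format_labels(labels):
    """Render labels as {k="v",...}, escaping values as the text format requires."""
    if not labels:
        return ""
    parts = []
    for key, value in sorted(labels.items()):
        value = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        parts.append(f'{key}="{value}"')
    return "{" + ",".join(parts) + "}"


def render(families):
    """Emit the text exposition, skipping comments for families without samples."""
    lines = []
    for fam in families:
        if not fam.samples:
            # Proposed behaviour: no TYPE/HELP lines when there is nothing to report.
            continue
        lines.append(f"# TYPE {fam.name} {fam.type}")
        lines.append(f"# HELP {fam.name} {fam.help}")
        for labels, value in fam.samples:
            lines.append(f"{fam.name}{format_labels(labels)} {float(value)}")
    return "\n".join(lines) + "\n" if lines else ""
```

With this approach, an unused buffer gauge such as fluentd_output_status_buffer_total_bytes would simply be absent from the scrape until the first value is recorded.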

An example using this config:

<source>
  @type forward
  port 5000
</source>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 10
  <labels>
    hostname ${hostname}
  </labels>
</source>

<match **>
  @type null
</match>

When run, this produces HELP and TYPE comments with no samples for several metrics, which some scrapers do not accept:

$ docker run --user 0 --rm -it -p 24231:24231 -v $PWD:/fluentd/etc ghcr.io/calyptia/fluentd:edge-debian sh -c "fluent-gem install fluent-plugin-prometheus; su fluentd; tini -- /bin/entrypoint.sh -c /fluentd/etc/fluentd.conf"
...
$ curl localhost:24231/metrics
# TYPE fluentd_output_status_buffer_total_bytes gauge
# HELP fluentd_output_status_buffer_total_bytes Current total size of stage and queue buffers.
# TYPE fluentd_output_status_buffer_stage_length gauge
# HELP fluentd_output_status_buffer_stage_length Current length of stage buffers.
# TYPE fluentd_output_status_buffer_stage_byte_size gauge
# HELP fluentd_output_status_buffer_stage_byte_size Current total size of stage buffers.
# TYPE fluentd_output_status_buffer_queue_length gauge
# HELP fluentd_output_status_buffer_queue_length Current length of queue buffers.
# TYPE fluentd_output_status_queue_byte_size gauge
# HELP fluentd_output_status_queue_byte_size Current total size of queue buffers.
# TYPE fluentd_output_status_buffer_available_space_ratio gauge
# HELP fluentd_output_status_buffer_available_space_ratio Ratio of available space in buffer.
# TYPE fluentd_output_status_buffer_newest_timekey gauge
# HELP fluentd_output_status_buffer_newest_timekey Newest timekey in buffer.
# TYPE fluentd_output_status_buffer_oldest_timekey gauge
# HELP fluentd_output_status_buffer_oldest_timekey Oldest timekey in buffer.
# TYPE fluentd_output_status_retry_count gauge
# HELP fluentd_output_status_retry_count Current retry counts.
fluentd_output_status_retry_count{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
# TYPE fluentd_output_status_num_errors gauge
# HELP fluentd_output_status_num_errors Current number of errors.
fluentd_output_status_num_errors{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
# TYPE fluentd_output_status_emit_count gauge
# HELP fluentd_output_status_emit_count Current emit counts.
fluentd_output_status_emit_count{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 3.0
# TYPE fluentd_output_status_emit_records gauge
# HELP fluentd_output_status_emit_records Current emit records.
fluentd_output_status_emit_records{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 3.0
# TYPE fluentd_output_status_write_count gauge
# HELP fluentd_output_status_write_count Current write counts.
fluentd_output_status_write_count{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
# TYPE fluentd_output_status_rollback_count gauge
# HELP fluentd_output_status_rollback_count Current rollback counts.
fluentd_output_status_rollback_count{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
# TYPE fluentd_output_status_flush_time_count gauge
# HELP fluentd_output_status_flush_time_count Total flush time.
fluentd_output_status_flush_time_count{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
# TYPE fluentd_output_status_slow_flush_count gauge
# HELP fluentd_output_status_slow_flush_count Current slow flush counts.
fluentd_output_status_slow_flush_count{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
# TYPE fluentd_output_status_retry_wait gauge
# HELP fluentd_output_status_retry_wait Current retry wait
fluentd_output_status_retry_wait{hostname="df5808e5adb2",plugin_id="object:834",type="null"} 0.0
AntoineC44 commented 1 year ago

To me, having the comments is more user-friendly: they show that the metric exists but has not received any events yet. Which Prometheus parser are you using?

patrick-stephens commented 1 year ago

cmetrics, i.e. the one used by Fluent Bit. I believe it will be updated to resolve this, but it does mean that two tools in the same ecosystem do not currently work together.

These metrics never receive events: in my example they are never populated, and as a result I could not scrape any metrics from Fluentd until I disabled them completely. That is the main reason I raised the issue. If this were just a transient failure resolved on the next scrape, fine, but my concern is that other scrapers could fail in the same way.

AntoineC44 commented 1 year ago

Thanks for the explanation. I don't know whether this cmetrics parser behavior is the norm or the exception; if it is the exception, maybe opening an issue there to get it fixed would be better?
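For comparison, a parser can tolerate these comments by treating a family declared with TYPE/HELP but no samples as an empty family rather than an error. The sketch below is a simplified, hypothetical parse_families (not the cmetrics implementation): it assumes well-formed input, ignores timestamps, and does not group histogram or summary suffixes.

```python
def parse_families(text):
    """Group a text exposition into {name: {"type", "help", "samples"}}.

    Families that only have # TYPE / # HELP lines are kept with an empty
    sample list instead of being rejected.
    """
    families = {}

    def family(name):
        return families.setdefault(name, {"type": "untyped", "help": "", "samples": []})

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(maxsplit=3)
            if len(parts) >= 3 and parts[1] == "TYPE":
                family(parts[2])["type"] = parts[3] if len(parts) > 3 else "untyped"
            elif len(parts) >= 3 and parts[1] == "HELP":
                family(parts[2])["help"] = parts[3] if len(parts) > 3 else ""
            continue  # any other comment is ignored

        # Sample line: name, optional {labels}, then the value (timestamp ignored).
        if "{" in line:
            start, end = line.index("{"), line.rindex("}")
            name, labels, rest = line[:start], line[start:end + 1], line[end + 1:]
        else:
            name, _, rest = line.partition(" ")
            labels = ""
        family(name)["samples"].append((labels, float(rest.split()[0])))
    return families
```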