fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Prometheus metrics "fluentbit_output_upstream_busy_connections" shows negative value during connection timed out #8868

Open ashishmodi7 opened 1 month ago

ashishmodi7 commented 1 month ago

Bug Report

Describe the bug
The Prometheus metric "fluentbit_output_upstream_busy_connections" shows a negative value when connections to the output time out.

To Reproduce
Steps to reproduce the problem:

  1. Deploy Fluent Bit in Kubernetes (https://docs.fluentbit.io/manual/installation/kubernetes#installing-with-helm-chart)
  2. Set up port forwarding to view the Prometheus metrics, using the commands below:
       export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}")
       kubectl --namespace default port-forward $POD_NAME 2020:2020
  3. Configure a Fluent Bit output to an Elasticsearch or Splunk server.
  4. When the Elasticsearch or Splunk server is not reachable, Fluent Bit reports a "connection timed out" error.
  5. Check the Prometheus metrics; "fluentbit_output_upstream_busy_connections" shows a negative value:
       curl -s http://127.0.0.1:2020/api/v2/metrics/prometheus | grep conn

Expected behavior
The Prometheus metric "fluentbit_output_upstream_busy_connections" should show 0 or a positive value.
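For anyone who wants to check the expected behavior automatically rather than by eye, here is a minimal Python sketch that scans Prometheus text output for negative samples of one metric. The helper names (`fetch_metrics`, `negative_samples`) are made up for illustration, and the `es.0` sample line is invented test data; only the `forward` line and the URL come from this thread. It assumes the plain `name{labels} value` line layout of the Prometheus text format.

```python
import urllib.request

METRIC = "fluentbit_output_upstream_busy_connections"


def fetch_metrics(url="http://127.0.0.1:2020/api/v2/metrics/prometheus"):
    """Fetch the metrics text from the port-forwarded endpoint (not called below)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def negative_samples(exposition_text, metric_name):
    """Return (series, value) pairs of metric_name whose value is below zero."""
    found = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(metric_name):
            continue
        rest = line[len(metric_name):]
        # Reject longer metric names that merely share this prefix.
        if not rest or rest[0] not in "{ \t":
            continue
        if rest[0] == "{":
            end = line.rfind("}")
            if end == -1:
                continue  # malformed label set
            series, tail = line[:end + 1], line[end + 1:]
        else:
            series, tail = metric_name, rest
        fields = tail.split()  # value, then an optional timestamp
        if not fields:
            continue
        value = float(fields[0])
        if value < 0:
            found.append((series, value))
    return found


# The first sample is the line reported in this thread; the second is invented.
SAMPLE = (
    "# TYPE fluentbit_output_upstream_busy_connections gauge\n"
    'fluentbit_output_upstream_busy_connections{name="forward"} -899\n'
    'fluentbit_output_upstream_busy_connections{name="es.0"} 3\n'
)

for series, value in negative_samples(SAMPLE, METRIC):
    print(series, value)
```

Against a live pod, `negative_samples(fetch_metrics(), METRIC)` should return an empty list once the bug is fixed.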

Screenshots
[image]

Your Environment

Additional context
Monitoring graphs do not show correct values.

drbugfinder-work commented 1 month ago

Verified on my end. I can also see negative values here:

  fluentbit_output_upstream_busy_connections{name="forward"} -899

douglasawh commented 1 month ago

@drbugfinder-work Is there any additional information we can provide to help get this resolved?

drbugfinder-work commented 1 month ago

Just as a side note: this is where the calculation is done (without a mutex): https://github.com/fluent/fluent-bit/blob/8aee285464c30d1af03fdfbf1dcbdf784b5ace33/src/flb_upstream.c#L1157-L1214

Called here: https://github.com/fluent/fluent-bit/blob/8aee285464c30d1af03fdfbf1dcbdf784b5ace33/src/flb_upstream.c#L801-L807

(First guess: Is access to the metrics thread-safe? cc @leonardo-albertovich)
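
To make the first guess concrete, here is a small Python sketch of how an unsynchronized read-modify-write on a gauge could drift below zero, and how a lock avoids it. This is purely illustrative: it is not Fluent Bit's code (which is C), all names are hypothetical, and whether a race like this is the actual cause of the negative values has not been confirmed in this thread.

```python
import threading


def lost_update_demo():
    """Replay one unlucky interleaving of an unlocked gauge, step by step."""
    gauge = 0
    # Threads A and B each mark a connection busy; both read the old value 0.
    a_seen = gauge
    b_seen = gauge
    gauge = a_seen + 1  # A writes 1
    gauge = b_seen + 1  # B also writes 1, so one increment is lost
    # Both connections later time out and are released one after the other.
    gauge -= 1
    gauge -= 1
    return gauge


class LockedGauge:
    """Gauge whose updates are serialized by a mutex."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, delta):
        with self._lock:
            self._value += delta

    @property
    def value(self):
        with self._lock:
            return self._value


def run_locked(workers=4, rounds=10000):
    """Have several threads mark a connection busy and release it again."""
    gauge = LockedGauge()

    def worker():
        for _ in range(rounds):
            gauge.add(1)
            gauge.add(-1)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return gauge.value


print("unlocked interleaving ends at:", lost_update_demo())
print("locked gauge ends at:", run_locked())
```

The first function replays the interleaving by hand instead of relying on real thread timing, so the outcome is the same on every run. A single lost increment only shifts the gauge by one, so a value like -899 would need many such events or a different cause altogether, such as a decrement on a path that never incremented.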