fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Prometheus metrics "fluentbit_output_upstream_busy_connections" shows negative value during connection timed out #8868

Open ashishmodi7 opened 5 months ago

ashishmodi7 commented 5 months ago

Bug Report

Describe the bug: The Prometheus metric "fluentbit_output_upstream_busy_connections" shows a negative value when output connections time out.

To Reproduce: Steps to reproduce the problem:

  1. Deploy Fluent Bit in Kubernetes (https://docs.fluentbit.io/manual/installation/kubernetes#installing-with-helm-chart).
  2. Set up port forwarding to view the Prometheus metrics:
       export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}")
       kubectl --namespace default port-forward $POD_NAME 2020:2020
  3. Configure a Fluent Bit output to an Elasticsearch or Splunk server.
  4. When the Elasticsearch or Splunk server is unreachable, Fluent Bit reports a connection timed out error.
  5. Check the Prometheus metric "fluentbit_output_upstream_busy_connections"; it shows a negative value:
       curl -s http://127.0.0.1:2020/api/v2/metrics/prometheus | grep conn
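
The check in step 5 can be narrowed so that only the affected series are printed. The following is a minimal sketch, not part of Fluent Bit: the helper name `negative_samples` is hypothetical, and it assumes the endpoint returns the Prometheus text format with one sample per line, optionally followed by a timestamp.

```shell
# Hypothetical helper: print every sample of one metric whose value is negative.
# Reads Prometheus text exposition format on stdin.
# $1 (optional): metric name; defaults to the gauge discussed in this issue.
negative_samples() {
  metric="${1:-fluentbit_output_upstream_busy_connections}"
  awk -v metric="$metric" '
    index($0, metric) == 1 {       # line starts with the metric name
      line = $0
      sub(/[{].*[}]/, "", $0)      # drop the label set so the value is field 2
      if ($2 + 0 < 0) print line   # keep the original line for negative values
    }
  '
}

# Usage against the port-forwarded pod from step 2:
#   curl -s http://127.0.0.1:2020/api/v2/metrics/prometheus | negative_samples
```

Comment lines (`# HELP`, `# TYPE`) are skipped because they do not begin with the metric name. Empty output means no negative sample was found at the time of the request.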

Expected behavior: The Prometheus metric "fluentbit_output_upstream_busy_connections" should show 0 or a positive value.

Screenshots image

Your Environment

Additional context: Monitoring graphs are not showing correct values.

drbugfinder-work commented 5 months ago

Verified on my end. I can also see negative values here:

  fluentbit_output_upstream_busy_connections{name="forward"} -899

douglasawh commented 5 months ago

@drbugfinder-work Is there any additional information we can provide to help get this resolved?

drbugfinder-work commented 5 months ago

Just as a side note: this is where the calculation is done (without a mutex): https://github.com/fluent/fluent-bit/blob/8aee285464c30d1af03fdfbf1dcbdf784b5ace33/src/flb_upstream.c#L1157-L1214

Called here: https://github.com/fluent/fluent-bit/blob/8aee285464c30d1af03fdfbf1dcbdf784b5ace33/src/flb_upstream.c#L801-L807

(First guess: Is access to the metrics thread-safe? cc @leonardo-albertovich)

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

ashishmodi7 commented 2 months ago

Hello Team, any update on this issue?

RohitKhurana88 commented 2 weeks ago

Hello Team, may we get an update on this issue?