fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.88k stars 1.59k forks

When the ingestion endpoint is not reachable, the health endpoint should return a 5xx HTTP error #9492

Open taitelman opened 1 month ago

taitelman commented 1 month ago
$ kubectl version
Client Version: v1.30.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5+IKS
Fluent Bit v3.1.4-ibm
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io/

______ _                  _    ______ _ _           _____  __
|  ___| |                | |   | ___ (_) |         |____ |/  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __   / /`| |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \ | |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /.___/ /_| |_
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)___/

Registering the logger-agent-plugin CommitSHA: e3e664b3cde6cd9f120036d6767cd0717f546b12
Registering the logger-icl-output-plugin with commitSHA: c257a37dc8119d8906e1be191998ed8d4a4beb3c
[2024/10/14 13:52:52] [ info] [fluent bit] version=3.1.4-ibm, commit=, pid=1
[2024/10/14 13:52:52] [ info] [storage] ver=1.5.2, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=192
[2024/10/14 13:52:52] [ info] [storage] backlog input plugin: storage_backlog.1
[2024/10/14 13:52:52] [ info] [cmetrics] version=0.9.1
[2024/10/14 13:52:52] [ info] [ctraces ] version=0.5.2

I have the Fluent Bit DaemonSet running in my Kubernetes cluster. I can open a shell in the logging pod and see that the process is running:

bash-5.1$ ps -Af
UID         PID   PPID  C STIME TTY          TIME CMD
10000         1      0  1 Oct14 ?        00:28:54 /fluent-bit/bin/fluent-bit --config=/fluent-bit/etc/fluent-bit.conf
10000        33      0  0 14:34 pts/0    00:00:00 /bin/bash
10000        44     33  0 14:34 pts/0    00:00:00 ps -Af

K8s pod config:

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP

However, when the configuration is bad or a firewall blocks the ingestion endpoint, the readiness probe fails. If I open a shell in the pod:

bash-5.1$ curl localhost:8081/api/v1/health
curl: (7) Failed to connect to localhost port 8081: Connection refused

This is a misleading response.

If the process is up, the health endpoint should return 500 or similar, not "Connection refused". It would also help to add an HTTP reason header or a log line describing the true nature of the configuration issue.

"Connection refused" should be reserved for severe cases, such as the process failing to start because of a null pointer exception or crashing due to OOM.
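
To make the distinction in this request concrete, here is a minimal sketch of a probe that separates "nothing is listening" from "the server answered with an error status". It is not part of Fluent Bit. The function name `classify_probe` and the four result labels are hypothetical, and the default path is simply the one used in the probe config above.

```python
import http.client


def classify_probe(host: str, port: int, path: str = "/api/v1/health",
                   timeout: float = 1.0) -> str:
    """Classify one HTTP GET as 'healthy', 'unhealthy', 'refused' or 'unreachable'."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except ConnectionRefusedError:
        # No listener on this port: the process is down, or its HTTP server
        # is disabled or bound to a different port.
        return "refused"
    except (OSError, http.client.HTTPException):
        # Timeouts, resets, DNS failures, or a malformed HTTP response.
        return "unreachable"
    finally:
        conn.close()
    # Any 2xx counts as healthy here; everything else (including 500) does not.
    return "healthy" if 200 <= status < 300 else "unhealthy"
```

Calling `classify_probe("localhost", 8081)` from inside the pod would return `"refused"` for the situation shown in the curl output above. A server that is up but reports a failing output would show up as `"unhealthy"` instead, which is the behavior this issue asks for.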

patrick-stephens commented 1 month ago

Please follow the template and provide all the relevant details required, including config, version, environment, etc.

I presume you're using this? https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit

taitelman commented 1 month ago

Based on the Fluent Bit documentation, the health endpoint should behave as follows: "The health endpoint returns an HTTP status 500 and an error message. Otherwise, the endpoint returns HTTP status 200 and an ok message."
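
The documented contract quoted above can be sketched as a small decision function. This is an illustration, not Fluent Bit's implementation. The field names mirror the `HC_Errors_Count` and `HC_Retry_Failure_Count` settings from the monitoring documentation, while the default values, the strict greater-than comparison, and the literal `"ok"` / `"error"` bodies are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthThresholds:
    # Named after the documented HC_Errors_Count and HC_Retry_Failure_Count
    # settings; the defaults below are illustrative assumptions.
    errors_count: int = 5
    retry_failure_count: int = 5


def health_response(errors: int, retry_failures: int,
                    limits: HealthThresholds) -> tuple[int, str]:
    """Map the counters seen in one check period to an (HTTP status, body) pair."""
    # Assumption: either counter exceeding its threshold marks the agent unhealthy.
    if errors > limits.errors_count or retry_failures > limits.retry_failure_count:
        return 500, "error"
    return 200, "ok"
```

Under these assumptions, `health_response(0, 12, HealthThresholds())` yields `(500, "error")`. That is the kind of answer a blocked ingestion endpoint would be expected to produce once retries start failing, provided the HTTP server is actually listening on the probed port.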

taitelman commented 1 month ago

DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    version: 1.3.1
  creationTimestamp: "2024-03-17T11:24:21Z"
  generation: 57
  labels:
    app: logger-agent-ds
    version: 1.3.1
  name: logger-agent-ds
  namespace: ibm-observe
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: logger-agent-ds
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2024-09-12T11:13:58Z"
      creationTimestamp: null
      labels:
        app: logger-agent-ds
        name: logger-agent-ds
        version: 1.3.1
    spec:
      containers:
      - args:
        - --config=/fluent-bit/etc/fluent-bit.conf
        command:
        - /fluent-bit/bin/fluent-bit
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: observe/logs-router-agent:1.3.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 701m
            ephemeral-storage: 10Gi
            memory: 3Gi
          requests:
            cpu: 100m
            ephemeral-storage: 2Gi
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - DAC_READ_SEARCH
          privileged: false
          runAsGroup: 10000
          runAsUser: 10000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: vault-token
        - mountPath: /var/log
          name: varlog
          readOnly: true
        - mountPath: /var/data
          name: vardata
          readOnly: true
        - mountPath: /var/log/fluent-bit
          name: varlogfluentbit
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: logger-agent-config
        - mountPath: /fluent-bit/cache
          name: fluent-bit-cache
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: all-icr-io
      initContainers:
      - command:
        - scripts/make_db_dir.sh
        image: observe/logs-router-agent-init:1.3.1
        imagePullPolicy: Always
        name: create-db-dir
        resources: {}
        securityContext:
          privileged: true
          runAsGroup: 0
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/data
          name: vardata
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: logger-agent-sa
      serviceAccountName: logger-agent-sa
      terminationGracePeriodSeconds: 10
      tolerations:
      - operator: Exists
      volumes:
      - name: vault-token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              audience: iam
              expirationSeconds: 7200
              path: vault-token
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/data
          type: ""
        name: vardata
      - hostPath:
          path: /var/log/fluent-bit
          type: ""
        name: varlogfluentbit
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: logger-agent-config
        name: logger-agent-config
      - emptyDir:
          sizeLimit: 11Gi
        name: fluent-bit-cache
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate