Open taitelman opened 1 month ago
Please follow the template and provide all the relevant details required, including config, version, environment, etc.
I presume you're using this? https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
Based on the Fluent Bit documentation, the health endpoint should behave as follows:
The health endpoint returns an HTTP status 500 and an error message. Otherwise, the endpoint returns HTTP status 200 and an ok message.
DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    version: 1.3.1
  creationTimestamp: "2024-03-17T11:24:21Z"
  generation: 57
  labels:
    app: logger-agent-ds
    version: 1.3.1
  name: logger-agent-ds
  namespace: ibm-observe
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: logger-agent-ds
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2024-09-12T11:13:58Z"
      creationTimestamp: null
      labels:
        app: logger-agent-ds
        name: logger-agent-ds
        version: 1.3.1
    spec:
      containers:
      - args:
        - --config=/fluent-bit/etc/fluent-bit.conf
        command:
        - /fluent-bit/bin/fluent-bit
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: observe/logs-router-agent:1.3.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 701m
            ephemeral-storage: 10Gi
            memory: 3Gi
          requests:
            cpu: 100m
            ephemeral-storage: 2Gi
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - DAC_READ_SEARCH
          privileged: false
          runAsGroup: 10000
          runAsUser: 10000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: vault-token
        - mountPath: /var/log
          name: varlog
          readOnly: true
        - mountPath: /var/data
          name: vardata
          readOnly: true
        - mountPath: /var/log/fluent-bit
          name: varlogfluentbit
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: logger-agent-config
        - mountPath: /fluent-bit/cache
          name: fluent-bit-cache
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: all-icr-io
      initContainers:
      - command:
        - scripts/make_db_dir.sh
        image: observe/logs-router-agent-init:1.3.1
        imagePullPolicy: Always
        name: create-db-dir
        resources: {}
        securityContext:
          privileged: true
          runAsGroup: 0
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/data
          name: vardata
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: logger-agent-sa
      serviceAccountName: logger-agent-sa
      terminationGracePeriodSeconds: 10
      tolerations:
      - operator: Exists
      volumes:
      - name: vault-token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              audience: iam
              expirationSeconds: 7200
              path: vault-token
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/data
          type: ""
        name: vardata
      - hostPath:
          path: /var/log/fluent-bit
          type: ""
        name: varlogfluentbit
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: logger-agent-config
        name: logger-agent-config
      - emptyDir:
          sizeLimit: 11Gi
        name: fluent-bit-cache
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
I have the Fluent Bit DaemonSet running in my K8s cluster, and I can enter the logging pod and see:
K8s pod config:
However, the configuration is bad or a firewall blocks the ingestion endpoint, so the readiness probe fails. If I ssh into the pod and query the health endpoint, the response is misleading: if the process is up, the endpoint should return 500 or similar, not Connection refused.
Would it be possible to add an HTTP reason header or a log line describing the true nature of the configuration issue? Connection refused should be reserved for severe cases where the process fails to start because of a null pointer exception or crashes because of OOM.
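To make the distinction concrete, here is a minimal sketch of a client-side check that tells the three outcomes apart: HTTP 200, an HTTP error status such as 500, and no HTTP response at all (for example Connection refused). It is not part of Fluent Bit; the function name `classify_health` and the returned strings are hypothetical and chosen only for illustration.

```python
import urllib.error
import urllib.request


def classify_health(url: str, timeout: float = 1.0) -> str:
    """Classify the outcome of a GET request against a health endpoint.

    Hypothetical helper for illustration only, not a Fluent Bit API.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # The server answered with a success status.
            if resp.status == 200:
                return "healthy"
            return f"unexpected (HTTP {resp.status})"
    except urllib.error.HTTPError as exc:
        # The process is up and answered, but reported an error (e.g. 500).
        # HTTPError must be caught before URLError, which is its base class.
        return f"unhealthy (HTTP {exc.code})"
    except (urllib.error.URLError, OSError):
        # No HTTP response at all: connection refused, timeout, DNS failure.
        return "unreachable"
```

Run from inside the pod, `classify_health("http://127.0.0.1:8081/api/v1/health/")` would use the port and path from the probes in the DaemonSet above. With the behavior described in this issue it would return "unreachable", whereas the documentation quoted earlier suggests "unhealthy (HTTP 500)".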