fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

For some reason, fluent-bit stays like a "sleep". Stop process log through the tail plugin in Kubernetes. #5799

Closed reverson-farfetch closed 1 year ago

reverson-farfetch commented 2 years ago

Bug Report

Describe the bug (Sorry in advance for the long post.) For some reason, fluent-bit goes to "sleep" and stops processing logs through the tail plugin in Kubernetes. I noticed this when the files being tailed by the plugin were deleted (by rotation or when the pod was deleted): fluent-bit keeps running, I can see metrics, and the health check responds, but no logs are being processed.
Looking at the metrics fluentbit_output_proc_bytes_total and fluentbit_input_bytes_total, both show a value of 0, and fluent-bit also stops emitting its own log messages.
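(For anyone checking for the same stall, here is a rough way to confirm it from inside the pod. This is only a sketch: it assumes fluent-bit's built-in HTTP server is enabled on the default port 2020, which matches the EXPOSE 2020 in the Dockerfile below; the host/port and the 60-second window are placeholders.)

# Sample the input byte counter twice and compare; no movement over the window
# means nothing is being ingested.
before=$(curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus \
  | awk '/^fluentbit_input_bytes_total/ {sum += $2} END {print sum+0}')
sleep 60
after=$(curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus \
  | awk '/^fluentbit_input_bytes_total/ {sum += $2} END {print sum+0}')
[ "$before" = "$after" ] && echo "fluent-bit ingested 0 bytes in the last 60s - possibly stalled"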
Two weird things stood out, detailed below: fluent-bit kept file descriptors open for files that had already been deleted, and processing resumed as soon as I attached strace to the process.

First, before running strace, here are the last log lines emitted by the fluent-bit pod that had stopped processing:

[2022/07/26 22:53:39] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517324 watch_fd=140
[2022/07/26 22:54:00] [ info] [input:tail:tail.5] inotify_fs_add(): inode=517326 watch_fd=143 name=/var/log/containers/collector-vssvf_stackrox_compliance-4b351ee14fc1331520054006717087279350236cb80521ab7c0f1144fc5539cf.log
[2022/07/26 22:59:33] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2022/07/26 23:02:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517325 watch_fd=142
[2022/07/26 23:03:30] [ info] [input:tail:tail.5] inotify_fs_add(): inode=517328 watch_fd=144 name=/var/log/containers/collector-vssvf_stackrox_compliance-fa99f0772048faa1b4df45c80c502acc29fbce55f6c8ae6f4f4bc532296e939c.log
[2022/07/26 23:09:51] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2022/07/26 23:20:30] [ info] [filter:kubernetes:kubernetes.0]  token updated

So, I created an ephemeral container in this pod and checked the file descriptors of the fluent-bit process. All of the files listed below had been rotated or belonged to pods that were deleted:

PID   USER     TIME  COMMAND
    1 65535     0:00 /pause
    7 root     19:32 /fluent-bit/bin/fluent-bit /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
   57 root      0:00 /bin/bash
   63 root      0:00 bash
   69 root      0:00 ps
bash-5.1# ls -l /proc/7/fd | grep delete
lr-x------    1 root     root            64 Jul 25 15:49 57 -> /var/log/pods/<omitted-pod_name>-7bfb8555d6-qnk8q_04ba1b19-a5c7-4018-8363-422d23c9c395/<omitted-container_name>/204.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 58 -> /var/log/pods/stackrox_collector-vssvf_832b5829-9afa-4b11-8843-f8ca5107b43e/compliance/75.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 62 -> /var/log/pods/kube-system_csi-azuredisk-node-xzfb2_cb5f8f38-348c-4ec8-9ec4-ded1fa515d25/azuredisk/0.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 63 -> /var/log/pods/kube-system_csi-azuredisk-node-xzfb2_cb5f8f38-348c-4ec8-9ec4-ded1fa515d25/liveness-probe/0.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 64 -> /var/log/pods/kube-system_csi-azuredisk-node-xzfb2_cb5f8f38-348c-4ec8-9ec4-ded1fa515d25/node-driver-registrar/0.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 84 -> /var/log/pods/<omitted-pod_name>-7bfb8555d6-qnk8q_04ba1b19-a5c7-4018-8363-422d23c9c395/omitted-container_name/203.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 86 -> /var/log/pods/kube-system_konnectivity-agent-784d99587d-wcc5b_d3e0a095-99c8-471c-87f0-fc5c4dd15b7c/konnectivity-agent/0.log (deleted)
lr-x------    1 root     root            64 Jul 25 15:49 90 -> /var/log/pods/stackrox_collector-vssvf_832b5829-9afa-4b11-8843-f8ca5107b43e/compliance/74.log (deleted)
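(For reference, the ephemeral container can be attached roughly like this; this is a sketch with placeholder pod name, namespace, image and target container name, not the exact command I used:)

# Attach an ephemeral debug container that shares the fluent-bit container's
# process namespace. <fluent-bit-pod>, <namespace>, the image and the
# --target container name are placeholders.
kubectl debug -it <fluent-bit-pod> -n <namespace> --image=busybox:1.36 --target=fluent-bit -- sh

# Then, inside the debug container, list file descriptors still open on deleted files:
ls -l /proc/7/fd | grep deleted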

After that I ran the strace command; here are the first lines of its output:

strace: Process 7 attached with 45 threads
[pid    56] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    55] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    54] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    53] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    52] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
<suppress some lines>
[pid    20] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    19] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    18] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    17] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    16] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    15] epoll_wait(111,  <unfinished ...>
[pid    14] epoll_wait(11,  <unfinished ...>
[pid    13] read(230,  <unfinished ...>
[pid     7] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid    13] <... read resumed>0x7f4028685003, 5) = -1 EAGAIN (Resource temporarily unavailable)
[pid    13] epoll_wait(8, [{events=EPOLLIN, data={u32=707716608, u64=139913562351104}}, {events=EPOLLIN, data={u32=707715840, u64=139913562350336}}, {events=EPOLLIN, data={u32=688074864, u64=139913542709360}}, {events=EPOLLIN, data={u32=688070656, u64=139913542705152}}, {events=EPOLLIN, data={u32=688074656, u64=139913542709152}}, {events=EPOLLIN, data={u32=688073104, u64=139913542707600}}, {events=EPOLLIN, data={u32=688075024, u64=139913542709520}}, {events=EPOLLIN, data={u32=688076784, u64=139913542711280}}, {events=EPOLLIN, data={u32=678891648, u64=139913533526144}}, {events=EPOLLIN, data={u32=706946976, u64=139913561581472}}, {events=EPOLLIN, data={u32=678665904, u64=139913533300400}}, {events=EPOLLIN, data={u32=688071344, u64=139913542705840}}, {events=EPOLLIN, data={u32=688070864, u64=139913542705360}}, {events=EPOLLIN, data={u32=688071184, u64=139913542705680}}, {events=EPOLLIN, data={u32=688071024, u64=139913542705520}}, {events=EPOLLIN, data={u32=688076464, u64=139913542710960}}, {events=EPOLLIN, data={u32=688073264, u64=139913542707760}}, {events=EPOLLIN, data={u32=688076144, u64=139913542710640}}, {events=EPOLLIN, data={u32=688075504, u64=139913542710000}}, {events=EPOLLIN, data={u32=688072944, u64=139913542707440}}, {events=EPOLLIN, data={u32=688077104, u64=139913542711600}}, {events=EPOLLIN, data={u32=688076944, u64=139913542711440}}], 256, -1) = 22

Full strace output: strace.log
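(The exact strace invocation isn't shown above; attaching to the running process would look roughly like this, with flags that are illustrative rather than the exact ones used:)

# Attach strace to the fluent-bit process (PID 7), follow all of its threads,
# and write the trace to a file for later inspection.
strace -f -p 7 -o /tmp/strace.log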

After strace was attached, the logs started to be processed again. Here is the fluent-bit log right after strace was executed; we can see that the pending link rotations were handled and the files started processing:

[2022/07/26 23:09:51] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2022/07/26 23:20:30] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2022/07/27 09:32:51] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517254 link_rotated: /var/log/containers/csi-azuredisk-node-xzfb2_kube-system_azuredisk-dab513fd87f9702bb1b8cdad2c85586ea3d683275436ac1891065d24ea6050d4.log
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517247 handle rotation(): /var/log/containers/csi-azuredisk-node-xzfb2_kube-system_azuredisk-dab513fd87f9702bb1b8cdad2c85586ea3d683275436ac1891065d24ea6050d4.log => /var/log/pods/kube-system_csi-azuredisk-node-xzfb2_cb5f8f38-348c-4ec8-9ec4-ded1fa515d25/azuredisk/0.log (deleted)
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517037 link_rotated: /var/log/containers/csi-azuredisk-node-xzfb2_kube-system_liveness-probe-8fd0cfb41784579987e665f9d11dac0b17f7b57c696b7e5e685248f1735d35d8.log
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517038 handle rotation(): /var/log/containers/csi-azuredisk-node-xzfb2_kube-system_liveness-probe-8fd0cfb41784579987e665f9d11dac0b17f7b57c696b7e5e685248f1735d35d8.log => /var/log/pods/kube-system_csi-azuredisk-node-xzfb2_cb5f8f38-348c-4ec8-9ec4-ded1fa515d25/liveness-probe/0.log (deleted)
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517157 link_rotated: /var/log/containers/csi-azuredisk-node-xzfb2_kube-system_node-driver-registrar-ea8140cf3e5505668afb091dffb903299d5b544621e8563ac8ea811a902b31f9.log
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517163 handle rotation(): /var/log/containers/csi-azuredisk-node-xzfb2_kube-system_node-driver-registrar-ea8140cf3e5505668afb091dffb903299d5b544621e8563ac8ea811a902b31f9.log => /var/log/pods/kube-system_csi-azuredisk-node-xzfb2_cb5f8f38-348c-4ec8-9ec4-ded1fa515d25/node-driver-registrar/0.log (deleted)
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=530734 link_rotated: /var/log/containers/konnectivity-agent-784d99587d-wcc5b_kube-system_konnectivity-agent-154838022a48067296706945471789acbfe8ccbd16012bfd99e64a38de32cc8c.log
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=516741 handle rotation(): /var/log/containers/konnectivity-agent-784d99587d-wcc5b_kube-system_konnectivity-agent-154838022a48067296706945471789acbfe8ccbd16012bfd99e64a38de32cc8c.log => /var/log/pods/kube-system_konnectivity-agent-784d99587d-wcc5b_d3e0a095-99c8-471c-87f0-fc5c4dd15b7c/konnectivity-agent/0.log (deleted)
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=6193229 handle rotation(): /var/log/containers/infrastructure-argoproj-argocd-argocd-notifications-759b95xdpzk_argocd-system_argocd-notifications-24cd275c60792ea4d82138efc9ac542da7854377fe5817b24a8f0d8025c97ab7.log => /var/log/pods/argocd-system_infrastructure-argoproj-argocd-argocd-notifications-759b95xdpzk_5e9de73b-f0f1-4d77-a030-1e288adda833/argocd-notifications/0.log.20220727-072649
<suppress some lines>
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inode=517328 handle rotation(): /var/log/containers/collector-vssvf_stackrox_compliance-fa99f0772048faa1b4df45c80c502acc29fbce55f6c8ae6f4f4bc532296e939c.log => /var/log/pods/stackrox_collector-vssvf_832b5829-9afa-4b11-8843-f8ca5107b43e/compliance/75.log (deleted)
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517247 watch_fd=10
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517038 watch_fd=11
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517163 watch_fd=12
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=516741 watch_fd=39
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=5942270 watch_fd=139
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=5942271 watch_fd=141
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517326 watch_fd=143
[2022/07/27 09:32:52] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=517328 watch_fd=144
[2022/07/27 09:32:53] [ info] [input:tail:tail.6] inode=1111 handle rotation(): /var/log/syslog => /var/log/syslog.1
[2022/07/27 09:32:53] [ info] [input:tail:tail.6] inotify_fs_remove(): inode=1111 watch_fd=3
[2022/07/27 09:32:53] [ info] [input:tail:tail.6] inotify_fs_add(): inode=1111 watch_fd=4 name=/var/log/syslog.1
[2022/07/27 09:32:53] [ info] [input:tail:tail.5] inotify_fs_add(): inode=6210017 watch_fd=145 name=/var/log/containers/fluentbit-infrastructure-logging-fluentbit-sxvb8_logging_debugger-f6633-f026444b431b26ca6d3b55d9c46426e89928e85eeddf6b0eec5ac2f2e0263edb.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=516984 watch_fd=146 name=/var/log/containers/collector-vssvf_stackrox_compliance-ab4b188c158682fd3df24a5675a0f5ce2f3a2d12f4eb00e540778313122b2538.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=516987 watch_fd=147 name=/var/log/containers/collector-vssvf_stackrox_compliance-d5d9c7ab36071696bea366f259921fa7999ecfb28fd4d922fe96933278a35be6.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=6197597 watch_fd=148 name=/var/log/containers/csi-azuredisk-node-z655c_kube-system_azuredisk-93ba5873741ab6efb38e54064d1c71ccdd5d8baad11daefcf5b4d57090127663.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=6195639 watch_fd=149 name=/var/log/containers/csi-azuredisk-node-z655c_kube-system_liveness-probe-0eeafe6ed198595bb3e2012f7671f5fd58605b23b4f8c9a47524ef430dc4dc9c.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=6195660 watch_fd=150 name=/var/log/containers/csi-azuredisk-node-z655c_kube-system_node-driver-registrar-0ad508225e23ed8e483bf05fde6403c1dc845f3cdafd3ae386a98b9031b77a48.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=5943657 watch_fd=151 name=/var/log/containers/<omitted-pod_name>-7bfb8555d6-qnk8q_<omitted-pod_name>-57ca2ea37ae9c9635672f1aa650cd311cde2dfcf8c3ed7c7c93a57dc510fb51b.log
[2022/07/27 09:32:54] [ info] [input:tail:tail.5] inotify_fs_add(): inode=5943669 watch_fd=152 name=/var/log/containers/<omitted-pod_name>-7bfb8555d6-qnk8q_<omitted-pod_name>-b5874e2d02a0a9219e21f8d5e6192afbd5174d5269306f288339de5ce9ca150b.log
[2022/07/27 09:32:59] [ info] [input:tail:tail.6] inotify_fs_add(): inode=7313 watch_fd=5 name=/var/log/syslog
[2022/07/27 09:33:00] [ info] [input:tail:tail.6] inotify_fs_remove(): inode=1111 watch_fd=4
[2022/07/27 09:33:00] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=6193229 watch_fd=135
[2022/07/27 09:33:23] [ info] [input:tail:tail.5] inotify_fs_add(): inode=6193225 watch_fd=153 name=/var/log/containers/infrastructure-argoproj-argocd-argocd-notifications-759b95xdpzk_argocd-system_argocd-notifications-24cd275c60792ea4d82138efc9ac542da7854377fe5817b24a8f0d8025c97ab7.log

Full log file: fluenti-bit.log. When I check the metrics now, they show that logs are being processed and I can see them in my centralized log server. It looks like fluent-bit stops processing when a log file is deleted, and once the process resumes it starts processing again.

Any idea what it could be? Or what I can do to fix it?

To Reproduce Unfortunately, I couldn't reproduce it locally or in my dev environment; it only happens in the prod environment. I made sure that dev has the same configuration as prod, but I still couldn't reproduce it.

Expected behavior Even when log files are deleted, fluent-bit should continue processing logs. Or, if that is not possible, the liveness probe should detect the condition and the container should be restarted.
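(As a sketch of the kind of probe meant here, assuming HTTP_Server On and Health_Check On in the [SERVICE] section with the server on its default port 2020; the host/port are illustrative:)

# Illustrative liveness command (e.g. used as an exec probe).
# /api/v1/health starts returning an error status once the configured
# error/retry thresholds (HC_Errors_Count, HC_Retry_Failure_Count within
# HC_Period) are exceeded.
curl -sf http://127.0.0.1:2020/api/v1/health || exit 1

Note that this built-in health check keys off error and retry counts, so it would not necessarily catch a silent stall like the one described here.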

Screenshots (screenshot attached). The metrics are in UTC and the logs are in UTC+1.

Your Environment

Additional context To comply with some security directives, we rebuild the image to upgrade packages with CVEs. Here is the Dockerfile (build args: DEBIAN_DISTROLESS: distroless/cc-debian11:latest, FROM_RUNTIME: fluent/fluent-bit:1.9.5, FROM_INSTALLER: debian:bullseye-slim):

ARG FROM_INSTALLER
ARG FROM_RUNTIME

ARG FROM_TAG
ARG DEBIAN_DISTROLESS

FROM ${FROM_INSTALLER} AS dependencies

USER root
WORKDIR /tmp

COPY setup_artifactory_mirror.sh .
RUN ./setup_artifactory_mirror.sh

<suppress some lines>

EXPOSE 2020
ENTRYPOINT [ "/fluent-bit/bin/fluent-bit" ]
CMD ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf"]

Full Dockerfile: Dockerfile.txt

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.

woodliu commented 1 year ago

We have the same issue in version 1.9.5.

mktgbnk commented 1 year ago

@reverson-farfetch @woodliu were you able to figure out the fix? We experience the same issue in 1.9.5

woodliu commented 1 year ago

@reverson-farfetch @woodliu were you able to figure out the fix? We experience the same issue in 1.9.5

Just update to 1.9.6
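(Applied to the custom image from the report above, that means rebuilding against the newer base, roughly like this; the registry/tag and build command are illustrative, while the build args match the ones shown with the Dockerfile:)

# Rebuild the custom image on top of fluent-bit 1.9.6.
docker build \
  --build-arg FROM_RUNTIME=fluent/fluent-bit:1.9.6 \
  --build-arg FROM_INSTALLER=debian:bullseye-slim \
  --build-arg DEBIAN_DISTROLESS=distroless/cc-debian11:latest \
  -t <registry>/fluent-bit:1.9.6 .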