fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Kubernetes container logs missing after update to the 3.1.6 docker image #9284

Open diresqrl opened 2 months ago

diresqrl commented 2 months ago

Bug Report

Describe the bug
After updating the docker image used for fluent-bit in our k8s clusters to 3.1.6, with no other changes to the configuration, I noticed that our tail input for k8s container logs stopped functioning. We have a second tail input in our deployed config that looks at other logs on the same mount/PVC, and it continues to work properly. Rolling back to 3.1.4 fixes the issue; both 3.1.5 and 3.1.6 result in the same problem.

Debug logs show that the log files for input_containers are being identified on the filesystem and watched, but log lines are never processed. Exported Prometheus metrics also show that, going from 3.1.4 to 3.1.6, the input total metric drops to 0 on that one tail input only, while the metrics for files closed and rotated continue to tick up as expected. Setting Inotify_Watcher to false has no impact.

To rule out potential environment-specific oddities with our k8s clusters, I was also able to reproduce this in a local k8s setup using kind. I have also systematically removed all extraneous inputs, filters, and outputs from our fluent-bit config to isolate the problem, as sketched below.
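For illustration, a minimal sketch of the kind of pared-down setup described above, with two tail inputs reading from the same mount; the paths, tags, and parser here are assumptions for illustration, not the actual deployed config:

    [INPUT]
        # container logs input that stops producing records on 3.1.5/3.1.6
        name     tail
        tag      kube.*
        path     /var/log/containers/*.log
        parser   cri

    [INPUT]
        # second tail input on the same mount/PVC; this one keeps working
        name     tail
        tag      hostlogs.*
        path     /var/log/other/*.log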

To Reproduce

Expected behavior
Container logs continue to be processed through fluent-bit without any configuration changes.

Your Environment

Additional context
This initially started as routine maintenance to get us onto the latest release series (we were previously on the 2.1 series). Updating to 3.1.4 "checks the box", but we're now in an unfortunate predicament for future updates. I was initially convinced this was an environment-specific issue, but now that I've been able to reproduce it consistently with kind, it felt appropriate to escalate in case others are facing the same problem.

edsiper commented 2 months ago

@diresqrl thanks for reporting the problem

I am looking at the changes to the components in v3.1.5, where it's not working for you, and we have this:

release v3.1.5

ff651842b in_tail: fix double-free on exception (CID 507963)

^ this is just a small fix for when a memory allocation fails; it should not be related.

other unrelated changes:

bec603400 log_event_decoder: updated code to use aligned memory reads
7f037486e core: added aligned memory read functions
f2f6b1d80 core: added a byte order detection abstraction macro
f9e6def57 build: added an option to enforce memory alignment
68931d121 in_exec_wasi: Provide configurable stack and heap sizes for Wasm
03423cfa0 filter_wasm: Provide configurable heap and stack sizes for Wasm
3045fd6be wasm: Make configurable heap and stack sizes with a struct type
f54b370cd out_stackdriver: fix leak on exception (CID 508239)
e84d9ff94 out_kafka: add missing initialization (CID 507783)
54de999c6 in_forward: fix leak on exception (CID 507786)
1fc6eedfd in_emitter: fix use-after-free on exception (CID 507860)
b26d85b13 in_forward: fix leak on exception (CID 508219)
f5794a417 out_prometheus_exporter: Handle multiply concatenated metrics type of events (#9122)
13ea609e5 cmake: windows: Enable Kafka plugins on Windows
a6980efcf appveyor: Use vcpkg to install the latest OpenSSL
725640616 workflows: add sanity check for compilation using system libraries to pr-compile-check.yaml
999e9b837 build: use the system provided LuaJIT if found
c683d8e3c out_opensearch: fixed wrong payload buffer usage for traces
30b6522b1 restore --staged tests/internal/aws_util.c
9392bc112 lib: ctraces: upgrade to v0.5.4
c147d452b lib: cmetrics: upgrade to v0.9.3
56ff251d3 in_mqtt: added buffer size setting and fixed a leak (#9163)
a86dceed2 lib: cfl: upgrade to v0.5.2
d58f2336b workflows: update unstable nightly builds for 3.0 (#9168)
e19b2ab14 out_oracle_log_analytics: set NULL to prevent double free
3a37eb8f6 out_oracle_log_analytics: fix mk_list cleanup function
7ce4aa6a0 out_oracle_log_analytics: add flb_sds_destroy for key
209095d69 out_oracle_log_analytics: remove flb_errno that checks NULL
c15ca2fba test: internal: gzip: Add testcases for payloads of concatenated gzip
a4956ccba in_forward: Use extracted function for processing concatenated gzip
87ea26d65 gzip: Extract and unify code for concatenated gzip payloads
a6aac459d in_node_exporter_metrics: Align the collecting metrics of unit statuses (#9134)
3d4ad3173 filter_log_to_metrics: add new option discard_logs and code cleanup
8f0317f77 workflows: Fix CentOS7 build failure for EPEL (#9157)
bc0768600 in_kubernetes_events: add chunked streaming test
2929a3d46 in_kubernetes_events: fix end of chunked stream deadlock
57c04a5e0 build: libraries: update path to c-ares
6a293f7e7 lib: c-ares: ugprade to v1.32.3
f1d99ca27 out_s3: Plug memory leaks on gzipped buffer during the swapping contents
6df1f2bf7 workflows: bump ossf/scorecard-action from 2.3.3 to 2.4.0 (#9137)

Hmm, I don't have a clue what the problem could be. Are you able to provide Fluent Bit logs?

reneeckstein commented 2 months ago

I can confirm we have the same problem after upgrading from fluent-bit 3.1.2 to 3.1.6. I can also confirm that the issue started in 3.1.5.

If I remove the log_to_metrics filter it works again.

    [FILTER]
        name               log_to_metrics
        match              kube.*
        tag                log_counter_metric
        metric_mode        counter
        metric_name        kubernetes_messages
        metric_description This metric counts Kubernetes messages
        kubernetes_mode    true

@diresqrl do you also use this filter?

Maybe it has something to do with this new feature: [Log_To_Metrics (Filter)] Add new option discard_logs, https://github.com/fluent/fluent-bit/pull/9150/files. I even tried to explicitly set discard_logs to false, but without success; see the sketch below.
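For reference, that attempt would look roughly like the block below: the same filter as above with the new option set explicitly, the expectation being that false keeps the logs flowing alongside the emitted metric:

    [FILTER]
        name               log_to_metrics
        match              kube.*
        tag                log_counter_metric
        metric_mode        counter
        metric_name        kubernetes_messages
        metric_description This metric counts Kubernetes messages
        kubernetes_mode    true
        # explicitly setting the new option had no effect on 3.1.5/3.1.6
        discard_logs       false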

diresqrl commented 2 months ago

@reneeckstein After reviewing the way our configuration is built, yes, we do have the log_to_metrics filter. I apparently missed that in my previous testing. I was also able to confirm that removing that filter makes log processing work again, both in the kind/local environment and in our live k8s clusters.

aligthart commented 2 months ago

We are experiencing the same issue. We were upgrading from 3.0.7 to 3.1.6 in the hope that it would resolve a small memory leak that caused our fluent-bit pods to crash over time. See this: https://github.com/fluent/fluent-bit/issues/9189

However, no logs were forwarded anymore from the tail input plugin. 3.1.4 does not have this forwarding issue. (We use the forward output plugin.) Note that we also use the log_to_metrics filter. A rough sketch of this pipeline shape is below.
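A minimal sketch of the pipeline shape described in this thread (tail input, log_to_metrics filter, forward output); the paths, host, and port are assumptions for illustration only:

    [INPUT]
        name    tail
        tag     kube.*
        path    /var/log/containers/*.log

    [FILTER]
        name               log_to_metrics
        match              kube.*
        tag                log_counter_metric
        metric_mode        counter
        metric_name        kubernetes_messages
        metric_description This metric counts Kubernetes messages
        kubernetes_mode    true

    [OUTPUT]
        # forward log records to an aggregator; host/port assumed
        name    forward
        match   kube.*
        host    log-aggregator.logging.svc
        port    24224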

reneeckstein commented 2 months ago

As far as I can see (on one cluster), the issue seems to be solved already in fluent-bit v3.1.7. Probably related to this one -> https://github.com/fluent/fluent-bit/pull/9252

That PR addresses an issue where an incorrect data type was used for the mode option, which caused the config map handler to overwrite adjacent fields (discard_logs at least) on 64-bit systems.

diresqrl commented 2 months ago

I can confirm that the 3.1.7 update does fix this problem for me, at least in immediate testing.

aligthart commented 2 months ago

same here. 3.1.7 forwards logs again.