Open qingling128 opened 5 years ago
Agree, it will be implemented.
We had the same issue when migrating to newer Fluent Bit versions: our Kubernetes metadata would completely go away. I've tried both 1.2.0 and 1.1.3, and both of these versions stop producing Kubernetes metadata.
So we had to roll back to version fluent/fluent-bit:1.0.6, which still produces the metadata :)
@devopsjonas I am interested in learning and discovering the "why". Would you please share your ConfigMap?
[SERVICE]
Flush 1
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
[INPUT]
Name tail
Tag kube-container.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
[INPUT]
Name tail
Tag kube-audit.*
Path /var/log/kube-api-server/audit
DB /var/log/flb_kube_audit.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
[INPUT]
Name systemd
Tag systemd.*
DB /var/log/flb_systemd.db
Path /run/log/journal
[INPUT]
Name tail
Tag ceph.*
Path /var/log/ceph/*.log
DB /var/log/ceph.db
[FILTER]
Name record_modifier
Match kube-container.*
Remove_key annotations
[FILTER]
Name kubernetes
Match kube-container.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
Merge_Log On
K8S-Logging.Parser On
K8S-Logging.Exclude true
[OUTPUT]
Name es
Match kube-container.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
HTTP_User ${FLUENT_ELASTICSEARCH_USER}
HTTP_Passwd ${FLUENT_ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix ${FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX}
Retry_Limit False
[OUTPUT]
Name es
Match kube-audit.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
HTTP_User ${FLUENT_ELASTICSEARCH_USER}
HTTP_Passwd ${FLUENT_ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix ${FLUENT_ELASTICSEARCH_AUDIT_LOGSTASH_PREFIX}
Retry_Limit False
Format json
[OUTPUT]
Name es
Match systemd.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
HTTP_User ${FLUENT_ELASTICSEARCH_USER}
HTTP_Passwd ${FLUENT_ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix ${FLUENT_ELASTICSEARCH_SYSTEMD_LOGSTASH_PREFIX}
Retry_Limit False
Format json
[OUTPUT]
Name es
Match ceph.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
HTTP_User ${FLUENT_ELASTICSEARCH_USER}
HTTP_Passwd ${FLUENT_ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix ${FLUENT_ELASTICSEARCH_CEPH_LOGSTASH_PREFIX}
Retry_Limit False
Format json
parsers.conf: |
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
# Command | Decoder | Field | Optional Action
# =============|==================|=================
Decode_Field_As escaped log
[PARSER]
Name syslog
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
In your Kubernetes filter you need to configure Kube_Tag_Prefix, e.g.:
Kube_Tag_Prefix kube-container.var.log.containers.
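Applied to the filter block shared above, that would look roughly like this (a sketch based on the posted ConfigMap; only the Kube_Tag_Prefix line is new):
[FILTER]
    # Strip the tag prefix added by the tail input so the filter can
    # recover the pod, namespace, and container names from the rest of the tag.
    Name                 kubernetes
    Match                kube-container.*
    Kube_URL             https://kubernetes.default.svc.cluster.local:443
    Kube_Tag_Prefix      kube-container.var.log.containers.
    Merge_Log            On
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  true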
Adding Kube_Tag_Prefix kube-container.var.log.containers. worked. Thank you, and sorry for the bother.
I found the release notes https://docs.fluentbit.io/manual/installation/upgrade_notes#fluent-bit-v-1-1 which explain this.
Somehow I completely missed it when I was looking at the changelog :/
Thanks, we are including a link to the upgrade notes in every changelog now.
@qingling128
You need to add Kube_Tag_Prefix to your ConfigMap (kubernetes filter):
Kube_Tag_Prefix k8s_container.var.log.containers.
ref: https://docs.fluentbit.io/manual/installation/upgrade_notes#fluent-bit-v-1-1
@edsiper We've already adjusted Kube_Tag_Prefix and are still seeing this issue. For reference, our Kube_Tag_Prefix is slightly different from the standard one. Below is the full config.
The issue we're seeing seems different from what devopsjonas@ saw. We do get metadata most of the time. The silent failures happen intermittently, and it auto-recovers. My hunch is that the Kubernetes master API might be unreachable intermittently. Instead of buffering the logs and retrying later, the current behavior of the plugin is to just skip attaching metadata for those logs. As a result, when customers query for these logs based on the metadata, the logs "appear" to be missing.
data:
# Configuration files for service, input, filter, and output plugins.
# ======================================================
fluent-bit.conf: |
[SERVICE]
# https://docs.fluentbit.io/manual/service
Flush 1
Log_Level warn
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
# https://docs.fluentbit.io/manual/configuration/buffering
storage.path /var/log/fluent-bit-buffers/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 10M
@INCLUDE input-containers.conf
@INCLUDE input-systemd.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-fluentd.conf
input-containers.conf: |
[INPUT]
# https://docs.fluentbit.io/manual/input/tail
Name tail
Tag_Regex var.log.containers.(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Tag k8s_container.<namespace_name>.<pod_name>.<container_name>
Path /var/log/containers/*.log
Parser docker
DB /var/log/fluent-bit-k8s-container.db
Buffer_Chunk_Size 512KB
Buffer_Max_Size 5M
Rotate_Wait 30
Mem_Buf_Limit 30MB
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem
input-systemd.conf: |
[INPUT]
# https://docs.fluentbit.io/manual/input/systemd
Name systemd
Tag container_runtime
Path /var/log/journal
DB /var/log/fluent-bit-k8s-node-journald-docker.db
Systemd_Filter _SYSTEMD_UNIT=docker.service
storage.type filesystem
[INPUT]
# https://docs.fluentbit.io/manual/input/systemd
Name systemd
Tag kubelet
Path /var/log/journal
DB /var/log/fluent-bit-k8s-node-journald-kubelet.db
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
storage.type filesystem
[INPUT]
# https://docs.fluentbit.io/manual/input/systemd
Name systemd
Tag node-journal
Path /var/log/journal
DB /var/log/fluent-bit-k8s-node-journald.db
storage.type filesystem
[FILTER]
# https://docs.fluentbit.io/manual/filter/grep
Name grep
Match node-journal
Exclude _SYSTEMD_UNIT docker\.service|kubelet\.service
# We have to use a filter per systemd tag, since we can't match on distinct
# strings, only by using wildcards with a prefix/suffix. We can't prefix
# these (e.g. with "k8s_node.") since we need to output the records with the
# distinct tags, and Fluent Bit doesn't yet support modifying the tag (only
# setting it in the input plugin). We can revise once the feature request
# (https://github.com/fluent/fluent-bit/issues/293) is fulfilled.
[FILTER]
# https://docs.fluentbit.io/manual/filter/record_modifier
Name record_modifier
Match container_runtime
Record logging.googleapis.com/local_resource_id k8s_node.${NODE_NAME}
[FILTER]
# https://docs.fluentbit.io/manual/filter/record_modifier
Name record_modifier
Match kubelet
Record logging.googleapis.com/local_resource_id k8s_node.${NODE_NAME}
[FILTER]
# https://docs.fluentbit.io/manual/filter/record_modifier
Name record_modifier
Match node-journal
Record logging.googleapis.com/local_resource_id k8s_node.${NODE_NAME}
filter-kubernetes.conf: |
[FILTER]
# https://docs.fluentbit.io/manual/filter/kubernetes
Name kubernetes
Match k8s_container.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
Kube_Tag_Prefix k8s_container.
Regex_Parser k8s-container-custom-tag
Annotations Off
output-fluentd.conf: |
[OUTPUT]
# https://docs.fluentbit.io/manual/input/forward
Name forward
Match *
Host stackdriver-log-aggregator-in-forward.kube-system.svc.cluster.local
Port 8989
Retry_Limit False
parsers.conf: |
[PARSER]
# https://docs.fluentbit.io/manual/parser/json
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep Off
# Command | Decoder | Field | Optional Action
# =============|==================|=================
Decode_Field_As json log do_next
Decode_Field_As escaped log
[PARSER]
Name k8s-container-custom-tag
Format regex
Regex ^(?<namespace_name>[^_.]+)\.(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*+)\.(?<container_name>[^.]+)$
Note that we've implemented a workaround to get rid of the errors we were seeing in https://github.com/fluent/fluent-bit/issues/1399#issue-459523045. Basically, our Fluentd configuration had a hard dependency on the metadata attached by the kubernetes filter plugin; we changed it to depend on the log file path instead, so that it's more reliable and does not spam the log.
We still experience missing metadata every now and then, but the impact is less concerning.
[Our new Fluentd config]
forward.input.conf: |-
<source>
# TODO: Secure this port. Bind to 127.0.0.1 if possible.
@type forward
port 8989
bind 0.0.0.0
</source>
<match k8s_container.**>
@type record_modifier
<record>
"logging.googleapis.com/local_resource_id" ${tag}
message ${record['log'].length > 100000 ? "[Trimmed]#{record['log'][0..100000]}..." : record['log']}
severity ${record['severity'] || if record['stream'] == 'stderr' then 'ERROR' else 'INFO' end}
_dummy_labels_ ${if record.is_a?(Hash) && record.has_key?('kubernetes') && record['kubernetes'].has_key?('labels') && record['kubernetes']['labels'].is_a?(Hash); then; record["logging.googleapis.com/labels"] = record['kubernetes']['labels'].map{ |k, v| ["k8s-pod/#{k}", v]}.to_h; end; nil}
_dummy_source_location_ ${if record.is_a?(Hash) && record.has_key?('source') && record['source'].include?(':'); then; source_parts = record['source'].split(':', 2); record['logging.googleapis.com/sourceLocation'] = {'file' => source_parts[0], 'line' => source_parts[1]} else; nil; end}
</record>
tag ${if record['stream'] == 'stderr' then 'stderr' else 'stdout' end}
remove_keys kubernetes,log,stream,_dummy_labels_,_dummy_source_location_
</match>
google-fluentd.conf: |-
@include config.d/*.conf
<system>
workers 10
root_dir /stackdriver-log-aggregator-persistent-volume
</system>
# Each worker binds to `port` + fluent_worker_id.
<source>
@type prometheus
port 24231
<labels>
worker_id ${worker_id}
</labels>
</source>
<source>
@type prometheus_monitor
<labels>
worker_id ${worker_id}
</labels>
</source>
<source>
@type prometheus_output_monitor
<labels>
worker_id ${worker_id}
</labels>
</source>
# Do not collect fluentd's own logs to avoid infinite loops.
<match fluent.**>
@type null
</match>
<match **>
@type google_cloud
@id google_cloud
# Try to detect JSON formatted log entries.
detect_json true
# Collect metrics in Prometheus registry about plugin activity.
enable_monitoring true
monitoring_type prometheus
# Allow log entries from multiple containers to be sent in the same
# request.
split_logs_by_tag false
<buffer>
# Set the buffer type to file to improve the reliability and reduce the
# memory consumption.
@type file
# The max size of each chunks: events will be written into chunks until
# the size of chunks become this size
# Set the chunk limit conservatively to avoid exceeding the recommended
# chunk size of 5MB per write request.
chunk_limit_size 512k
# Block processing of input plugin to emit events into that buffer.
overflow_action block
# The size limitation of this buffer plugin instance.
# In total 10 * 10 = 100GB.
total_limit_size 10GB
# Never wait more than 5 seconds before flushing logs in the non-error
# case.
flush_interval 5s
# Use multiple threads for flushing chunks.
flush_thread_count 10
# How output plugin behaves when its buffer queue is full
overflow_action drop_oldest_chunk
# This has to be false in order to let retry_timeout and retry_max_times
# options take effect.
retry_forever false
# Seconds to wait before next retry to flush.
retry_wait 5s
# The base number of exponential backoff for retries.
retry_exponential_backoff_base 5
# The maximum interval seconds for exponential backoff between retries
# while failing.
retry_max_interval 1h
# The maximum seconds to retry to flush while failing, until plugin
# discards buffer chunks.
retry_timeout 24h
# Wait seconds will become large exponentially per failures.
retry_type exponential_backoff
</buffer>
use_grpc true
project_id "{{.ProjectID}}"
k8s_cluster_name "{{.ClusterName}}"
k8s_cluster_location "{{.ClusterLocation}}"
adjust_invalid_timestamps false
# Metadata Server is not available in On-Prem world. Skip the check to
# avoid misleading errors in the log.
use_metadata_service false
</match>
Seems like it's still a problem for the following behavior though:
When the kubernetes filter plugin fails to query API to get metadata, it should either retry within the plugin, or error out so that the log entry can be put back in the queue and get re-processed again later. Right now, it seems to be failing silently. As a result, if customers filter by certain pod labels from kubernetes metadata, the logs that do not have the metadata would appear to be "missing".
To solve this problem, we need some retry logic.
Hi everyone! I have a similar problem with fluentbit:1.2.*
input-kubernetes.conf: |
[INPUT]
Name tail
Tag kube.<namespace_name>.
Tag_Regex ?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Path /var/log/containers/frontend-*.log
Parser docker
DB /var/log/fluentbit/flb_kube.db
DB.Sync Full
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 60
storage.type filesystem
filter-kubernetes.conf: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Merge_Log On
Merge_Log_Trim On
Regex_Parser kube-custom
tls.verify Off
output-myappname.conf: |
[OUTPUT]
Name kafka
Format json
Match kube.frontend-<myappname>.*
Brokers broker1:10091, broker2:10091, broker3:10091
Topics myappname_topic
Timestamp_Key timestamp_kafka
...
Another config for kafka
parsers.conf: |
[PARSER]
Name kube-custom
Format regex
Regex ^kube\.(?<tag_namespace>[a-z-]+)\.(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
After applying this configuration I don't see any metadata in my topic with the logs. Could someone tell me where I am mistaken, or is this a bug in Fluent Bit? Moreover, I set up routing between Kafka topics (a dedicated topic for each application, by adding one more output .conf) and it all works perfectly, but without metadata.
We are having exactly the same issue as @qingling128. Having some kind of retry logic would be awesome for us.
What would be the expected behavior if the API server is not reachable for a long period? Note that this will generate backpressure, since pod logs will still come in and we need to keep this data "somewhere", either in memory or on the file system; I see more complexity being added...
Ideas are welcome.
@edsiper,
Yeah, it's a pretty tricky problem. I think there are two potential areas for improvement.
1. Log warnings/errors. Failures are inevitable from time to time. The silence (i.e. no Fluent Bit logs indicating that something went wrong) is what caused confusion for customers and made it hard to debug, because today some customers rely on this metadata to query log entries; when metadata is absent, the logs appear to be missing to them. If there were Fluent Bit warning/error logs indicating that metadata was not attached successfully during a certain time period (e.g. a server-side outage), customers could at least proactively adjust their query filter for that time period.
2. Retries. Retrying forever would be problematic during a long server-side outage. One thought is to retry with backoff but cap the maximum number of retries (similar to https://docs.fluentd.org/configuration/buffer-section#retries-parameters).
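For reference, the bounded-retry model linked above maps to Fluentd buffer parameters like these (a minimal sketch of the parameter names from the Fluentd buffer docs, shown only to illustrate the semantics being proposed for the filter; the values are placeholders):
<buffer>
  # back off exponentially between attempts
  retry_type exponential_backoff
  retry_wait 5s
  retry_max_interval 1h
  # stop retrying after a bounded number of attempts or a total time window
  retry_max_times 10
  retry_timeout 24h
  # must be false for the limits above to take effect
  retry_forever false
</buffer>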
The failures we've seen so far that are worth retrying were mainly due to two reasons (the API should return different response codes for them):
1. Kubernetes master API downtime. The intermittent Kubernetes master API downtime we see today normally doesn't last long, yet master API downtime is expected to happen; during events like cluster upgrades it can be down for minutes. Customers who follow best practices can potentially reach zero master downtime. In that case, when one master goes away, reconnecting should always find an available master.
2. Newly created pods don't have metadata available yet. This is relatively rare compared to the other cause. If we retry after a few seconds, the metadata should normally be available.
Side note: it's possible that the importance of having Kubernetes metadata attached differs across use cases. Some customers might want their logs to arrive without metadata if their workload does not rely on it; others might prefer their logs to arrive with metadata even if the logs are delayed. Some customization (either at the Fluent Bit level or at the pod annotation level) could be very useful.
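(For context on the pod-annotation idea: the kubernetes filter already honors per-pod annotations when K8S-Logging.Parser / K8S-Logging.Exclude are enabled, as in the configs above; a "prefer metadata over latency" switch could hypothetically use the same mechanism. The pod below is purely illustrative.)
apiVersion: v1
kind: Pod
metadata:
  name: my-app                      # hypothetical pod, for illustration only
  annotations:
    fluentbit.io/parser: json       # existing annotation: parse this pod's logs with the "json" parser
    fluentbit.io/exclude: "true"    # existing annotation: drop this pod's logs entirely
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image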
Also, I think there's another bug. When the kubernetes filter can't get data from the Kubernetes API server, the kubernetes field is completely empty, even though some metadata can be obtained from the record's tag. At least that information (like pod_name, namespace_name, etc.) should be included when the query against the API server fails.
@arodriguezdlc are you using the latest v1.3.5 ?
Yes
Running with aws-for-fluent-bit v2.10.0 (Fluent Bit 1.6.8) and the issue is still there.
I have one potential workaround for retrying the metadata once if a metadata request fails:
# Kubernetes filter
[FILTER]
# https://docs.fluentbit.io/manual/filter/kubernetes
Name kubernetes
Match k8s_container.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
Annotations Off
# Begin kubernetes filter retry logic
# Lift the kubernetes namespace_name to root log level
[FILTER]
Name nest
Match k8s_container.*
Operation lift
Nested_under kubernetes
Add_prefix kubernetes.
# Add kubernetes key if it does not exist
[FILTER]
Name modify
Match k8s_container.*
Add kubernetes.namespace_name unique_random_xifaoithxlijtrhelijhst
# Retry all filters iff kubernetes.namespace_name does not exist
[FILTER]
Name rewrite_tag
Match k8s_container.*
Rule $kubernetes.namespace_name ^(unique_random_xifaoithxlijtrhelijhst)$ $TAG false
Emitter_Name re_emitted
# Remove lifted kubernetes key
[FILTER]
Name modify
Match k8s_container.*
Remove_wildcard kubernetes.
The rewrite_tag filter is usually used for rewriting the name of a tag based on certain conditions: https://docs.fluentbit.io/manual/pipeline/filters/rewrite-tag
It has a side effect that may be beneficial for our specific case: all filters that are applied before the rewrite tag are retried.
I haven't tried the above config yet, but I think it may give a general outline of what might be needed to configure kubernetes retry without any code changes.
If you try the above and find that it works, let me know; adjustments are most likely going to be needed. Please post a revised version of the retry config logic if you have it. This solution assumes that [FILTER] nest with Operation lift will not trigger an error if the nested field does not exist.
It turns out that the default (32k) Buffer_Size is the root cause, and we decided to set it to 0, since we have resource limits on the fluent-bit pod.
Here is an example log from Fluent Bit:
[2022/05/31 16:10:40] [ warn] [http_client] cannot increase buffer: current=32000 requested=64768 max=32000
It would be great if the kubernetes plugin could emit an error log instead of a warning here.
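For anyone hitting the same warning, here is a minimal sketch of the filter with the response buffer limit removed (Buffer_Size 0 lets the buffer grow as needed; the Match pattern and Kube_URL are taken from configs earlier in this thread and may differ in yours):
[FILTER]
    # 0 removes the 32k cap on the API response buffer
    Name         kubernetes
    Match        k8s_container.*
    Kube_URL     https://kubernetes.default.svc.cluster.local:443
    Buffer_Size  0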
Because I came here looking for how to solve my issue, let me add some more info on a possible reason for missing metadata. First, the docs of the k8s filter say this about the buffer size:
Set the buffer size for HTTP client when reading responses from Kubernetes API server. The value must be according to the Unit Size specification. A value of 0 results in no limit, and the buffer will expand as-needed. Note that if pod specifications exceed the buffer limit, the API response will be discarded when retrieving metadata, and some kubernetes metadata will fail to be injected to the logs.
The default size is 32k. So if your pod (think the YAML of your pod) is more than 32k, you will not get metadata for that pod. To change it, for example to 48k via the Helm chart, add this to the values.yaml:
Buffer_Size 48k
Or if you are running an older version of the chart:
extraEntries:
filter: |-
Buffer_Size 48k
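With current versions of the fluent/fluent-bit Helm chart, the whole filter block (including Buffer_Size) typically lives under config.filters in values.yaml; this is a sketch assuming that layout, so adjust Match and the other options to your deployment:
config:
  filters: |
    [FILTER]
        Name        kubernetes
        Match       kube.*
        Merge_Log   On
        Buffer_Size 48k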
For exploratory purposes... perhaps for assistance...
If you don't require additional metadata enriched from the Kubernetes API server in your log events, you can remove the kubernetes filter and extract the metadata embedded in the log file name of your container's pod with a regex.
The provided Fluent Bit configurations below will generate events with this format:
{
"date": 1704428569.177147,
"pod_name": "my-app-687f7d8bb5-vpz9q",
"namespace_name": "default",
"container_name": "my-container",
"log_file_path": "/var/log/containers/my-app-687f7d8bb5-vpz9q_default_my-container-c43c40ba9d128336cd01efd3d7430ccaee92ad5c2053bff0129473997dde2d39.log",
"log": "2024-01-05T04:22:49.17707094Z stdout F 127.0.0.1 - - [05/Jan/2024:04:22:49 +0000] \"GET /favicon.ico HTTP/1.1\" 200 769 \"http://127.0.0.1:8080/\" \"Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0\"",
"hostname": "release-name-fluent-bit-6jscz"
}
/fluent-bit/etc/custom_parsers.conf
[PARSER]
Name kubernetes_container_logs_metadata
Format regex
Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
This regex is identical to the one utilized by the Kubernetes Filter. More details:
/fluent-bit/etc/fluent-bit.conf
[SERVICE]
Daemon Off
Flush 1
Log_Level info
Parsers_File /fluent-bit/etc/parsers.conf
Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
[INPUT]
Name tail
Path /var/log/containers/my*.log
Tag kube.*
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Path_Key log_file_path
[FILTER]
Name parser
Match *
Key_Name log_file_path
Parser kubernetes_container_logs_metadata
Preserve_Key On
Reserve_Data On
[FILTER]
Name record_modifier
Match kube.*
Record hostname ${HOSTNAME}
Whitelist_key hostname
Whitelist_key pod_name
Whitelist_key namespace_name
Whitelist_key container_name
Whitelist_key time
Whitelist_key log
Whitelist_key log_file_path
[OUTPUT]
Name stdout
Match *
Format json_lines
Bug Report
Describe the bug
In Kubernetes, we are using a Log Forwarder (one Fluent Bit per node) + Log Aggregator (a few Fluentd instances per cluster) infrastructure to collect logs.
We are seeing the following errors in the Fluentd log:
After taking a closer look, it seems that the kubernetes filter plugin for Fluent Bit sometimes did not attach the additional Kubernetes metadata fields as expected. As a result, record["kubernetes"]["namespace_name"] is not reliably present.
Expected behavior
If the kubernetes filter plugin fails to query the API to get metadata, it should either retry within the plugin, or error out so that the log entry can be put back in the queue and re-processed later. Right now, it seems to be failing silently and just passing the log record on to the forward output plugin without attaching the expected metadata.
Versions
Fluent Bit v1.1.3
Fluentd v1.4.2
fluent-plugin-record-modifier Plugin v2.0.1
Kubernetes v1.12.7
Fluent Bit Configuration
Fluentd Configuration
Any suggestion / workaround is welcome as well, as long as there is a way for us to force it to retry and make sure the kubernetes metadata is always present.