fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Gzip Decompression Failure Due to 100MB Limit in Fluent Bit 3.0.7 #9058

Open aydosman opened 3 months ago

aydosman commented 3 months ago

Bug Report

I'm encountering an issue with Fluent Bit where gzip decompression fails because the decompressed payload exceeds the maximum decompression size of 100MB. Below are the relevant error logs and configurations for both the collector and aggregator.

To Reproduce

Example log message

[2024/07/08 08:05:26] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:05:26] [error] [input:forward:forward.0] gzip uncompress failure
[2024/07/08 08:05:52] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:05:52] [error] [input:forward:forward.0] gzip uncompress failure
[2024/07/08 08:06:08] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:06:08] [error] [input:forward:forward.0] gzip uncompress failure
[2024/07/08 08:06:20] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:06:20] [error] [input:forward:forward.0] gzip uncompress failure

Steps to reproduce the problem

Set up Fluent Bit with the provided collector and aggregator configurations.

Monitor the logs for gzip decompression errors.

Expected behavior

Fluent Bit should handle the gzip decompression without exceeding the maximum decompression size limit.

Screenshots

N/A

Your Environment

Version used: Fluent Bit 3.0.7

Configuration:

Collector Configuration:

[SERVICE]
    daemon false
    log_level warn
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.max_chunks_up 32
    storage.backlog.mem_limit 32MB
    storage.metrics true
    storage.delete_irrecoverable_chunks true
    http_server true
    http_listen 0.0.0.0
    http_Port 2020

[INPUT]
    name tail
    path /var/log/containers/*.log
    tag_regex (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    tag kube.<namespace_name>.<pod_name>.<container_name>
    read_from_head true
    multiline.parser cri
    skip_long_lines true
    skip_empty_lines true
    buffer_chunk_size 32KB
    buffer_max_size 32KB
    db /var/fluent-bit/state/flb-storage/tail-containers.db
    db.sync normal
    db.locking true
    db.journal_mode wal
    storage.type filesystem

[OUTPUT]
    name forward
    match *
    host fluent-bit-aggregator.observability.svc.cluster.local
    port 24224
    compress gzip
    workers 2
    retry_limit false
    storage.total_limit_size 16GB

Aggregator Configuration:

[SERVICE]
    daemon false
    log_level warn
    storage.path /fluent-bit/data
    storage.sync full
    storage.backlog.mem_limit 128M
    storage.metrics true
    storage.delete_irrecoverable_chunks true
    storage.max_chunks_up 64
    http_server true
    http_listen 0.0.0.0
    http_Port 2020

[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
    buffer_chunk_size 1M
    buffer_max_size 4M
    storage.type filesystem

[OUTPUT]
    name loki
    match *
    host loki-gateway.logging.svc.cluster.local
    port 80
    line_format json
    auto_kubernetes_labels false
    label_keys $cluster, $namespace, $app
    storage.total_limit_size 16GB

Environment name and version (e.g. Kubernetes? What version?)

Kubernetes 1.30, 1.29, 1.28

Server type and version

AKS/EKS

Operating System and version

Ubuntu, AL2, AL2023 and BottlerocketOS

Filters and plugins

See above

Additional context

This issue persists across all Fluent Bit instances with the same configuration. Both collector and aggregator are using the same Fluent Bit version (3.0.7). The rate of records processed per second is consistently around 800 (so not too much). Any guidance or solution to resolve this issue would be greatly appreciated.

edsiper commented 3 months ago

Just curious, what's the use case where one payload might expand to over 100MB?

Today that's a hard limit; we will need to make it adjustable per component. Besides that, is in_forward being used in other areas of your use case?
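
For context, the limit in the error message is an upper bound on the size of the inflated payload, not on the compressed bytes received. Below is a minimal sketch of what such a size-capped gzip decompression looks like with plain zlib; the 100MB constant, the growth strategy and the function name are assumptions for illustration only, not Fluent Bit's actual implementation.

/* sketch.c: size-capped gzip decompression with zlib (illustrative only) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define MAX_DECOMPRESS_SIZE (100 * 1024 * 1024)  /* assumed 100MB cap */

/* Inflate a gzip payload, refusing to grow the output past the cap.
 * Returns 0 on success, -1 on error (including hitting the cap). */
static int gzip_uncompress_capped(const void *in, size_t in_len,
                                  void **out, size_t *out_len)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));

    /* windowBits 15 + 32 lets zlib auto-detect gzip or zlib headers */
    if (inflateInit2(&strm, 15 + 32) != Z_OK) {
        return -1;
    }

    size_t cap = 64 * 1024;
    unsigned char *buf = malloc(cap);
    if (!buf) {
        inflateEnd(&strm);
        return -1;
    }

    strm.next_in  = (Bytef *) in;
    strm.avail_in = (uInt) in_len;

    int ret;
    size_t used = 0;
    do {
        if (used == cap) {
            if (cap >= MAX_DECOMPRESS_SIZE) {
                fprintf(stderr, "maximum decompression size exceeded\n");
                free(buf);
                inflateEnd(&strm);
                return -1;
            }
            cap = cap * 2 > MAX_DECOMPRESS_SIZE ? MAX_DECOMPRESS_SIZE : cap * 2;
            unsigned char *tmp = realloc(buf, cap);
            if (!tmp) {
                free(buf);
                inflateEnd(&strm);
                return -1;
            }
            buf = tmp;
        }
        strm.next_out  = buf + used;
        strm.avail_out = (uInt) (cap - used);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            free(buf);
            inflateEnd(&strm);
            return -1;
        }
        used = strm.total_out;
    } while (ret != Z_STREAM_END);

    inflateEnd(&strm);
    *out = buf;
    *out_len = used;
    return 0;
}

With the collector's forward output set to compress gzip, a forwarded chunk only trips a check like this when its records inflate past the cap, which is why the question above about payload size matters.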

aydosman commented 2 months ago

Could it be down to back pressure on the collector side? Let me try to prove that. I'll run some simulations and provide all the related metrics.

Is in_forward being used in other areas of your use case?

Not at this time

mirko-lazarevic commented 2 months ago

@edsiper I experience the same issue with Fluent Bit version 3.0.4; however, using the same configuration with Fluent Bit version 2.2.2 we don't encounter this error. I believe, although I might be wrong, that the error was introduced with this change: https://github.com/fluent/fluent-bit/pull/8665

FYI: @cosmo0920

stevehipwell commented 2 months ago

@edsiper has this been investigated?

cosmo0920 commented 2 months ago

Hi, I'm trying to add full validation of concatenated gzip streams in forwarded payloads in https://github.com/fluent/fluent-bit/pull/9139. Would you mind trying to test that patch?
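
For anyone following along, a "concatenated gzip stream" is a payload that contains several gzip members back to back, which is a valid gzip construct. Below is a minimal sketch of walking such a payload with plain zlib; it only counts the members and is an illustration of the idea under stated assumptions, not the code in that PR.

/* members.c: walk concatenated gzip members with zlib (illustrative only) */
#include <string.h>
#include <zlib.h>

/* Inflates every gzip member in 'in' and returns how many were found,
 * or -1 on a corrupted/truncated stream. Output bytes are discarded here;
 * a real consumer would append them to a growing buffer. */
static int count_gzip_members(const unsigned char *in, size_t in_len)
{
    z_stream strm;
    unsigned char out[16384];
    int members = 0;

    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 15 + 32) != Z_OK) {   /* 15 + 32: gzip/zlib auto-detect */
        return -1;
    }

    strm.next_in  = (Bytef *) in;
    strm.avail_in = (uInt) in_len;

    while (strm.avail_in > 0) {
        strm.next_out  = out;
        strm.avail_out = sizeof(out);
        int ret = inflate(&strm, Z_NO_FLUSH);
        if (ret == Z_STREAM_END) {
            members++;
            /* another member may follow in the same buffer: reset and continue */
            if (strm.avail_in > 0 && inflateReset2(&strm, 15 + 32) != Z_OK) {
                members = -1;
                break;
            }
        }
        else if (ret != Z_OK) {
            members = -1;
            break;
        }
    }

    inflateEnd(&strm);
    return members;
}

If the sender batches several compressed chunks into one request, the receiver has to loop like this rather than stop at the first Z_STREAM_END.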

stevehipwell commented 2 months ago

@cosmo0920 is there an OCI image built as part of the PR?

cosmo0920 commented 2 months ago

No. I tried to generate PR-specific images, but no luck.

stevehipwell commented 1 month ago

Has this been fixed in v3.1.5?

aydosman commented 1 month ago

fb version 3.1.5 – Bug still exists
fb version 3.1.6 – Bug still exists

To add, and to prove a theory we had that the data we send/persist (the DBs on the collector and aggregator) might have somehow become corrupted, these versions were tested on fresh new cloud nodes.

cosmo0920 commented 1 month ago

fb version 3.1.5 – Bug still exists
fb version 3.1.6 – Bug still exists

To add, and to prove a theory we had that the data we send/persist (the DBs on the collector and aggregator) might have somehow become corrupted, these versions were tested on fresh new cloud nodes.

Do you have reproducible steps?

aydosman commented 1 month ago

fb version 3.1.5 – Bug still exists
fb version 3.1.6 – Bug still exists

To add, and to prove a theory we had that the data we send/persist (the DBs on the collector and aggregator) might have somehow become corrupted, these versions were tested on fresh new cloud nodes.

Do you have reproducible steps?

The configuration shown above has not changed; only the Fluent Bit container image version has been updated. Let me know if you need anything else.

aydosman commented 1 week ago

Any update on this issue?