fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

No upstream connections available after time out #5424

Closed · DutchEllie closed this issue 2 years ago

DutchEllie commented 2 years ago

**Bug Report**

After deploying fluent-bit using Helm on my Kubernetes cluster, I get errors when trying to export logs to a Graylog server using the GELF output.

**To Reproduce**

......
[2022/05/06 12:57:56] [error] [output:gelf:gelf.0] no upstream connections available
[2022/05/06 12:57:57] [error] [upstream] connection #63 to x.x.x.x:12201 timed out after 10 seconds
[2022/05/06 12:57:57] [error] [output:gelf:gelf.0] no upstream connections available
[2022/05/06 12:57:59] [error] [upstream] connection #72 to x.x.x.x:12201 timed out after 10 seconds
[2022/05/06 12:57:59] [error] [upstream] connection #65 to x.x.x.x:12201 timed out after 10 seconds
[2022/05/06 12:57:59] [error] [output:gelf:gelf.0] no upstream connections available
[2022/05/06 12:57:59] [error] [output:gelf:gelf.0] no upstream connections available
[2022/05/06 12:58:02] [error] [upstream] connection #64 to x.x.x.x:12201 timed out after 10 seconds
[2022/05/06 12:58:02] [error] [upstream] connection #68 to x.x.x.x:12201 timed out after 10 seconds
[2022/05/06 12:58:02] [error] [output:gelf:gelf.0] no upstream connections available
[2022/05/06 12:58:02] [error] [output:gelf:gelf.0] no upstream connections available
......
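For what it's worth, the 10 seconds in these messages appears to be Fluent Bit's default `net.connect_timeout`. A sketch of how that could be raised through the Helm values is below; this is illustrative only, since a longer timeout would only help if the destination were slow to accept connections rather than unreachable:

```yaml
# Illustrative tweak, not the fix: raise the per-connection connect timeout
# for the GELF output from the 10-second default.
config:
  outputs: |
    [OUTPUT]
        Name                 gelf
        Match                kube.*
        Host                 x.x.x.x
        Port                 12201
        Mode                 tcp
        net.connect_timeout  30
```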

**Expected behavior**
I expect the log messages to be delivered to Graylog.

**Your Environment**

    serviceAccount:
      create: true
      name: fluent-bit

    rbac:
      create: true
      nodeAccess: false

    service:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus

    config:
      service: |
        [SERVICE]
            Flush             5
            Log_Level         error
            Daemon            off
            Parsers_File      parsers.conf
            HTTP_Server       On
            HTTP_Listen       0.0.0.0
            HTTP_Port         {{ .Values.metricsPort }}
            Health_Check      On

        #    @INCLUDE input-kubernetes.conf
        #    @INCLUDE filter-kubernetes.conf
        #    @INCLUDE output-graylog.conf

      inputs: |
        [INPUT]
            Name              tail
            Tag               kube.*
            Path              /var/log/containers/*.log
            Parser            docker
            DB                /var/log/flb_kube.db
            Mem_Buf_Limit     5MB
            Refresh_Interval  10

      filters: |
        # Enrich with kubernetes properties
        [FILTER]
            Name                    kubernetes
            Match                   kube.*
            Merge_Log_Key           log
            Merge_Log               On
            Keep_Log                Off
            Annotations             Off
            Labels                  Off

        [FILTER]
            Name                    modify
            Match                   *
            Add                     kubernetes_cluster_name my-cluster

        # Drop gitlab-managed-apps
        [FILTER]
            Name       grep
            Match      kube.*
            Exclude    $kubernetes['namespace_name'] gitlab\-managed\-apps

        # Flatten context
        [FILTER]
            Name                    nest
            Match                   kube.*
            Operation               lift
            Nested_under            log

        # Set 'message' property if no context exists
        [FILTER]
            Name        modify
            Match       *
            Condition   Key_Does_Not_Exist message
            Rename      log message

      outputs: |
        [OUTPUT]
            Name                    gelf
            Match                   kube.*
            Host                    x.x.x.x
            Port                    12201
            Mode                    tcp
            Gelf_Short_Message_Key  message

    extraFiles:
      parsers.conf: |
        [PARSER]
            Name         docker
            Format       json
            Time_Key     time
            Time_Format  %Y-%m-%dT%H:%M:%S.%L
            Time_Keep    Off
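For reference, a values file like the one above is applied roughly as follows. The repository URL follows the fluent/helm-charts README; the release name, namespace, and `values.yaml` filename are placeholders rather than the exact commands used here:

```sh
# Add the Fluent chart repository and install/upgrade the release with the
# custom values shown above.
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit \
  --namespace logging --create-namespace \
  -f values.yaml
```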



* Environment name and version: Kubernetes 1.22 running on EKS
* Server type and version: EC2 t3.large. Kubelet version v1.22.6-eks-7d68063
* Operating System and version: Linux 5.4.188-104.359.amzn2.x86_64
* Filters and plugins: See entire config pasted above.

**Additional context**
I was trying to export logs from Kubernetes to our central Graylog server. This Graylog server is also hosted on AWS and is exposed to the internet. Testing its GELF input with `echo` and `ncat` from my laptop works perfectly (see the sketch below). Docker on the Graylog host also happily ships logs to it locally, so Graylog itself seems to work fine.  
The configuration above is my values file for the [Helm chart](https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit). Nothing has been left out except the destination IP address, and it is pretty much a carbon copy of the [GELF configuration example](https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit).  
The error messages listed above repeat constantly for as long as the application runs.  
The readiness probe (checking for a 200 on pod:2020/api/v1/health) returns a 500, but gives no further useful information.  
An almost identical configuration is already in use at our company, just not deployed with Helm; it works and is in production. The goal here is to modernize that configuration and port it to Helm.  
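For reference, the manual GELF test mentioned above was along these lines (a sketch, using `printf` rather than `echo` for a clean NUL terminator; `x.x.x.x` stands in for the Graylog host as elsewhere in this report):

```sh
# GELF over TCP expects NUL-terminated JSON frames, so a single test message
# can be pushed straight at the Graylog input.
printf '{"version":"1.1","host":"laptop","short_message":"gelf connectivity test"}\0' \
  | ncat x.x.x.x 12201
```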

I have tried everything I can think of, including but not limited to:
- Using UDP
- Using TLS
- Using a DNS hostname instead of IP
- Using a different (absolutely known to work) Graylog server
- Reverting to complete default configuration (with and without output to graylog)
- Using a different port
- Restarting the node / pod
- Reinstalling the helm release
- And more... it's been a long day and I can't remember everything I tried, but it was a lot

If anyone knows what this might be, I would be so grateful.
DutchEllie commented 2 years ago

Okay, I am so sorry for even bothering anyone to read my message. I figured out what it was; it's something I never would have thought of until I accidentally stumbled across it. It turns out my cluster's AWS security group did not allow outbound traffic, which is why the connection never worked.
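In hindsight, a quick outbound check from inside the cluster would have surfaced this long before Fluent Bit itself came under suspicion. Something along these lines would do it (illustrative only; the busybox image and the `-z`/`-v` flags assume an `nc` build that supports them):

```sh
# Throwaway pod that just tries to open a TCP connection to the GELF port.
# With outbound traffic blocked by the security group, this times out the same
# way Fluent Bit's upstream connections did.
kubectl run nettest --rm -it --restart=Never --image=busybox -- \
  nc -zv -w 5 x.x.x.x 12201
```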