Closed: qingling128 closed this issue 3 years ago
The symptom
[debug] [upstream] connection #NNN failed to logging.googleapis.com:443
originates in a TLS handshake error in tls_net_handshake,
which is triggered by a broken connection (errno = EPIPE).
However, according to the latest findings, the issue could be related to fluent-bit accepting an IPv6 result for the DNS query when the system does not have an IPv6 address configured. The reason why this does not result in an error in net_connect_async
is still unknown, but so far the evidence strongly backs this theory.
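The failure mode described above can be sketched in a few lines. This is not fluent-bit's actual resolver code (which is written in C); `prefer_ipv4` is a hypothetical helper illustrating the fix direction, reordering getaddrinfo()-style results so A records are tried before AAAA records on a host without an IPv6 route:

```python
import socket

def prefer_ipv4(results):
    """Reorder getaddrinfo()-style results so IPv4 (A-record) entries
    come first. If the resolver instead hands back an AAAA result first
    on a host with no IPv6 connectivity, connecting to that address
    fails, surfacing as EPIPE / TLS handshake errors further up."""
    v4 = [r for r in results if r[0] == socket.AF_INET]
    other = [r for r in results if r[0] != socket.AF_INET]
    return v4 + other

# Synthetic results in the 5-tuple shape socket.getaddrinfo() returns:
# (family, type, proto, canonname, sockaddr)
results = [
    (socket.AF_INET6, socket.SOCK_STREAM, socket.IPPROTO_TCP, "",
     ("2404:6800:4004:813::200a", 443, 0, 0)),
    (socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, "",
     ("142.251.42.138", 443)),
]

print(prefer_ipv4(results)[0][4][0])  # -> 142.251.42.138
```

The addresses are the ones reported later in this thread; on a dual-stack host with working IPv6, preferring IPv6 first would be harmless, which is why the bug only bites systems without an IPv6 address configured.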
If you want to test this in your own system, the easiest way to do it would be to add the following line to your /etc/hosts file:
173.194.213.95 logging.googleapis.com
I am also seeing these logs when trying to push logs to DataDog. Fluent Bit version: 1.8.9. Even the retry doesn't seem to work.
Is it possible to verify if the problem relates to DNS?
Is there a workaround in place?
> I am also seeing these logs when trying to push logs to DataDog. Fluent Bit version: 1.8.9. Even the retry doesn't seem to work.
> Is it possible to verify if the problem relates to DNS?
> Is there a workaround in place?
You have two ways to confirm that this issue is what's affecting you at the moment:
Add an entry in your /etc/hosts file with the IP/host mapping for the endpoint (I think it's http-intake.logs.datadoghq.com, and the IP I got for it is 3.233.146.17)
Clone and build the PR that's linked to this issue, where I fixed it
Don't hesitate to contact me in the fluent slack server if you have any questions.
@leonardo-albertovich Thanks for the hints. During my tests today I didn't encounter any connection issues. If they pop up again I will try the first thing.
PRs in progress to fix this:
Some errors I met after upgrading to v1.8.9:
First issue:
[2021/11/10 16:20:09] [ warn] [engine] failed to flush chunk '1-1636561198.640695704.flb', retry in 42 seconds: task_id=1, input=tail.0 > output=stackdriver.0 (out_id=0)
[2021/11/10 16:20:09] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=11): Could not contact DNS servers
Second issue is similar to https://github.com/fluent/fluent-bit/issues/4120, but I set the buffer_limit to 0 in the kubernetes filter and it didn't work for me.
I am wondering whether we need to set the 4192 here to 0: https://github.com/fluent/fluent-bit/blob/431459122841c4600abe6e384fcfb56e5967b276/plugins/out_stackdriver/stackdriver.c#L2153
@leonardo-albertovich,
I have Fluent Bit running as a DaemonSet in K3s.
I switched to the 1.8.x-debug image and, within a terminal in the running container, I ran nslookup:
/ # nslookup logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: logging.googleapis.com
Address: 2404:6800:4004:813::200a
*** Can't find logging.googleapis.com: No answer
/ # nslookup -type=a logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Name: logging.googleapis.com
Address: 142.251.42.138
/ # nslookup -type=aaaa logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: logging.googleapis.com
Address: 2404:6800:4004:81d::200a
/ # nslookup logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Name: logging.googleapis.com
Address: 142.251.42.138
*** Can't find logging.googleapis.com: No answer
After that, Fluent Bit was able to connect to Stackdriver - and after a Pod restart the DNS issue was back.
The issue is not specific to 1.8.9. I observed the same behaviour in 1.8.3 and 1.8.4.
I've worked around the issue by adding dnsConfig to the DaemonSet.
dnsConfig:
  nameservers:
    - 1.1.1.1
    - 8.8.8.8
  options:
    - name: ndots
      value: "1"
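For reference, in a DaemonSet manifest this block sits under spec.template.spec. A minimal sketch of the placement (field names per the Kubernetes Pod spec; the dnsPolicy setting is an assumption about how the workaround was applied, since the original comment didn't show it):

```yaml
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      # "None" makes the Pod use only the nameservers listed below,
      # instead of merging them with the cluster DNS settings.
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 1.1.1.1
          - 8.8.8.8
        options:
          - name: ndots
            value: "1"
```

With the default ClusterFirst policy the dnsConfig entries are merged with the cluster DNS instead of replacing it, which may or may not bypass the misbehaving resolver.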
I am not completely sure whether the issues are related but as I said, the symptoms (error messages) are the same.
@fabito the issue reported by @qingling128 was related to a specific GCE detail (instances cannot connect to some hosts through the external IPv6 address), which was exposed because the DNS client incorporated in 1.8.5 produces results in a way that differs from how the system resolver produces them.
Regardless of that, the underlying issue is in the connection error detection and handling mechanisms, which is what the related PR fixes.
In your case, if I understood you correctly, you are not running in GCE and you actually do have DNS issues if you don't override the default nameservers (DHCP?) with those, which means the issue is different.
What do you think?
> In your case, if I understood you correctly, you are not running in GCE and you actually do have DNS issues if you don't override the default nameservers (DHCP?) with those, which means the issue is different.
You are right, I am not in GCE. What intrigued me is the fact that Fluent Bit is the only workload that needed that DNS tweak - we have other workloads (within the same k3s instance) that connect to other Google APIs (storage, pubsub, etc.) which are working just fine.
In https://github.com/fluent/fluent-bit/pull/4295 you mentioned the order in which the DNS answers arrive.
You can notice in the nslookup executions I cited above that when -type is not specified, nslookup seems to randomly pick either AAAA or A as its first query (and it always fails on the second query).
Anyway, Fluent Bit probably has its own DNS client implementation (decoupled from nslookup), but I still thought it could be helpful to share my findings :-)
You are right, fluent-bit does not use the system resolver, and that's the origin of the issue in this case. I'd like to know more about your particular case, so it would be really helpful if you could clone and build the branch from PR #4295.
If you want to try to get to the bottom of it, feel free to contact me in the fluent slack so we can determine whether yours is a different problem and, if so, what its root cause is.
Thanks for chiming in!
Edit: My name in slack is Leonardo Almiñana
Bug Report
Describe the bug
When Fluent Bit 1.8.9 first restarts to apply configuration changes, we are seeing spamming errors in the log. When we enabled debug logging, it showed that the errors are the result of a failure to connect to the backend.
The same issue does not happen with Fluent Bit 1.8.4.
To Reproduce
We were testing with GCE VMs.
Step 1. Create a VM
Step 2. In the VM, install Fluent Bit and start / restart it.
// For my case, the Fluent Bit binary comes with the Ops Agent install. But it should be reproducible with just a regular Fluent Bit binary.
// Ensure nothing is in the buffer and no prior Fluent Bit process was running
// Start and stop Fluent Bit
/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit is Fluent Bit 1.8.9
// Start Fluent Bit again
Step 3. Observe warnings in the log
** Full Fluent Bit log fluent-bit-trimmed.log
** Fluent Bit Metrics
Expected behavior
Fluent Bit resumes log ingestion smoothly after restarts.
Actual behavior
Lots of failed chunks. Further investigation revealed that some of the bad connections never recover.
Your Environment
Version used: Fluent Bit 1.8.9 (The same issue does not exist in 1.8.4)
Configuration:
[SERVICE]
    Daemon                    off
    Flush                     1
    HTTP_Listen               0.0.0.0
    HTTP_PORT                 2020
    HTTP_Server               On
    Log_Level                 info
    storage.backlog.mem_limit 50M
    storage.checksum          on
    storage.max_chunks_up     128
    storage.metrics           on
    storage.sync              normal

[INPUT]
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5M
    DB                ${buffers_dir}/default_pipeline_syslog
    Key               message
    Mem_Buf_Limit     10M
    Name              tail
    Path              /var/log/messages,/var/log/syslog
    Read_from_Head    True
    Rotate_Wait       30
    Skip_Long_Lines   On
    Tag               default_pipeline.syslog
    storage.type      filesystem

[FILTER]
    Add   logName syslog
    Match default_pipeline.syslog
    Name  modify

[FILTER]
    Emitter_Mem_Buf_Limit 10M
    Emitter_Storage.type  filesystem
    Match                 default_pipeline.syslog
    Name                  rewrite_tag
    Rule                  $logName .* $logName false

[FILTER]
    Match  syslog
    Name   modify
    Remove logName

[OUTPUT]
    Match_Regex       ^(syslog)$
    Name              stackdriver
    Retry_Limit       3
    resource          gce_instance
    stackdriver_agent Google-Cloud-Ops-Agent-Logging/2.6.0 (BuildDistro=sles15;Platform=linux;ShortName=sles;ShortVersion=15-SP2)
    tls               On
    tls.verify        Off
    workers           8