Open rawahars opened 2 years ago
Based on my debugging, I concluded that in the failure mode (i.e. with async mode enabled) we follow the codepath here. This calls `flb_net_getaddrinfo`, which in turn calls the c-ares package for DNS resolution.
In the happy path, when using plugins with async mode disabled, the codepath followed is this one. This calls the `getaddrinfo` API for DNS resolution.
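For reference, a blocking lookup through the system resolver looks roughly like the sketch below. It is not fluent-bit's code; `demo_sync_resolve` is a hypothetical helper, and only the POSIX `getaddrinfo`/`freeaddrinfo` calls are real.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: blocking lookup via the system resolver.
 * Returns 0 on success or the EAI_* code reported by getaddrinfo().
 */
int demo_sync_resolve(const char *host, const char *port, int ai_flags)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    assert(host != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 results */
    hints.ai_socktype = SOCK_STREAM; /* TCP, as an HTTP output would use */
    hints.ai_flags = ai_flags;

    /* The call blocks until the system resolver answers or gives up. */
    rc = getaddrinfo(host, port, &hints, &res);
    if (rc == 0) {
        freeaddrinfo(res);
    }
    return rc;
}
```

Calling `demo_sync_resolve("www.google.com", "443", 0)` would exercise the system resolver; passing `AI_NUMERICHOST` with a literal IP address skips DNS entirely.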
I also tried compiling the plugins with async mode disabled using `upstream->flags &= ~(FLB_IO_ASYNC);`. The plugins then work properly, with DNS resolution succeeding via the secondary DNS server.
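As an illustration of what that one-line change does, here is a minimal sketch of clearing a single bit in a flags field. The struct, the `DEMO_*` constants and their values are hypothetical stand-ins, not fluent-bit's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real flags are defined in fluent-bit's headers. */
#define DEMO_IO_TCP   1
#define DEMO_IO_ASYNC 8

struct demo_upstream {
    int flags;
};

/* Clear the async bit and leave every other flag untouched. */
void demo_disable_async(struct demo_upstream *u)
{
    assert(u != NULL);
    u->flags &= ~(DEMO_IO_ASYNC);
}

/* Return 1 if the async bit is set, 0 otherwise. */
int demo_is_async(const struct demo_upstream *u)
{
    assert(u != NULL);
    return (u->flags & DEMO_IO_ASYNC) != 0;
}
```

Starting from `DEMO_IO_TCP | DEMO_IO_ASYNC`, calling `demo_disable_async` leaves only `DEMO_IO_TCP` set.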
Yes, due to some constraints in the original design, the async DNS client aborts after the first resolution error. A refactor is on the way, but we don't have an ETA yet. If you are interested in contributing, let me know; I wrote the code, so I can probably help if you have any questions.
@leonardo-albertovich How come the setting I see here doesn't fix it: https://github.com/fluent/fluent-bit/blob/1.9/src/flb_upstream.c#L43
Does that not do anything?
What that setting does is select between c-ares, which is asynchronous, and the default system resolver, which is synchronous. If you can accept those DNS queries blocking, then it should be fine. If you need to minimize the overhead, you can use a non-authoritative local caching DNS server, as most modern distributions do (in the end this would be the same for both async and sync, since the real query wouldn't be performed by fluent-bit).
So we determined that setting this is a valid workaround for the problem:
net.dns.mode LEGACY
For Windows container users, all outputs will need this option set.
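As a rough sketch, the workaround on a single http output might look like the fragment below. The `Host` value is a placeholder and the other keys are only illustrative; the `net.dns.mode LEGACY` line is the part taken from this thread.

```
[OUTPUT]
    Name          http
    Match         *
    Host          logs.example.com
    Port          443
    net.dns.mode  LEGACY
```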
I am wondering if we could consider contributing a new environment variable that acts as a global setting for `net.dns.mode`, or a Service-level setting. Something to provide a better user experience, so that the config is only set once.
We could hide this new setting or env var behind a new CMake flag which would default to `Off`. So, for example, I can enable it in the AWS distro, but if the rest of the community does not want the setting, it won't be built by default.
`net.dns.mode` can be set in the `[SERVICE]` section and overridden on a per-plugin basis if desired; all of the DNS settings support that.
Why is this flagged as Windows? I had the same problem on Flatcar, so I'm assuming this applies to any OS, right?
Bug Report
Describe the bug
In fluent-bit, plugins can use async mode to improve performance. This is the default setting, so many plugins use it.
Consider a scenario wherein the host has multiple DNS servers `[x.x.x.x, y.y.y.y]` such that the first/primary DNS server (`x.x.x.x`) does not resolve the endpoint but the secondary DNS server (`y.y.y.y`) resolves it correctly. In such cases, the plugins using async mode try DNS resolution with the primary DNS server only and fail without ever trying the secondary DNS server. The error for the same is-
This is in contrast to how DNS resolution should happen. The expected behaviour is for resolution to be tried using all the servers in DNS Server list before conceding error.
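The expected behaviour can be sketched as a loop over the server list that only reports failure once every server has been tried. This is an illustration, not fluent-bit or c-ares code; `demo_query_fn`, `demo_resolve_with_fallback` and `demo_fake_query` are hypothetical names, and the fake query simply pretends that only `y.y.y.y` can answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical query callback: returns 0 on success, non-zero on failure. */
typedef int (*demo_query_fn)(const char *server, const char *host);

/*
 * Try each DNS server in order and stop at the first success.
 * Returns the index of the server that answered, or -1 if all failed.
 */
int demo_resolve_with_fallback(const char *const *servers, size_t count,
                               const char *host, demo_query_fn query)
{
    size_t i;

    assert(servers != NULL && query != NULL);

    for (i = 0; i < count; i++) {
        if (query(servers[i], host) == 0) {
            return (int) i;
        }
        /* On failure, fall through to the next server instead of aborting. */
    }
    return -1;
}

/* Fake query for the demo: only the secondary server resolves the host. */
int demo_fake_query(const char *server, const char *host)
{
    (void) host;
    return strcmp(server, "y.y.y.y") == 0 ? 0 : -1;
}
```

With the list `{ "x.x.x.x", "y.y.y.y" }` the sketch returns index 1, whereas aborting after the first error corresponds to only ever trying index 0.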
Note: This scenario works perfectly for plugins wherein async mode is disabled.
To Reproduce
Steps to reproduce the problem:
The issue can be replicated by following the listed steps-
Create a new Linux VM or just create a new container using `cr.fluentbit.io/fluent/fluent-bit:1.9.6-debug`
Inside the container or VM, change the DNS setting to include an invalid DNS Server as the primary DNS server.
Start fluent-bit with http output plugin to send logs to a remote server.
Using `http://www.google.com:443` is the easiest repro of the issue, as fluent-bit is unable to perform DNS resolution before anything else happens. The same issue applies to the Kinesis Streams and Kinesis Firehose plugins as well. We could not use an HTTP benchmarking server on localhost because the repro needs DNS resolution. Such a server can be set up on another machine and used here to replicate the issue.
Expected behavior
DNS resolution should happen with the secondary DNS Server before erroring out.
For the first example (www.google.com), the logs cannot be sent since it is not a valid destination, so we will get an error 405 from the Google server. Getting that response means that DNS resolution worked fine.
Screenshots
Your Environment
Additional context