fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

dns resolution for plugins using async mode does not consider all the DNS servers on the host #5862

Open rawahars opened 2 years ago

rawahars commented 2 years ago

Bug Report

Describe the bug

In fluent-bit, plugins can use async mode for better performance. This is the default setting, so many plugins use it.

Consider a scenario in which the host has multiple DNS servers [x.x.x.x, y.y.y.y], where the first/primary DNS server (x.x.x.x) does not resolve the endpoint but the secondary DNS server (y.y.y.y) resolves it correctly. In such cases, plugins using async mode attempt DNS resolution with the primary DNS server only and fail without ever trying the secondary DNS server.

The resulting error is:

[ warn] [net] getaddrinfo(host='www.google.com', err=12): Timeout while contacting DNS servers

This is in contrast to how DNS resolution should happen. The expected behaviour is for resolution to be attempted with every server in the DNS server list before reporting an error.

Note: This scenario works perfectly for plugins wherein async mode is disabled.
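
The expected behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `resolve_with_fallback`, `dns_query_fn`, `stub_query` and the `dns_status` values are hypothetical names made up for this sketch, not part of fluent-bit or c-ares, and the stub stands in for a real network query. It also simplifies by moving on to the next server after any failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch (not c-ares or fluent-bit values). */
enum dns_status { DNS_OK = 0, DNS_TIMEOUT = 1, DNS_NOT_FOUND = 2 };

/* Hypothetical callback: asks a single DNS server to resolve `host`. */
typedef enum dns_status (*dns_query_fn)(const char *server, const char *host);

/*
 * Try every configured server in order. Returns the index of the first
 * server that answers, or -1 if all of them fail. On failure, *last_error
 * (if not NULL) receives the status reported by the last server tried.
 */
int resolve_with_fallback(const char *const *servers, size_t count,
                          const char *host, dns_query_fn query,
                          enum dns_status *last_error)
{
    enum dns_status status = DNS_NOT_FOUND;
    size_t i;

    assert(servers != NULL && query != NULL);

    for (i = 0; i < count; i++) {
        status = query(servers[i], host);
        if (status == DNS_OK) {
            return (int) i;   /* stop at the first server that resolves */
        }
        /* otherwise fall through and try the next server in the list */
    }

    if (last_error != NULL) {
        *last_error = status;
    }
    return -1;
}

/* Stub standing in for a real query: only "y.y.y.y" answers. */
enum dns_status stub_query(const char *server, const char *host)
{
    (void) host;
    return strcmp(server, "y.y.y.y") == 0 ? DNS_OK : DNS_TIMEOUT;
}
```

With the server list [x.x.x.x, y.y.y.y] from the scenario above, the stub makes the first server time out and the second one answer, so the function returns index 1 instead of giving up after the first failure.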

To Reproduce

Expected behavior

DNS resolution should happen with the secondary DNS Server before erroring out.

For the first example (www.google.com), the logs cannot be sent since it is not a valid destination and therefore, we will get an error 405 from the Google server. This essentially means that the DNS resolution worked fine.

Screenshots

Your Environment

Additional context

rawahars commented 2 years ago

Based on my debugging, I concluded that in the failure case, i.e. when using async mode, we follow the codepath here. This calls flb_net_getaddrinfo, which in turn calls the c-ares library for DNS resolution.

In the happy path, when using plugins with async mode disabled, the codepath followed is this one. This calls the getaddrinfo API for DNS resolution.
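
For readers unfamiliar with the synchronous path, a blocking getaddrinfo lookup looks roughly like the sketch below. It is a minimal POSIX example, not fluent-bit's code: `sketch_resolve_sync` is a hypothetical helper name, and on Windows the equivalent call needs the Winsock headers and initialization instead.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Resolve host:port with the blocking system resolver.
 * Returns 0 on success, or a getaddrinfo error code that can be
 * turned into a message with gai_strerror().
 */
int sketch_resolve_sync(const char *host, const char *port)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 results */
    hints.ai_socktype = SOCK_STREAM;  /* TCP endpoints only */

    ret = getaddrinfo(host, port, &hints, &res);
    if (ret == 0) {
        freeaddrinfo(res);            /* result list is owned by the caller */
    }
    return ret;
}
```

The call blocks the calling thread until the system resolver returns. Which name servers are consulted is decided by the operating system's resolver, not by the caller.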

I also tried compiling the plugins with async mode disabled using upstream->flags &= ~(FLB_IO_ASYNC);. The plugins then work properly, with DNS resolution succeeding via the secondary DNS server.
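
The `&= ~(...)` idiom used above clears a single bit in a flags field and leaves the others untouched. A small standalone sketch of the same pattern follows; the `SKETCH_IO_*` values and `struct sketch_upstream` are hypothetical stand-ins chosen for illustration, and the real FLB_IO_* constants and upstream structure in fluent-bit may differ.

```c
/* Hypothetical flag bits for illustration only; not the real FLB_IO_* values. */
#define SKETCH_IO_TCP    1
#define SKETCH_IO_TLS    2
#define SKETCH_IO_ASYNC  8

/* Hypothetical stand-in for an upstream object carrying a flags field. */
struct sketch_upstream {
    int flags;
};

/* Clear only the async bit; all other bits keep their value. */
void sketch_disable_async(struct sketch_upstream *u)
{
    u->flags &= ~(SKETCH_IO_ASYNC);
}

/* Returns non-zero when the async bit is set. */
int sketch_is_async(const struct sketch_upstream *u)
{
    return (u->flags & SKETCH_IO_ASYNC) != 0;
}
```

Starting from flags of `SKETCH_IO_TCP | SKETCH_IO_ASYNC`, calling `sketch_disable_async` leaves only `SKETCH_IO_TCP` set.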

leonardo-albertovich commented 2 years ago

Yes. Due to some constraints in the original design, the async DNS client aborts after the first resolution error. A refactor is on the way, but we don't have an ETA yet. If you are interested in contributing, let me know; I wrote the code, so I can probably help if you have any questions.

PettitWesley commented 2 years ago

@leonardo-albertovich How come the setting I see here doesn't fix it: https://github.com/fluent/fluent-bit/blob/1.9/src/flb_upstream.c#L43

Does that not do anything?

leonardo-albertovich commented 2 years ago

That setting selects between c-ares, which is asynchronous, and the default system resolver, which is synchronous. If you can accept those DNS queries blocking, then it should be fine. If you need to minimize the overhead, you can use a non-authoritative local caching DNS server, as most modern distributions do (in the end that would behave the same for both async and sync, since the real query wouldn't be performed by fluent-bit).

PettitWesley commented 2 years ago

So we determined that setting this is a valid workaround for the problem:

net.dns.mode LEGACY

For Windows container users, all outputs will need this option set.

I am wondering if we could consider contributing a new environment variable that acts as a global setting for net.dns.mode, or a Service-level setting: something that provides a better user experience so that the config only needs to be set once.

We could hide this new setting or env var behind a new CMake flag that defaults to Off. For example, I could enable it in the AWS distro, but if the rest of the community does not want the setting, it won't be built by default.

leonardo-albertovich commented 2 years ago

net.dns.mode can be set in the [SERVICE] section and overridden on a per-plugin basis if desired. All of the DNS settings support that.

elafontaine commented 8 months ago

Why is this flagged as Windows? I had the same problem on Flatcar, so I'm assuming this applies to any OS, right?