network: Only the first nameserver in resolv.conf is ever used

dekimsey commented 2 years ago

Bug Report

Describe the bug

Given two nameserver entries in /etc/resolv.conf, fluent-bit never appears to use the second one. In particular, when the first nameserver is unavailable (connection refused), fluent-bit gives up and reports an error.

To Reproduce

Configure two DNS servers in resolv.conf (say 127.0.0.1 and a real value)

Shut down the first one

[2022/04/11 19:00:36] [ warn] [net] getaddrinfo(host='example.com', err=12): Timeout while contacting DNS servers
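For illustration, a resolv.conf matching the steps above might look like the sketch below. The second address is a placeholder from the documentation range (192.0.2.0/24), not a value from the report; substitute a resolver that is reachable from the host.

```text
# /etc/resolv.conf (illustrative sketch)
# First entry: local caching daemon (dnsmasq), stopped for the reproduction
nameserver 127.0.0.1
# Second entry: placeholder address, replace with a real upstream resolver
nameserver 192.0.2.53
```

With the daemon on 127.0.0.1 stopped, a resolver that honors the whole file would be expected to fall through to the second entry.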

Expected behavior

I would expect the application to fail over and attempt resolution against the second nameserver entry.

Screenshots

n/a

Your Environment

Version used: 1.8.15 and 1.9.0

Configuration:

[OUTPUT]
    Name        es
    Match       journal.*
    Host        elk.example.com
    Port        443
    Index       logs-journal
    Aws_Auth    On
    Aws_Region  us-east-1
    Tls         On

Environment name and version (e.g. Kubernetes? What version?): EC2 and ECS Fargate
Server type and version: n/a
Operating System and version: CentOS 7
Filters and plugins: n/a

This has been observed in both the td-agent-bit packages (1.9.0) and the aws/aws-for-fluent-bit images (1.8.15).

Additional context

We list 127.0.0.1 first because our instances run a local caching daemon (dnsmasq). Fluent-bit does not appear to fail over gracefully to the next DNS server if the primary resolver is offline or not yet started.

We've observed that v1.8.1 does not exhibit this behavior. I'm guessing this is the result of changes in v1.8.5, but I have not bisected the releases to verify; I have only skimmed the release notes.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

dekimsey commented 2 years ago

This is still an active issue.

leonardo-albertovich commented 2 years ago

I'll take a look, thanks for letting us know it's still a problem.

dekimsey commented 2 years ago

@leonardo-albertovich Now I feel bad! :/ lol.

When I submitted the report, I tested with the latest fluent-bit I could get my hands on (1.9.0 for my EC2 instances, 1.8.15 for my ECS Fargate containers) and observed the issue. I haven't seen anything since then that suggests it's been addressed, which is what I should have replied to the bot with.

If you want me to vet, I can try a test tomorrow with 1.9.5 and see if I can trigger the behavior.

leonardo-albertovich commented 2 years ago

Definitely, it'd be great if you can double check it since that'd save me a bit of time trying to reproduce it in case it's fixed.

Thanks for staying on top of it regardless!

dekimsey commented 2 years ago

Okay, using the latest 1.9.6 with the following command I still see the issue.

fluent-bit -i cpu -o http -p host=www.example.com -v

With a valid entry in the first position of my resolv.conf I get:

...
[2022/07/13 13:39:44] [ info] [output:http:http.0] www.example.com:80, HTTP status=200

When I place an invalid entry (or stop my local dnsmasq) in the first position:

...
[2022/07/13 13:39:24] [ warn] [net] getaddrinfo(host='www.example.com', err=12): Timeout while contacting DNS servers

leonardo-albertovich commented 2 years ago

Awesome, I'll take a look at it since I wrote that code. I think I have some ideas as to what could be the issue, but I need to validate them and try to come up with a workaround.

Thanks for your help, please ping me back in a week if I don't answer since I have a few things on my plate right now and it could slip through the cracks.

dekimsey commented 2 years ago

Hi @leonardo-albertovich, just a gentle ping on this issue as requested.

Thank you!

leonardo-albertovich commented 2 years ago

Thanks for staying on top of it @dekimsey. I still haven't been able to take a look at it. I know what the issue in the mechanism is and have a few ideas to make it better, but I haven't had time to get to it yet. If you are interested in working on it, feel free to message me on Slack and I can get you up to speed. Otherwise, I'll take a look at it as soon as possible.

dekimsey commented 2 years ago

Hi @leonardo-albertovich thank you for the offer but C is way outside my area of familiarity. I don't think I'd be effective. I'll wait patiently :)

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

dekimsey commented 1 year ago

This is still an ongoing issue.

leonardo-albertovich commented 1 year ago

You are right @dekimsey, that is a work in progress but sadly we weren't able to include it in 2.0. I've added the exempt-stale label to this issue so it doesn't go away until we release that improvement.

PettitWesley commented 1 year ago

This issue can be mitigated by setting the following networking option (documented at https://docs.fluentbit.io/manual/administration/networking):

net.dns.resolver LEGACY
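As a sketch of where this setting might go, assuming it is applied as a per-output networking property in the classic configuration format, the output section from the original report could be extended as follows. The other keys are copied from the report; the placement of the new line is an assumption and should be checked against the networking documentation for the version in use.

```text
[OUTPUT]
    Name              es
    Match             journal.*
    Host              elk.example.com
    Port              443
    Index             logs-journal
    Aws_Auth          On
    Aws_Region        us-east-1
    Tls               On
    # Assumed placement: switch this output to the legacy resolver
    net.dns.resolver  LEGACY
```

This is a mitigation only. It does not change how the default resolver handles an unreachable first nameserver.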

fluent / fluent-bit

network: Only the first nameserver in resolv.conf is ever used #5298
