Open dekimsey opened 2 years ago
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This is still an active issue.
I'll take a look, thanks for letting us know it's still a problem.
@leonardo-albertovich Now I feel bad! :/ lol.
When I submitted the report, I tested with the latest fluent-bit I could get my hands on (1.9.0 for my EC2 instances, 1.8.15 for my ECS Fargate containers) and observed the issue on both. I haven't seen anything since then that suggests it's been addressed, which is what I should have replied to the bot with.
If you want me to vet, I can try a test tomorrow with 1.9.5 and see if I can trigger the behavior.
Definitely, it'd be great if you can double check it since that'd save me a bit of time trying to reproduce it in case it's fixed.
Thanks for staying on top of it regardless!
Okay, using the latest 1.9.6 with the following command I still see the issue.
fluent-bit -i cpu -o http -p host=www.example.com -v
With a valid entry in the first position of my resolv.conf, I get:
...
[2022/07/13 13:39:44] [ info] [output:http:http.0] www.example.com:80, HTTP status=200
When I place an invalid entry (or stop my local dnsmasq) in the first position:
...
[2022/07/13 13:39:24] [ warn] [net] getaddrinfo(host='www.example.com', err=12): Timeout while contacting DNS servers
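For anyone trying to reproduce this, the resolver setup described above can be sketched roughly as follows. This is only an illustration: it writes to a scratch file rather than /etc/resolv.conf, the file name is made up, and 10.0.0.2 is a hypothetical stand-in for whatever reachable upstream resolver you have. Only 127.0.0.1 (the local dnsmasq) comes from this report.

```shell
# Sketch only: writes a scratch file, NOT /etc/resolv.conf.
# The path and the 10.0.0.2 address are assumptions for illustration.
sample="${TMPDIR:-/tmp}/resolv.conf.sample"

cat > "$sample" <<'EOF'
# First entry: local caching resolver (dnsmasq), stopped for the test.
nameserver 127.0.0.1
# Second entry: hypothetical upstream resolver that is still reachable.
nameserver 10.0.0.2
EOF

# Show the nameserver entries in the order they are listed.
grep '^nameserver' "$sample"
```

With the first entry unreachable, the expectation is that resolution falls through to the second entry instead of timing out.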
Awesome, I'll take a look at it since I wrote that code. I think I have some ideas as to what the issue could be, but I need to validate them and try to come up with a workaround.
Thanks for your help, please ping me back in a week if I don't answer since I have a few things on my plate right now and it could slip through the cracks.
Hi @leonardo-albertovich, just a gentle ping on this issue as requested.
Thank you!
Thanks for staying on top of it @dekimsey. I still haven't been able to take a look at it. I know what the issue in the mechanism is and have a few ideas to make it better, but I haven't had time to get to it yet. If you are interested in working on it, feel free to message me on Slack and I can get you up to speed. Otherwise, I'll take a look at it as soon as possible.
Hi @leonardo-albertovich thank you for the offer but C is way outside my area of familiarity. I don't think I'd be effective. I'll wait patiently :)
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This is still an ongoing issue.
You are right @dekimsey, that is a work in progress but sadly we weren't able to include it in 2.0. I've added the exempt-stale label to this issue so it doesn't go away until we release that improvement.
This issue can be mitigated by setting net.dns.resolver to LEGACY (see https://docs.fluentbit.io/manual/administration/networking).
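As a rough sketch of how that mitigation might be applied to the reproduction command from earlier in the thread, the snippet below writes a minimal classic-format config with the setting on the http output. The file path is an assumption, and I'm assuming the property is set per output plugin as described on the networking page. Adjust it to your own pipeline.

```shell
# Sketch only: the config path is a made-up scratch location.
conf="${TMPDIR:-/tmp}/fluent-bit-legacy-dns.conf"

cat > "$conf" <<'EOF'
[INPUT]
    Name cpu

# Same http output as the reproduction command, with the resolver
# type switched to LEGACY as suggested in this thread.
[OUTPUT]
    Name             http
    Match            *
    Host             www.example.com
    net.dns.resolver LEGACY
EOF

# Print the command to run rather than starting a long-lived process here.
echo "fluent-bit -c $conf -v"
```

Running the printed command with the first nameserver down should show whether resolution still times out.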
Bug Report
Describe the bug Given two nameserver records in /etc/resolv.conf, fluent-bit doesn't appear to ever use the second record. In particular, when the first record is unavailable (connection refused), fluent-bit simply gives up and errors.
To Reproduce
Expected behavior I would expect the application to fail-over and attempt resolution against the second nameserver entry.
Screenshots n/a
Your Environment
This has been observed in both the td-agent-bit packages (1.9.0) and the aws/aws-for-fluent-bit images (1.8.15).
Additional context We set 127.0.0.1 as the first nameserver because our instances have local caching daemons running (dnsmasq). Fluent-bit does not appear to gracefully fail over DNS if the primary resolver is offline or not yet started.
We've observed that v1.8.1 does not exhibit this behavior. I'm guessing this is the result of changes in v1.8.5, but I have not bisected the releases to verify; I have only skimmed the release notes.