jblang opened this issue 2 years ago
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
I'm a bit disappointed that this has not been triaged for over 90 days. It's a pretty serious issue with the potential to knock out DNS for an entire k8s cluster. It caused a production outage for us.
I removed the stale label. I recall there are a few DNS updates upcoming in 2.0 - @leonardo-albertovich, are there any settings you know of?
@jblang would you be able to contribute a retry mechanism?
I'll take a look when I get a chance and see what I can do.
I can confirm that this is still an issue in FB 2.0.6. We had another outage after upgrading from 1.9.5. We also changed Retry_Limit from no_limit to 3, but it didn't have any effect: retries back off exponentially and chunks are dropped after 3 tries, but the initial send of each new chunk is still attempted immediately, even when every attempt to send previous chunks has failed.
The DNS storm is really just a side effect of this lack of global backoff. However, I'm wondering whether enabling keepalive might reduce how often connections are established, and therefore the number of DNS requests.
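For anyone who wants to try the same thing, this is roughly the shape of the output section in question; a minimal sketch only, with a placeholder Host/Port for the Vector endpoint, showing a bounded Retry_Limit together with the keepalive settings (net.keepalive, net.keepalive_idle_timeout) that might cut down on new connections and therefore on lookups:

```
[OUTPUT]
    # Sketch only: Host/Port are placeholders for the Vector fluent/forward endpoint
    Name                         forward
    Match                        *
    Host                         vector.logging.svc.cluster.local
    Port                         24224
    # Bounded retries; per this issue, this only governs retries of existing
    # chunks, not the first send attempt of newly created chunks
    Retry_Limit                  3
    # Reuse established connections so fewer DNS lookups are triggered
    net.keepalive                on
    net.keepalive_idle_timeout   30
```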
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
I'm sorry, I totally missed this. For the moment, the only workaround would be setting dns.resolver to LEGACY in the SERVICE section, or net.dns.resolver to LEGACY in each output plugin if setting it globally would interfere with other plugins.
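To make that concrete, here is a minimal sketch of both variants (the output plugin name here is just an example):

```
# Globally, in the SERVICE section:
[SERVICE]
    dns.resolver      LEGACY

# ...or only on the affected output, if the global setting would
# interfere with other plugins:
[OUTPUT]
    Name              forward
    Match             *
    net.dns.resolver  LEGACY
```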
I marked this as exempt-stale so it stays open and doesn't slip through the cracks again.
Thanks for the suggestion. I will attempt to reproduce in a dev cluster by intentionally deploying fluent-bit without Vector deployed. If I'm able to reproduce it, I'll see whether setting dns.resolver to LEGACY makes any difference. Can you explain how you expect this setting to work around the issue? I saw #6892, which sounds similar, and you recommended the same thing there. On that issue, though, they reported that downgrading to 1.9 resolved it, whereas I encountered the issue originally on 1.9.5 and then again on 2.0.6, so if that setting just reverts the behavior to what 1.9 does, I'm not hopeful that it will fix anything.
IIRC the DNS client was changed in 1.9.5, and it's been a bumpy ride so far... Totally my fault as well.
I think that because each connection attempt sets up its own DNS client context, there is no way for the system to be aware of the failure. That wouldn't happen when using the system resolver: even if fluent-bit kept querying the same record over and over, I'm fairly sure the system would handle it much more gracefully than the current internal DNS client does.
Improving the DNS client is something we are planning on, but we're not quite there yet.
Please let me know if there is anything that's not clear; I'd be glad to share more information.
Thanks for the explanation. I will let you know the outcome of my testing when I get a chance to do it.
Unfortunately setting the DNS resolver to LEGACY did not do the trick. You can see the result of our testing here. The earlier spikes are from testing with the default DNS resolver, and later spikes are with the legacy resolver. They're about the same...
You can also see the impact that the DNS flood has on response latency...
It was somewhat mild this time but on a heavily loaded cluster with lots of DNS entries it can be enough to tip over CoreDNS with an OOM.
That's interesting. Which version are you running, BTW? I need to verify that the DNS resolver setting is being honored, because someone mentioned in a different issue that they thought it was not the case.
After re-reading your original issue, I remember what you are talking about, and I know for a fact that this "delay" is not something that's part of the design at the moment. I'm fairly sure a contributor submitted a circuit breaker implementation a long time ago which should, to some degree, cover your needs. I don't think I was part of that process and I can't make any promises about it at the moment, but you might be able to give his branch a try and help give it some traction.
What do you think?
We hit this bug today (with a k8s daemonset, Kinesis Firehose output, and Retry_Limit 5). We found we were getting a storm of fluent-bit's own log output saying it could not send logs to Kinesis, over and over, but also a DNS explosion that even 38 CoreDNS instances couldn't keep up with (normally we need 3). Once DNS was overloaded, the lookup of the Kinesis endpoint would fail and we were in a vicious circle. Looking at the code to see if there is a way to prevent a recurrence, I see that calls to flb_net_getaddrinfo: https://github.com/fluent/fluent-bit/blob/9a173d0b32d288c022ec751510dd929edde04c95/src/flb_network.c#L1024 create a lookup context: https://github.com/fluent/fluent-bit/blob/9a173d0b32d288c022ec751510dd929edde04c95/src/flb_network.c#L1057 which calls ares_init_options: https://github.com/fluent/fluent-bit/blob/9a173d0b32d288c022ec751510dd929edde04c95/src/flb_network.c#L991 ... thus setting up an ares channel for every ~single lookup~ (edited: reading more code, looks like one per thread)?
The c-ares docs say to only call that once in your application: https://c-ares.org/ares_init.html - I could be way off here but that seems like it would bypass whatever c-ares' normal request queueing/deduplication mechanisms are and could be contributing to the problem?
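To illustrate what the c-ares docs seem to be suggesting, here is a minimal sketch (not fluent-bit code) of a single process-wide channel that is initialized once and reused for every lookup. It leaves out the event-loop integration that actually drives the queries, and whether one channel per process or per worker thread is appropriate depends on the c-ares version, since older channels are not safe for concurrent use:

```c
/* Sketch only, not fluent-bit code: one shared c-ares channel, created once
 * and reused for every lookup. Event-loop processing (ares_process and
 * friends) and error handling are omitted for brevity. */
#include <ares.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

static ares_channel g_channel;
static pthread_once_t g_channel_once = PTHREAD_ONCE_INIT;

static void channel_init_once(void)
{
    struct ares_options opts;
    int optmask;

    ares_library_init(ARES_LIB_INIT_ALL);

    memset(&opts, 0, sizeof(opts));
    opts.tries = 2;
    optmask = ARES_OPT_TRIES;

    ares_init_options(&g_channel, &opts, optmask);
}

static void lookup_done(void *arg, int status, int timeouts,
                        struct ares_addrinfo *result)
{
    (void) arg;
    (void) timeouts;

    if (status != ARES_SUCCESS) {
        fprintf(stderr, "lookup failed: %s\n", ares_strerror(status));
        return;
    }

    /* ... hand result->nodes to the connection logic ... */
    ares_freeaddrinfo(result);
}

/* Every caller shares g_channel instead of creating a new channel per lookup */
void shared_lookup(const char *host, const char *service)
{
    struct ares_addrinfo_hints hints;

    pthread_once(&g_channel_once, channel_init_once);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    ares_getaddrinfo(g_channel, host, service, &hints, lookup_done, NULL);
}
```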
For anyone else hitting the form of this bug that I described: we got around it by setting Exclude_Path in the Tail plugin to exclude the fluent-bit logs, and by changing storage.path to a new location so that the mass of error messages from fluent-bit was skipped (at the cost of losing a window of our application logs). That was before we found this bug report. The fact that this worked at all suggests the DNS lookups are somehow proportional to log volume, rather than there being a shared view of which requests succeed or fail and a constant rate of DNS requests as the TTL expires. That's why I'm reading here; we're mitigated for now, but if we get another high-volume logger it looks like it could happen again.
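For reference, that workaround looks roughly like this; a minimal sketch with placeholder paths, not our exact configuration:

```
[SERVICE]
    # Point filesystem buffering at a fresh directory so the old backlog of
    # error-message chunks is left behind (placeholder path)
    storage.path    /var/log/flb-storage-new/

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    # Skip fluent-bit's own log files so its error spam isn't re-ingested
    Exclude_Path    /var/log/containers/*fluent-bit*.log
```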
Just reading this again, I noticed another bug in the ares init call: https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c#L1003-L1011
The code sets optmask = ARES_OPT_FLAGS; but that means only the FLAGS option will be processed. The next line sets opts.tries to 2, but that will have no effect. It should be optmask = ARES_OPT_FLAGS | ARES_OPT_TRIES; right? See the code in ares here:
https://github.com/c-ares/c-ares/blob/e4a4346596c62621704371270220b567b0ddd83d/src/lib/ares_options.c#L310-L316
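In other words, something like the following; a sketch of the suggested change, not the exact flb_network.c code, and the specific flag value here is just a placeholder:

```c
/* Sketch of the suggested fix: opts.tries is only honored when
 * ARES_OPT_TRIES is also set in the mask passed to ares_init_options(). */
#include <ares.h>
#include <string.h>

static int init_channel_with_tries(ares_channel *channel)
{
    struct ares_options opts;
    int optmask;

    memset(&opts, 0, sizeof(opts));

    opts.flags = ARES_FLAG_STAYOPEN;   /* placeholder for whatever flags fluent-bit sets */
    opts.tries = 2;                    /* meant to retry each query twice */

    /* Both bits are needed; ARES_OPT_FLAGS alone makes c-ares ignore opts.tries */
    optmask = ARES_OPT_FLAGS | ARES_OPT_TRIES;

    return ares_init_options(channel, &opts, optmask);
}
```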
I looked again at this call because I wondered if the use of multiple ares_channels was bypassing the query cache (since that is per-channel). And I see the query cache is not enabled. It's a choice; the legacy call to getaddrinfo also looks like it doesn't cache. But caching would likely help with this bug.
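For completeness, enabling the cache would look roughly like this; a hedged sketch that assumes a c-ares version recent enough to provide ARES_OPT_QUERY_CACHE and the qcache_max_ttl option (added around c-ares 1.23), and it would only help if the channel were shared rather than created per lookup, since the cache is per-channel:

```c
/* Sketch only: per-channel query cache, assuming ARES_OPT_QUERY_CACHE and
 * qcache_max_ttl exist in the linked c-ares version. */
#include <ares.h>
#include <string.h>

static int init_caching_channel(ares_channel *channel)
{
    struct ares_options opts;
    int optmask;

    memset(&opts, 0, sizeof(opts));

    opts.tries = 2;
    opts.qcache_max_ttl = 30;   /* cap cached answers at 30 seconds */

    optmask = ARES_OPT_TRIES | ARES_OPT_QUERY_CACHE;

    return ares_init_options(channel, &opts, optmask);
}
```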
Bug Report
Describe the bug
Fluent Bit appears to apply the retry delay to each individual chunk, but it keeps trying to send new chunks from the logs it is tailing without any delay. The Retry_Limit and scheduler settings appear to have no effect on this behavior, because technically these are not retries, but rather continued attempts to send new chunks even when the destination has been consistently unavailable for previous chunks.
In our case, this caused a DoS on our CoreDNS servers. We have Fluent Bit set up as a log tailer on our k8s nodes, sending to Vector's Fluent source. If Fluent Bit is deployed before Vector, the Vector hostname will not exist in DNS yet, and Fluent Bit will just keep spamming DNS requests for each new chunk. This caused a total of 1.2M DNS requests in less than 15 minutes and caused our CoreDNS pods to OOM. DNS for the entire cluster was unavailable, resulting in downtime for other customer-facing applications.
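For context, these are the retry/backoff knobs that were in play; a minimal sketch with placeholder values, and, as described above, they only shape retries of chunks that have already failed, not the first attempt for each new chunk:

```
[SERVICE]
    # Retry scheduler backoff (base/cap of the exponential delay, in seconds)
    scheduler.base   5
    scheduler.cap    300

[OUTPUT]
    Name         forward
    Match        *
    # Drop a chunk after 3 failed delivery attempts
    Retry_Limit  3
```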
This is aggregated across all 189 fluent pods in our k8s cluster:
Logs start out looking like this:
Then after a few seconds we start getting timeouts as the DNS server gets overloaded:
Finally after Vector is deployed and CoreDNS is restarted, chunks start to succeed:
To Reproduce
Expected behavior
Fluent Bit should apply a retry delay to all DNS requests globally, not just on a per-chunk basis.
Your Environment