fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Backoff timer is not applied to chunks globally #5613

Open jblang opened 2 years ago

jblang commented 2 years ago

Bug Report

Describe the bug

Fluent Bit appears to apply the retry delay to each individual chunk, but it keeps trying to send new chunks from the logs it is tailing without any delay. The Retry_Limit and scheduler settings appear to have no effect on this behavior because, technically, these are not retries but continued attempts to send new chunks, even when the destination has been consistently unavailable for previous chunks.
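
For reference, these are the settings that seem to have no effect on this behavior (a sketch with illustrative values, not our exact production config):

    [SERVICE]
        # these only shape the exponential backoff applied to *retries*
        # of chunks that have already failed at least once
        scheduler.base    5
        scheduler.cap     300

    [OUTPUT]
        Name              forward
        Match             *
        # caps retries per failed chunk; brand-new chunks are still
        # flushed (and resolved) immediately
        Retry_Limit       3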

In our case, this caused a DoS on our CoreDNS server. We have Fluent Bit set up as a log tailer on our k8s nodes, sending to Vector's Fluent source. If Fluent Bit is deployed before Vector, the Vector hostname will not exist in DNS yet, and Fluent Bit will just keep spamming DNS requests for each new chunk. This produced a total of 1.2M DNS requests in less than 15 minutes and caused our CoreDNS pods to OOM. DNS for the entire cluster was unavailable, resulting in downtime for other customer-facing applications.

This is aggregated across all 189 fluent pods in our k8s cluster:

$ grep "Timeout while contacting DNS servers" fluent-bit-567850229182.log | wc -l
1275364
$ grep "Timeout while contacting DNS servers" fluent-bit-567850229182.log | awk '{print $2}' | cut -d: -f1,2 | sort | uniq -c
      1 19:47
  41112 19:48
  78822 19:49
  88603 19:50
 101609 19:51
 104123 19:52
 113327 19:53
  95675 19:54
  52299 19:55
 125746 19:56
 126956 19:57
 128904 19:58
 131091 19:59
  87096 20:00

Logs start out looking like this:

[2022/06/21 19:47:33] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=4): Domain name not found
[2022/06/21 19:47:33] [error] [output:forward:forward.0] no upstream connections available
[2022/06/21 19:47:33] [ warn] [engine] failed to flush chunk '1-1655840852.784688832.flb', retry in 8 seconds: task_id=0, input=tail.1 > output=forward.0 (out_id=0)
[2022/06/21 19:47:34] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=4): Domain name not found
[2022/06/21 19:47:34] [error] [output:forward:forward.0] no upstream connections available
[2022/06/21 19:47:34] [ warn] [engine] failed to flush chunk '1-1655840853.538056562.flb', retry in 10 seconds: task_id=2, input=systemd.2 > output=forward.0 (out_id=0)
[2022/06/21 19:47:34] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=4): Domain name not found
[2022/06/21 19:47:34] [error] [output:forward:forward.0] no upstream connections available

Then after a few seconds we start getting timeouts as the DNS server gets overloaded:

[2022/06/21 19:47:59] [ warn] [engine] failed to flush chunk '1-1655840868.200737091.flb', retry in 19 seconds: task_id=25, input=tail.1 > output=forward.0 (out_id=0)
[2022/06/21 19:48:08] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=12): Timeout while contacting DNS servers
[2022/06/21 19:48:08] [error] [output:forward:forward.0] no upstream connections available
[2022/06/21 19:48:08] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=12): Timeout while contacting DNS servers
[2022/06/21 19:48:08] [error] [output:forward:forward.0] no upstream connections available
[2022/06/21 19:48:08] [ warn] [engine] failed to flush chunk '1-1655840871.198934080.flb', retry in 7 seconds: task_id=28, input=tail.0 > output=forward.0 (out_id=0)
[2022/06/21 19:48:08] [ warn] [engine] failed to flush chunk '1-1655840872.200573639.flb', retry in 15 seconds: task_id=32, input=tail.1 > output=forward.0 (out_id=0)
[2022/06/21 19:48:08] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=12): Timeout while contacting DNS servers
[2022/06/21 19:48:08] [error] [output:forward:forward.0] no upstream connections available
[2022/06/21 19:48:08] [ warn] [engine] failed to flush chunk '1-1655840878.200794762.flb', retry in 6 seconds: task_id=40, input=tail.1 > output=forward.0 (out_id=0)
[2022/06/21 19:48:08] [ warn] [net] getaddrinfo(host='vector.platform-observability', err=11): Could not contact DNS servers
[2022/06/21 19:48:08] [error] [output:forward:forward.0] no upstream connections available

Finally after Vector is deployed and CoreDNS is restarted, chunks start to succeed:

[2022/06/21 20:00:49] [ info] [engine] flush chunk '1-1655841607.404448798.flb' succeeded at retry 3: task_id=1147, input=tail.0 > output=forward.0 (out_id=0)
[2022/06/21 20:00:49] [ info] [engine] flush chunk '1-1655841632.662283630.flb' succeeded at retry 2: task_id=1190, input=systemd.2 > output=forward.0 (out_id=0)
[2022/06/21 20:00:49] [ info] [engine] flush chunk '1-1655840896.109007641.flb' succeeded at retry 7: task_id=61, input=tail.1 > output=forward.0 (out_id=0)
[2022/06/21 20:00:49] [ info] [engine] flush chunk '1-1655840866.200703985.flb' succeeded at retry 7: task_id=22, input=tail.1 > output=forward.0 (out_id=0)

To Reproduce

Expected behavior

Fluent Bit should apply a retry delay to all DNS requests globally, not just on a per-chunk basis.

Your Environment

    [OUTPUT]
        Name          forward
        Match         *
        # update these to point to your vector instance
        Host          {{ .Values.vector.host }} 
        Port          {{ .Values.vector.port }} 
        Retry_Limit   3
github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

jblang commented 2 years ago

I'm a bit disappointed that this has not been triaged for over 90 days. It's a pretty serious issue with the potential to knock out DNS for an entire k8s cluster. It caused a production outage for us.

agup006 commented 2 years ago

I removed the stale label. I recall a few DNS updates coming in 2.0; @leonardo-albertovich, any settings you know of?

@jblang would you be able to contribute a retry mechanism?

jblang commented 2 years ago

I'll take a look when I get a chance and see what I can do.

jblang commented 1 year ago

I can confirm that this is still an issue in FB 2.0.6. We had another outage after upgrading from 1.9.5. We have also changed Retry_Limit from no_limit to 3, but it didn't have any effect. Retries back off exponentially and chunks are dropped after 3 tries, but the initial send of new chunks is attempted immediately, even if all attempts to send previous chunks have failed.

The fact that this lack of global backoff triggers a DNS storm is really a side effect. However, I'm wondering if enabling keepalive might reduce the number of times connections are established, and therefore the number of DNS requests.
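
Something like this on the output is what I have in mind (a sketch; as far as I know keepalive is already on by default in recent versions, so this would mostly be about reusing connections longer rather than anything new):

    [OUTPUT]
        Name                         forward
        Match                        *
        # existing Host/Port/Retry_Limit settings, plus:
        net.keepalive                on
        net.keepalive_idle_timeout   30
        net.keepalive_max_recycle    2000
        # fewer new connections should also mean fewer getaddrinfo calls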

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

leonardo-albertovich commented 1 year ago

I'm sorry, I totally missed this. For the moment, the only workaround would be setting dns.resolver to LEGACY in the SERVICE section or net.dns.resolver to LEGACY in each output plugin if setting it globally would interfere with other plugins.
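
For anyone landing here, that would look something like this (a sketch based on the option names above; use one or the other depending on whether the global setting interferes with other plugins):

    [SERVICE]
        # global: use the system resolver instead of the internal async client
        dns.resolver        LEGACY

or, per output plugin:

    [OUTPUT]
        Name                forward
        Match               *
        # per-plugin override if the global setting can't be used
        net.dns.resolver    LEGACY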

I marked this as exempt-stale so we can leave it open and it doesn't slip through the cracks again.

jblang commented 1 year ago

Thanks for the suggestion. I will attempt to reproduce this in a dev cluster by intentionally deploying fluent-bit without Vector deployed. If I'm able to reproduce it, I'll see if setting dns.resolver to LEGACY makes any difference. Can you explain how you expect this setting to work around the issue? I saw #6892, which sounds similar, and you recommended the same thing there. However, on that issue they reported that downgrading to 1.9 resolved it, whereas I originally hit the issue on 1.9.5 and then again on 2.0.6, so if that setting just reverts the behavior to what 1.9 does, I'm not hopeful that it will fix anything.

leonardo-albertovich commented 1 year ago

IIRC the DNS client was changed in 1.9.5, and it's been a bumpy ride so far... Totally my fault as well.

I think that because each connection attempt sets up its own DNS client context, there is no way for the system to be aware of previous failures. That wouldn't happen when using the system resolver: even if fluent-bit kept querying the same record over and over, I'm fairly sure the system would handle it much more gracefully than the current internal DNS client does.

Improving the DNS client is something we are planning, but we're not quite there yet.

Please let me know if anything is unclear; I'd be glad to share more information.

jblang commented 1 year ago

Thanks for the explanation. I will let you know the outcome of my testing when I get a chance to do it.

jblang commented 1 year ago

Unfortunately, setting the DNS resolver to LEGACY did not do the trick. You can see the results of our testing below: the earlier spikes are from testing with the default DNS resolver, and the later spikes are with the legacy resolver. They're about the same...

image

You can also see the impact that the DNS flood has on response latency...

image

It was somewhat mild this time but on a heavily loaded cluster with lots of DNS entries it can be enough to tip over CoreDNS with an OOM.

leonardo-albertovich commented 1 year ago

That's interesting. Which version are you running, by the way? I need to verify that the DNS resolver setting is being honored, because someone mentioned in a different issue that they thought it was not.

After re-reading your original issue, I remember what you are talking about, and I know for a fact that this "delay" is not part of the design at the moment. I'm fairly sure a contributor submitted a circuit breaker implementation a long time ago, which should cover your needs to some degree. I wasn't part of that process and can't make any promises about it at the moment, but you might be able to give his branch a try and help give it some traction.
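
To clarify what I mean by a circuit breaker: conceptually, it tracks delivery failures to a destination globally and temporarily short-circuits new attempts while it is "open", which is essentially the global backoff this issue asks for. A minimal sketch of the concept (an illustration only, not the contributor's implementation):

#include <stdbool.h>
#include <time.h>

/* Illustrative circuit-breaker state, shared by all flushes to one output. */
struct breaker {
    int    consecutive_failures;
    int    failure_threshold;   /* open the circuit after this many failures */
    int    cooldown_seconds;    /* how long to stay open */
    time_t open_until;          /* while now < open_until, skip new attempts */
};

bool breaker_allows_attempt(struct breaker *b)
{
    return time(NULL) >= b->open_until;
}

void breaker_record_failure(struct breaker *b)
{
    b->consecutive_failures++;
    if (b->consecutive_failures >= b->failure_threshold) {
        /* stop hammering DNS and the destination for a cooldown period */
        b->open_until = time(NULL) + b->cooldown_seconds;
    }
}

void breaker_record_success(struct breaker *b)
{
    b->consecutive_failures = 0;
    b->open_until = 0;
}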

What do you think?

bazzargh commented 8 months ago

We hit this bug today (with a k8s daemonset, Kinesis Firehose output, and Retry_Limit 5). We found we were getting a storm of fluent-bit output, logging that it could not send logs to Kinesis and then logging again, but also a DNS explosion that 38 CoreDNS instances couldn't keep up with (normally we need 3). Once DNS was overloaded, the lookup of the Kinesis endpoint would fail, and we were in a vicious circle. Looking at the code to see if there is a way to prevent a recurrence, I see that calls to flb_net_getaddrinfo: https://github.com/fluent/fluent-bit/blob/9a173d0b32d288c022ec751510dd929edde04c95/src/flb_network.c#L1024 create a lookup context: https://github.com/fluent/fluent-bit/blob/9a173d0b32d288c022ec751510dd929edde04c95/src/flb_network.c#L1057 which calls ares_init_options: https://github.com/fluent/fluent-bit/blob/9a173d0b32d288c022ec751510dd929edde04c95/src/flb_network.c#L991 ... thus setting up an ares channel for every ~single lookup~ (edited: reading more code, it looks like one per thread)?

The c-ares docs say to only call that once in your application: https://c-ares.org/ares_init.html - I could be way off here, but that seems like it would bypass whatever c-ares' normal request queueing/deduplication mechanisms are, and it could be contributing to the problem?
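
For contrast, this is roughly what the "initialize once, share the channel" pattern from those docs looks like as a standalone program (a sketch against the plain c-ares API, not fluent-bit code; the hostname is just borrowed from the original report):

/*
 * Standalone sketch: one shared c-ares channel per process, as the
 * ares_init docs recommend.  Build with: cc shared_channel.c -lcares
 */
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <ares.h>

static void on_result(void *arg, int status, int timeouts,
                      struct ares_addrinfo *result)
{
    (void) arg;
    (void) timeouts;

    if (status != ARES_SUCCESS) {
        fprintf(stderr, "lookup failed: %s\n", ares_strerror(status));
        return;
    }
    printf("resolved ok\n");
    ares_freeaddrinfo(result);
}

int main(void)
{
    ares_channel channel;
    struct ares_options opts;
    struct ares_addrinfo_hints hints;
    int optmask = 0;

    ares_library_init(ARES_LIB_INIT_ALL);

    memset(&opts, 0, sizeof(opts));
    opts.tries = 2;
    optmask |= ARES_OPT_TRIES;

    /* One channel for the whole process: every lookup reuses it, so
     * c-ares can apply its timeout/retry policy consistently. */
    if (ares_init_options(&channel, &opts, optmask) != ARES_SUCCESS) {
        return 1;
    }

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;

    ares_getaddrinfo(channel, "vector.platform-observability", NULL,
                     &hints, on_result, NULL);

    /* Drive the shared channel until all pending queries complete. */
    for (;;) {
        fd_set readers, writers;
        struct timeval tv, *tvp;
        int nfds;

        FD_ZERO(&readers);
        FD_ZERO(&writers);
        nfds = ares_fds(channel, &readers, &writers);
        if (nfds == 0) {
            break;
        }
        tvp = ares_timeout(channel, NULL, &tv);
        select(nfds, &readers, &writers, NULL, tvp);
        ares_process(channel, &readers, &writers);
    }

    ares_destroy(channel);
    ares_library_cleanup();
    return 0;
}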

For anyone else hitting the form of this bug that I described, we got around it by setting Exclude_Path in the Tail plugin to exclude the fluent-bit logs, and by changing storage.path to a new location so that the backlog of error messages from fluent-bit was skipped (at the cost of losing a window of our application logs). That was before we found this bug report; that this worked at all suggests the DNS lookups are somehow proportional to the log volume, rather than there being a shared view of which requests succeed or fail and a constant rate of DNS requests as the TTL expires. That's why I'm reading here; we're mitigated for now, but if we get another high-volume logger it looks like it could happen again.
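
Concretely, the mitigation looked something like this (a sketch; the paths are placeholders, not our real ones):

    [SERVICE]
        # point the filesystem buffer somewhere new so the existing backlog
        # of fluent-bit error chunks is not replayed (placeholder path)
        storage.path    /var/log/flb-storage-new/

    [INPUT]
        Name            tail
        Path            /var/log/containers/*.log
        # stop tailing fluent-bit's own logs so its error output can't
        # feed back into the pipeline
        Exclude_Path    /var/log/containers/*fluent-bit*.log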

bewinsnw commented 8 months ago

Just reading this again, I noticed another bug in the ares init call: https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c#L1003-L1011

optmask = ARES_OPT_FLAGS; means only the FLAGS option will be processed; the next line sets opts.tries to 2, but that will have no effect. It should be optmask = ARES_OPT_FLAGS | ARES_OPT_TRIES;, right? See the code in c-ares here: https://github.com/c-ares/c-ares/blob/e4a4346596c62621704371270220b567b0ddd83d/src/lib/ares_options.c#L310-L316
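
In code, the suggested correction is just this (a sketch of a hypothetical helper, not an actual flb_network.c patch; the flag value is a placeholder for whatever the original code sets):

#include <string.h>
#include <ares.h>

/* A field set in opts only takes effect if its matching bit is set in
 * optmask, so ARES_OPT_TRIES has to be OR'ed in as well. */
int init_channel_with_tries(ares_channel *channel)
{
    struct ares_options opts;
    int optmask = 0;

    memset(&opts, 0, sizeof(opts));

    opts.flags = ARES_FLAG_NOCHECKRESP;   /* placeholder flags */
    optmask |= ARES_OPT_FLAGS;

    opts.tries = 2;                       /* now actually honored */
    optmask |= ARES_OPT_TRIES;

    return ares_init_options(channel, &opts, optmask);
}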

I looked at this call again because I wondered if the use of multiple ares channels was bypassing the query cache (since the cache is per-channel), and I see the query cache is not enabled at all. That's a choice; the legacy call to getaddrinfo also looks like it doesn't cache. But caching would likely help with this bug.