rockb1017 closed this issue 3 years ago
I looked into this further and I think it happens when the connection is terminated by an external component (i.e. a load balancer, https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html).
Right now we don't have a rescue clause for the ConnectionError raised by the HTTP library during parsing.
Pull requests are welcome
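To illustrate the missing rescue clause, here is a minimal, hypothetical sketch (the error class and method names are stand-ins, not the plugin's actual code): an externally closed connection raised mid-parse would be treated as retryable instead of crashing the watch thread.

```ruby
# Hypothetical stand-in for the error the HTTP library raises when a load
# balancer or other middlebox closes an idle watch connection mid-stream.
class ConnectionError < StandardError; end

# Sketch of a watch loop that rescues the connection error instead of
# letting it bubble up and eventually trigger a Fluentd restart.
def process_watch_stream(stream)
  stream.each { |event| yield event }
  :done
rescue ConnectionError
  # The peer closed the connection; signal the caller to reconnect.
  :reconnect
end
```

With a stream that raises mid-iteration, the method returns `:reconnect` rather than propagating the error.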
I'm also seeing this in Azure Kubernetes clusters:
Unexpected error Exception encountered parsing namespace watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.
/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:70:in `rescue in set_up_namespace_thread'
/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:39:in `set_up_namespace_thread'
/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279:in `block in configure'
Hey folks, would you accept a PR that would disable triggering the Fluentd restart, for example if `watch_retry_max_times` is set to -1?
EDIT: Haha sorry, removed the suggested code change as it didn't really make much sense 😅 Probably an additional `if` statement would be needed around the line.
Created a draft PR to show my idea: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/266
The PR is still missing the same change for pods, and also some tests. But I just want to validate the idea first; what do you think?
I would need to look at the code more extensively, but is there any reason to disallow nil, implying it's unset, and to move on, in lieu of -1?
Oh sure, I guess nil instead of -1 is just as good or better. I'll try to prepare the PR then, and hopefully this will be enough to fix this issue.
I had a chat with folks on my team, and the outcome is that maybe it's not such a good idea to disable the Fluentd restarts altogether: what if there is an actual problem connecting to the Kubernetes API?
Instead, maybe it's a better idea to tune the retry-count resetting logic? Currently, the retry count is reset whenever an update comes from the watch event, as added in this PR: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/219. I believe the problem occurs when there are no updates from the API server for a long time. New watches are created and terminated, and if this goes on for long enough, a Fluentd restart is triggered.
What if we reset the retry count not only on updates from the watch, but also on a successful connection to the API server?
Please let me know what you think.
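A minimal sketch of that idea (class and method names are assumptions, not the plugin's actual code): only *consecutive* failed connections count toward `watch_retry_max_times`, because the counter resets on every successful connect.

```ruby
# Hypothetical sketch of retry bookkeeping that resets on success, so a
# long-running but periodically reconnecting watch never accumulates
# enough failures to trigger a Fluentd restart.
class WatchLoop
  attr_reader :retries

  def initialize(max_retries)
    @max_retries = max_retries
    @retries = 0
  end

  # `connect` is a callable standing in for opening the Kubernetes watch.
  # Returns true on success, false on a retryable failure.
  def attempt(connect)
    connect.call
    @retries = 0  # success: only consecutive failures accumulate
    true
  rescue StandardError
    @retries += 1
    if @retries > @max_retries
      raise "Retried #{@retries} times yet still failing. Restarting."
    end
    false
  end
end
```

With this bookkeeping, two failures followed by one success leave the counter at zero; only an unbroken run of failures exceeds the limit.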
See this draft PR: #267. It shows the concept of resetting retry count on successful connection to API server. Again, this is only a stub, if accepted I would replicate this change for pods, add tests and docs.
I don't see any reason to need to restart Fluentd when this situation occurs. The only caveat is we may not have meta, or the meta may be stale. This does not seem like a pressing issue. It's reasonable to reset and try again.
Great, thanks @jcantrill. I'll prepare the pull request #267.
We are affected too. Internal link: infra#57
@Bessonov did you have a look at PR #267? It is already accepted, hopefully it will be merged soon and new version will be released.
Looks like the plugin has the config option `watch false` to disable this feature, if you only need the pod name, container name, node name, and namespace name. I tested that and it is working fine without restarts.
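For reference, a minimal filter configuration with the watch disabled might look like this (the match pattern `kubernetes.**` is an assumption; adjust it to your tag routing):

```
<filter kubernetes.**>
  @type kubernetes_metadata
  # Skip watching the API server for pod/namespace updates;
  # metadata is fetched once and not refreshed.
  watch false
</filter>
```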
It is deployed on Oracle Cloud Kubernetes, with plugin version 2.5.2.