fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Fails to parse watch call response and crashes pod #249

Closed: rockb1017 closed this issue 3 years ago

rockb1017 commented 4 years ago
```
2020-08-28 15:36:27 +0000 [error]: Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting. error reading from socket: Could not parse data
#<Thread:0x0000562808f18f70@/usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:276 run> terminated with exception (report_on_exception is true):
/usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:71:in `rescue in set_up_pod_thread': Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting. (Fluent::UnrecoverableError)
    from /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:40:in `set_up_pod_thread'
    from /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:276:in `block in configure'
Unexpected error Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.
  /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:71:in `rescue in set_up_pod_thread'
  /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:40:in `set_up_pod_thread'
  /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:276:in `block in configure'
```

It is deployed on Oracle Cloud Kubernetes, with plugin version 2.5.2.

rockb1017 commented 4 years ago

I looked into this further, and I think it happens when the connection is terminated by an external component (e.g. a load balancer; see https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). Right now we don't have a rescue clause for the ConnectionError raised by the HTTP library during parsing.
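
For illustration, a minimal sketch of what such a rescue could look like. This is hypothetical, not the plugin's actual code; it assumes the watch stream comes from kubeclient on top of the `http` gem, which raises `HTTP::ConnectionError` when the peer closes the socket mid-stream, and `pod_watcher` / `log` stand in for the plugin's own objects:

```ruby
begin
  pod_watcher.each do |notice|
    # handle ADDED / MODIFIED / DELETED notices here
  end
rescue HTTP::ConnectionError => e
  # The peer (e.g. a load balancer idle timeout) closed the socket.
  # Treat this as retryable and re-establish the watch instead of
  # counting it as a parse failure toward watch_retry_max_times.
  log.info("watch connection closed by the peer, re-establishing: #{e.message}")
end
```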

jcantrill commented 4 years ago

Pull requests are welcome

djj0809 commented 3 years ago

I'm also seeing this in Azure Kubernetes clusters:

```
Unexpected error Exception encountered parsing namespace watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.
  /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:70:in `rescue in set_up_namespace_thread'
  /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:39:in `set_up_namespace_thread'
  /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279:in `block in configure'
```

andrzej-stencel commented 3 years ago

Hey folks, would you accept a PR that would disable triggering the Fluentd restart, for example if watch_retry_max_times is set to -1?

EDIT: Haha, sorry, I removed the suggested code change as it didn't really make much sense 😅 Probably an additional if statement would be needed around this line:

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/84f66a8f9e06ab5b5211053fcce4cd8ab4bd74ba/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb#L51
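
Something roughly like this (a rough sketch only, not an actual patch; the instance variable names and the error message are assumptions based on the stack traces above):

```ruby
# With watch_retry_max_times set to -1, never raise the
# UnrecoverableError that triggers the Fluentd restart.
if @watch_retry_max_times != -1 && @watch_retry_count >= @watch_retry_max_times
  raise Fluent::UnrecoverableError,
        'Exception encountered parsing namespace watch event. ' \
        'The connection might have been closed.'
end
```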

andrzej-stencel commented 3 years ago

Created a draft PR to show my idea: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/266

The PR is missing the same change for pods, and also some tests. But I just wanted to validate the idea first: what do you think?

jcantrill commented 3 years ago

> Hey folks, would you accept a PR that would disable triggering the Fluentd restart, for example if watch_retry_max_times is set to -1?
>
> EDIT: Haha, sorry, I removed the suggested code change as it didn't really make much sense 😅 Probably an additional if statement would be needed around this line:
>
> https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/84f66a8f9e06ab5b5211053fcce4cd8ab4bd74ba/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb#L51

I would need to look at the code more extensively, but is there any reason to disallow nil (implying it's unset) and move on, in lieu of -1?

andrzej-stencel commented 3 years ago

Oh sure, I guess nil instead of -1 is just as good or better. I'll try to prepare the PR then; hopefully this will be enough to fix this issue.

andrzej-stencel commented 3 years ago

I had a chat with folks on my team, and the outcome is that maybe it's not such a good idea to disable the Fluentd restarts altogether: what if there is an actual problem connecting to the Kubernetes API?

Instead, maybe it's a better idea to tune the retry count resetting logic? Currently, the retry count is reset whenever an update comes in from the watch, as added in this PR: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/219. I believe the problem occurs when there are no updates from the API server for a long time: new watches are created and terminated, and if this goes on long enough, the Fluentd restart is triggered.

What if we reset the retry count not only on updates from the watch, but also on each successful connection to the API server?
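
Sketched out, the idea would look something like this (a rough illustration with assumed helper and variable names, not the actual diff):

```ruby
loop do
  begin
    # (Re)open the watch connection to the API server.
    watcher = start_namespace_watch # assumed helper that opens the watch
    # The connection succeeded, so reset the failure counter here as
    # well, not only when an update notice arrives on the stream.
    @watch_retry_count = 0
    process_namespace_watcher_notices(watcher)
  rescue StandardError => e
    @watch_retry_count += 1
    if @watch_retry_count >= @watch_retry_max_times
      raise Fluent::UnrecoverableError,
            "namespace watch failed #{@watch_retry_count} times: #{e.message}"
    end
    sleep(@watch_retry_interval)
  end
end
```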

Please let me know what you think.

andrzej-stencel commented 3 years ago

See this draft PR: #267. It shows the concept of resetting the retry count on a successful connection to the API server. Again, this is only a stub; if accepted, I would replicate this change for pods and add tests and docs.

jcantrill commented 3 years ago

> See this draft PR: #267. It shows the concept of resetting the retry count on a successful connection to the API server. Again, this is only a stub; if accepted, I would replicate this change for pods and add tests and docs.

I don't see any reason to restart Fluentd when this situation occurs. The only caveat is that we may not have metadata, or the metadata may be stale, which does not seem like a pressing issue. It's reasonable to reset and try again.

andrzej-stencel commented 3 years ago

Great, thanks @jcantrill. I'll prepare the pull request #267.

Bessonov commented 3 years ago

We are affected too. Internal link: infra#57

andrzej-stencel commented 3 years ago

@Bessonov did you have a look at PR #267? It has already been approved; hopefully it will be merged soon and a new version will be released.

djj0809 commented 3 years ago

It looks like the plugin has a `watch false` config option to disable this feature, if you only need the pod name, container name, node name, and namespace name. I tested that, and it works fine without restarts.
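
For reference, a minimal sketch of that configuration (the `kubernetes.**` match pattern is just an example; adjust it to your tag scheme):

```
<filter kubernetes.**>
  @type kubernetes_metadata
  # Disable the pod/namespace watch threads entirely; metadata is then
  # looked up (and cached) per record instead of via the watch API.
  watch false
</filter>
```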