fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Fluentd crashes when watching namespace metadata #239

Closed julienlefur closed 4 years ago

julienlefur commented 4 years ago

The watch connection to the API server seems to be closed regularly by one of the following: Kubeclient, the http gem, or the API server itself.

When no modifications are made to the namespaces, namespace_watch_retry_count keeps increasing because of this error:

2020-06-26 11:59:04 +0000 [info]: #0 [filter_kube_metadata] Exception encountered parsing namespace watch event. The connection might have been closed. Sleeping for 128 seconds and resetting the namespace watcher.error reading from socket: Could not parse data

When the max is reached, Fluentd crashes and restarts.

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/940296bac92ac05b97654e156dcf16c1eacd21b5/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb#L43-L55

The only way to reset namespace_watch_retry_count is to make a change to a namespace so that reset_namespace_watch_retry_stats is called. When no modifications are made to the namespaces, Fluentd crashes after 10 'connection closed' errors.
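
For reference, here is a simplified sketch of that behaviour, reconstructed from the linked source and the stack traces below. The names mirror the plugin, but this is an illustration rather than the actual implementation:

    # Simplified illustration only (not the plugin's actual code): the failure
    # counter is reset inside the event loop, so a watch that keeps dying before
    # any namespace event arrives eventually hits the limit and raises.
    MAX_NAMESPACE_WATCH_RETRIES = 10

    def run_namespace_watch(client)
      retries = 0
      begin
        client.watch_namespaces.each do |notice|
          retries = 0            # the reset only happens when an event is received
          handle_notice(notice)  # hypothetical cache-update handler
        end
      rescue StandardError
        retries += 1
        if retries >= MAX_NAMESPACE_WATCH_RETRIES
          raise Fluent::UnrecoverableError,
                'Exception encountered parsing namespace watch event. Retried 10 times yet still failing. Restarting.'
        end
        sleep(2**retries)        # matches the "Sleeping for 128 seconds" back-off in the log
        retry
      end
    end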

Would it be possible to catch the 'normal' connection close errors to avoid this behaviour?

It seems to be linked to this issue on Kubeclient: https://github.com/abonas/kubeclient/issues/273
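
If those connection closes are indeed benign, one possible shape of the fix (a sketch only, not a tested patch; HTTP::ConnectionError is taken from the stack traces further down, and handle_notice is a placeholder) would be to rescue that specific error, re-establish the watch, and leave the failure counter untouched:

    def watch_namespaces_with_reconnect(client)
      loop do
        begin
          client.watch_namespaces.each { |notice| handle_notice(notice) }
        rescue HTTP::ConnectionError => e
          # A dropped watch connection is expected (idle timeouts, API server
          # restarts), so log and reconnect instead of counting it as a failure.
          log.info("namespace watch connection closed (#{e.message}); reconnecting")
        end
        # Any other exception propagates and is still handled by the existing
        # retry / UnrecoverableError logic.
      end
    end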

jcantrill commented 4 years ago

fixed by https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/247

julienlefur commented 4 years ago

@jcantrill I still have the same behaviour with version 2.5.2. The connection to the api-server is closed regularly, the stat counter "namespace_watch_failures" is incremented up to 10, and Fluentd crashes. I have a cronjob that applies a change to a namespace so that Fluentd resets this counter. This is a workaround to prevent Fluentd from crashing. I'll keep you posted if I can dig deeper and find anything.
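
For anyone who needs the same stopgap: the cronjob only has to touch a namespace so that the watch emits a MODIFIED event and the counter is reset. A hypothetical version of that touch using kubeclient (the namespace name, annotation key, endpoint, and credential paths are assumptions, not taken from this thread):

    require 'kubeclient'

    # Hypothetical in-cluster client; adjust the endpoint and auth/SSL options
    # for your environment.
    client = Kubeclient::Client.new(
      'https://kubernetes.default.svc/api', 'v1',
      auth_options: { bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token' },
      ssl_options:  { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }
    )

    # Patching an annotation produces the MODIFIED watch event that lets the
    # plugin call reset_namespace_watch_retry_stats.
    client.patch_namespace(
      'default',
      metadata: { annotations: { 'fluentd-watch-keepalive' => Time.now.to_i.to_s } }
    )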

mashail commented 4 years ago

@jcantrill we upgraded to 2.5.2 and we still get the same issue; Fluentd is crashing every hour:

2020-08-01 15:53:44 +0300 [error]: Exception encountered parsing namespace watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.error reading from socket: Could not parse data
#<Thread:0x000055ea2b3c0d78@/opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279 run> terminated with exception (report_on_exception is true):
/opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:70:in `rescue in set_up_namespace_thread': Exception encountered parsing namespace watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting. (Fluent::UnrecoverableError)
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:39:in `set_up_namespace_thread'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279:in `block in configure'
/opt/bitnami/fluentd/gems/http-4.4.1/lib/http/response/parser.rb:31:in `add': error reading from socket: Could not parse data (HTTP::ConnectionError)
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/connection.rb:214:in `read_more'
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/connection.rb:92:in `readpartial'
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/response/body.rb:30:in `readpartial'
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/response/body.rb:36:in `each'
    from /opt/bitnami/fluentd/gems/kubeclient-4.8.0/lib/kubeclient/watch_stream.rb:25:in `each'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:114:in `process_namespace_watcher_notices'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:41:in `set_up_namespace_thread'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279:in `block in configure'
/opt/bitnami/fluentd/gems/http-4.4.1/lib/http/response/parser.rb:31:in `add': Could not parse data (IOError)
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/connection.rb:214:in `read_more'
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/connection.rb:92:in `readpartial'
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/response/body.rb:30:in `readpartial'
    from /opt/bitnami/fluentd/gems/http-4.4.1/lib/http/response/body.rb:36:in `each'
    from /opt/bitnami/fluentd/gems/kubeclient-4.8.0/lib/kubeclient/watch_stream.rb:25:in `each'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:114:in `process_namespace_watcher_notices'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:41:in `set_up_namespace_thread'
    from /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279:in `block in configure'
Unexpected error Exception encountered parsing namespace watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.
  /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:70:in `rescue in set_up_namespace_thread'
  /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb:39:in `set_up_namespace_thread'
  /opt/bitnami/fluentd/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:279:in `block in configure'

I enabled trace logging for the plugin to try to figure out the issue, but I wasn't lucky. I don't want to increase the retry limit because Fluentd will eventually crash anyway; I want to diagnose the root cause and solve it. Can you advise?
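
In case it helps to narrow this down, one option is to exercise the same watch outside Fluentd with a short kubeclient script, to see how long the stream stays open and which error ends it (a rough sketch; the endpoint and credential paths are assumptions):

    require 'kubeclient'

    client = Kubeclient::Client.new(
      'https://kubernetes.default.svc/api', 'v1',
      auth_options: { bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token' },
      ssl_options:  { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }
    )

    started = Time.now
    begin
      # Print every namespace watch event until the stream ends or errors out.
      client.watch_namespaces.each do |notice|
        puts "#{Time.now} #{notice.type} #{notice.object.metadata.name}"
      end
      puts "watch stream ended cleanly after #{(Time.now - started).round}s"
    rescue StandardError => e
      puts "watch stream died after #{(Time.now - started).round}s: #{e.class}: #{e.message}"
    end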