fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Fluentd Restart with " Exception encountered parsing pod watch event". #293

Closed. SarthakSahu closed this issue 1 year ago.

SarthakSahu commented 3 years ago

'fluent-plugin-kubernetes_metadata_filter' version '2.4.2'

Fluentd pods are running as a DaemonSet (20 instances) in our K8S cluster. A few Fluentd pods restarted with the error below, caused by kubernetes_metadata_filter. Please let me know why these pods restarted and how we can overcome this issue.

As per the comments in the code, this could be caused by the Kubernetes API being temporarily down. In the meantime, however, the pod was still reporting its stats successfully.

2021-05-25 09:32:11 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 747, pod_cache_watch_ignored: 382202, namespace_cache_watch_misses: 14, pod_cache_watch_delete_ignored: 3052, pod_cache_watch_updates: 425, pod_cache_watch_misses: 12735, namespace_cache_api_updates: 54, pod_cache_api_updates: 54, id_cache_miss: 54, pod_cache_host_updates: 747, pod_watch_failures: 1
2021-05-25 09:33:11 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 747, pod_cache_watch_ignored: 387927, namespace_cache_watch_misses: 14, pod_cache_watch_delete_ignored: 3052, pod_cache_watch_updates: 425, pod_cache_watch_misses: 12735, namespace_cache_api_updates: 54, pod_cache_api_updates: 54, id_cache_miss: 54, pod_cache_host_updates: 747, pod_watch_failures: 1
2021-05-25 09:33:49 +0000 [error]: #0 Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.failed to connect: getaddrinfo: Name or service not known
2021-05-25 09:33:49 +0000 [warn]: #0 thread exited by unexpected error plugin=Fluent::Plugin::KubernetesMetadataFilter title=:pod_watch_thread error_class=Fluent::UnrecoverableError error="Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting."
2021-05-25 09:33:49 +0000 [error]: #0 unexpected error error_class=Fluent::UnrecoverableError error="Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting."
  2021-05-25 09:33:49 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.4.2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:64:in `rescue in set_up_pod_thread'
  2021-05-25 09:33:49 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.4.2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:39:in `set_up_pod_thread'
  2021-05-25 09:33:49 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-kubernetes_metadata_filter-2.4.2/lib/fluent/plugin/filter_kubernetes_metadata.rb:278:in `block in configure'
  2021-05-25 09:33:49 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.9.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2021-05-25 09:33:49 +0000 [error]: #0 Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting. error_class=Fluent::UnrecoverableError error="Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting."
  2021-05-25 09:33:49 +0000 [error]: #0 suppressed same stacktrace
2021-05-25 09:33:49 +0000 [info]: Worker 0 finished unexpectedly with status 2
2021-05-25 09:33:49 +0000 [info]: Received graceful stop
2021-05-25 09:33:49 +0000 [warn]: thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::KubernetesMetadataFilter title=:namespace_watch_thread thread=#<Thread:0x00000000042e2f30@namespace_watch_thread@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.9.2/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2021-05-25 09:33:49 +0000 [warn]: thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::KubernetesMetadataFilter title=:pod_watch_thread thread=#<Thread:0x00000000042e3138@pod_watch_thread@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.9.2/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
jcantrill commented 3 years ago

2021-05-25 09:33:49 +0000 [error]: #0 Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting.failed to connect: getaddrinfo: Name or service not known

Your answer is in the stack trace; it looks like a possible networking or DNS error.

2021-05-25 09:33:49 +0000 [warn]: #0 thread exited by unexpected error plugin=Fluent::Plugin::KubernetesMetadataFilter title=:pod_watch_thread error_class=Fluent::UnrecoverableError error="Exception encountered parsing pod watch event. The connection might have been closed. Retried 10 times yet still failing. Restarting."

It tried to connect 10 times and was unable to.
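
For context, here is a minimal sketch of the retry-then-fail pattern implied by the log output above (illustrative only, not the plugin's actual source; watch_pods_from_api_server and MAX_RETRIES are hypothetical stand-ins, and Fluent::UnrecoverableError is assumed to come from fluentd):

MAX_RETRIES = 10  # assumed limit, matching "Retried 10 times" in the log

def run_pod_watch_loop
  retries = 0
  begin
    # Hypothetical call that opens and consumes the pod watch connection.
    watch_pods_from_api_server
  rescue StandardError => e
    retries += 1
    retry if retries < MAX_RETRIES
    # Once the retries are exhausted the plugin gives up. fluentd treats
    # Fluent::UnrecoverableError as fatal, so the worker exits and the
    # container restarts, which is the restart seen in the logs.
    raise Fluent::UnrecoverableError,
          'Exception encountered parsing pod watch event. ' \
          "The connection might have been closed. Retried #{MAX_RETRIES} " \
          "times yet still failing. Restarting. #{e.message}"
  end
end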

SarthakSahu commented 3 years ago

But how were the stats received successfully during the failure period? Is this similar to issue #249? Is the fix available in v2.5.3 via #267?

jcantrill commented 3 years ago

But how were the stats received successfully during the failure period?

Stats are dumped when log messages are processed by the filter and a certain amount of time has elapsed; the line is the string representation of a hash. Additionally, if the filter is processing messages for which it already has an entry in its cache, it will not make an API call. The issue you reference is related to the plugin watching the API server, which happens in a thread outside the message processing loop. This means it is entirely possible that logs are being sent through the filter while the API watch is failing.
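
To illustrate the cache-first behavior described above, a minimal sketch with hypothetical names (not the plugin's actual code): a record whose key is already cached never triggers an API call, while the failing watch runs in a separate thread whose only job is to keep the cache fresh.

require 'lru_redux'  # assumption: the LRU cache gem the plugin depends on

# TTL cache: up to 1000 entries, each expiring after one hour.
POD_CACHE = LruRedux::TTL::ThreadSafeCache.new(1000, 3600)

def metadata_for(namespace, pod_name)
  key = "#{namespace}_#{pod_name}"
  # getset returns the cached value when present; the block (an API call)
  # only runs on a cache miss.
  POD_CACHE.getset(key) do
    fetch_pod_metadata_from_api(namespace, pod_name)  # hypothetical API call
  end
end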

Is this similar to issue #249? Is the fix available in v2.5.3 via #267?

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/267#issuecomment-765665539

SarthakSahu commented 3 years ago

Is this a bug? Is this similar to #249?

jcantrill commented 1 year ago

Closing due to age