fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

410 Gone encountered. Restarting pod watch to reset resource versions #362

Closed (olyhao closed this issue 1 year ago)

olyhao commented 1 year ago

When fluentd collects application logs, the error "[filter_kubernetes_metadata] 410 Gone encountered. Restarting pod watch to reset resource versions.410 Gone" appears. This problem is fatal for me. How can I solve it? Thank you.

olyhao commented 1 year ago

[screenshot: installed versions] This is my version; what do I need to configure to solve this problem?

olyhao commented 1 year ago
[screenshot attached]
jcantrill commented 1 year ago

When fluentd collects application logs, the error "[filter_kubernetes_metadata] 410 Gone encountered. Restarting pod watch to reset resource versions.410 Gone" appears. This problem is fatal for me.

Please be more specific: how does this adversely affect your log processing? The 410 error implies the watch client needs to be rebuilt so the plugin can continue to receive callbacks that keep its cache in sync. It is possible to completely disable the watch; I encourage you to review the configuration options on how to do that.
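
(For reference, a minimal sketch of what disabling the watch looks like. The kubernetes.** tag pattern and the surrounding filter block are assumptions about a typical setup; watch is the plugin option being referred to.)

```
<filter kubernetes.**>
  @type kubernetes_metadata
  # Disable the pod/namespace watch; metadata is then fetched on demand
  # from the API server when an entry is missing from the cache.
  watch false
</filter>
```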

olyhao commented 1 year ago

When fluentd collects application logs, the error "[filter_kubernetes_metadata] 410 Gone encountered. Restarting pod watch to reset resource versions.410 Gone" appears. This problem is fatal for me.

Please be more specific: how does this adversely affect your log processing? The 410 error means the watch client needs to be rebuilt so that the plugin can continue to receive callbacks to keep its cache in sync. It is possible to completely disable the watch; I encourage you to review the configuration options on how to do that. [screenshot attached]

Thank you for your reply. Yesterday my fluentd agent suddenly stopped collecting logs, and the agent log shows only {"message":"[filter_kubernetes_metadata] 410 Gone encountered. Restarting pod watch to reset resource versions.410 Gone"} and [info]: #0 [filter_kubernetes_metadata] 410 Gone encountered. Restarting pod watch to reset resource versions.410 Gone. Restarting the fluentd agent gets log collection working again, but restarting the pod does not really solve the problem, and that kind of workaround is not acceptable in a production environment. These log lines are the only information I can get. Is there any relationship between the fluentd agent stopping log collection and the 410 error? How can I solve this so that fluentd does not stop collecting logs? I really need your help, thank you!

jcantrill commented 1 year ago

My first suggestion would be to turn off the watch completely and see if the issue persists. You could additionally increase the verbosity of the plugin to see if there is additional information that may be useful.

Turning off the watch means the cache will be built by making calls to the API server when an entry is not found, instead of being dynamically populated as pods come and go on the node. The cache is an LRU which is configured to expire entries based on max entries or time. Disabling the watch may be the best option unless you can provide more information to track down your issue.

I do not see how rebuilding the watch would block messages from flowing unless there is a bug which causes the client to hang indefinitely. The watch is intended to be re-established and to continue operating as normal.
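
(A sketch of the two suggestions above, assuming a typical filter block; @log_level is fluentd's standard per-plugin log-level parameter, and the kubernetes.** tag pattern is an assumption.)

```
<filter kubernetes.**>
  @type kubernetes_metadata
  # Raise only this plugin's verbosity to capture more detail around the 410s.
  @log_level debug
  # Or disable the watch entirely and rely on on-demand API lookups:
  # watch false
</filter>
```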

olyhao commented 1 year ago

My first suggestion would be to turn off the watch completely and see if the issue persists. You could additionally increase the verbosity of the plugin to see if there is additional information that may be useful.

Turning off the watch means the cache will be built by making calls to the API server when an entry is not found, instead of being dynamically populated as pods come and go on the node. The cache is an LRU which is configured to expire entries based on max entries or time. Disabling the watch may be the best option unless you can provide more information to track down your issue.

I do not see how rebuilding the watch would block messages from flowing unless there is a bug which causes the client to hang indefinitely. The watch is intended to be re-established and to continue operating as normal.

Thank you for your answer. I would also like to ask: if I set "watch false" when there are many pods, will that put a lot of pressure on the API server?

jcantrill commented 1 year ago

I cannot speak to what behavior you will encounter for your use case. The cache is configured for 1000 entries and a 3600-second expiration time. This means that, assuming the set of pods on any given node is fairly stable, you should see an early spike of queries and then only limited queries as pods come and go. If you are collecting logs from a build server where pods are short-lived, then I would expect the load against the API server to increase, since you are likely to evict entries from the cache more frequently.
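
(If on-demand lookups with the watch disabled do become a concern, one option is to enlarge the LRU cache so entries survive longer between API calls. cache_size and cache_ttl are the plugin options behind the 1000-entry / 3600-second defaults mentioned above; the values below are purely illustrative.)

```
<filter kubernetes.**>
  @type kubernetes_metadata
  watch false
  # Illustrative values only: keep more pod entries, and keep them longer,
  # trading memory for fewer round trips to the API server.
  cache_size 5000
  cache_ttl 7200
</filter>
```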