fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Dreaded "can't add a new key into hash during iteration" again #381

Open jcrood opened 4 months ago

jcrood commented 4 months ago

I installed fluentd on k8s using fluentd-kubernetes-daemonset v1.16.5 and occasionally have it logging an error from the metadata plugin:

2024-04-19 09:18:20 +0000 [warn]: #0 dump an error event: error_class=RuntimeError error="can't add a new key into hash during iteration" location="/fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-kubernetes_metadata_filter-3.4.0/lib/fluent/plugin/kubernetes_metadata_cache_strategy.rb:86:in 'merge!'" tag="kubernetes.var.log.containers.calico-node-hrnj2_kube-system_calico-node-455d05a2c5e6b90ce4444c083d2aeeebf89c454d1588ea6fe7d314a7eaa61e6e.log" time=2024-04-14 22:15:41.969400595 +0000 record={"stream"=>"stdout", "logtag"=>"F", "message"=>"2024-04-14 22:15:41.968 [INFO][94] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m4.6s: avg=7ms longest=22ms ()", "time"=>"2024-04-15T00:15:41.969400595+02:00", "docker"=>{"container_id"=>"455d05a2c5e6b90ce4444c083d2aeeebf89c454d1588ea6fe7d314a7eaa61e6e"}, "kubernetes"=>{"container_name"=>"calico-node", "namespace_name"=>"kube-system", "pod_name"=>"calico-node-hrnj2", "container_image"=>"quay.io/calico/node:v3.24.5", "container_image_id"=>"sha256:54637cb36d4a1c029fb994c6fc88af04791c1f2dcbd12a24aa995c0bffaacdb1", "pod_id"=>"b3573c8d-a8a1-4d46-bd63-a447b609e71f", "pod_ip"=>"10.254.6.12", "host"=>"m02.c042.k8s.example.com", "labels"=>{"controller-revision-hash"=>"79ddc99967", "k8s-app"=>"calico-node", "pod-template-generation"=>"4"}, "master_url"=>"https://10.233.0.1:443/api"}, "@timestamp"=>"2024-04-14T22:15:41.969400595+00:00"}

This appears to be similar to #237 and #260, except it occurs in a different part of the caching code and in a newer version. I've tried shuffling some code around, taking out the namespace_cache fetch and isolating the merge! call, but it keeps failing on that line. It may well be tripping over Ruby's own iteration guard: https://github.com/ruby/ruby/blob/ruby_2_3/hash.c#L2315 (a minimal demonstration is below).
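
For reference, the message comes from Ruby's built-in protection against mutating a Hash while it is being iterated; this single-threaded snippet (not plugin code) is enough to reproduce it:

```ruby
# Minimal reproduction of Ruby's iteration guard:
# adding a key to a Hash while it is being iterated raises RuntimeError.
h = { a: 1 }
begin
  h.each { |k, v| h[:b] = 2 }  # mutation mid-iteration trips the guard
rescue RuntimeError => e
  puts e.message  # => can't add a new key into hash during iteration
end
```

With multiple filter threads, the same guard can fire without any explicit each block: one thread's merge! is iterating the shared cached hash at the exact moment another thread's merge! adds keys to it.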

Edited to add: having dug a bit deeper into what's happening, it appears that this can occur because both merge!s (here and [here](https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/v3.4.0/lib/fluent/plugin/kubernetes_metadata_cache_strategy.rb#L86)) can mutate the same cached pod metadata entry and are not thread safe. I've currently "fixed" it for myself by using a plain merge and writing the updated copy back to the cache, which is synchronised.
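
A rough sketch of that workaround follows, with a plain Hash standing in for the plugin's synchronised cache; the names `cache`, `cache_key` and `extra_metadata` are placeholders, not the plugin's actual variables, and this is not the exact patch I'm running:

```ruby
# Non-destructive variant: build a new hash instead of mutating the shared cache
# entry in place, then write the copy back through the cache's own store, which
# is synchronised in the plugin.
def merge_and_store(cache, cache_key, extra_metadata)
  cached = cache[cache_key] || {}
  updated = cached.merge(extra_metadata)  # plain merge: the cached hash is never mutated in place
  cache[cache_key] = updated              # write the updated copy back to the cache
  updated
end
```

The point of the change is simply that no thread ever adds keys to a hash another thread might be iterating; the cost is one extra hash allocation per cache update.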