fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

buffer space has too many data errors on k8s cluster #2411

Open gauravaroraoyo opened 5 years ago

gauravaroraoyo commented 5 years ago

The fluentd server itself runs on a dedicated system outside of the Kubernetes cluster. We do see a few warnings on it from time to time:

#0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=24.349883934482932 slow_flush_log_threshold=20.0 plugin_id="object:3f9090f98e80"

We've tried adjusting every chunk_limit and flush setting to get rid of this error, but it doesn't seem to go away. Is there an obvious error in our configuration that we're missing?
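
For context, the settings mentioned above live in the output's <match>/<buffer> sections; a minimal sketch, assuming an Elasticsearch output (the host, path, and values here are placeholders, not the reporter's actual configuration):

<match kubernetes.**>
  @type elasticsearch
  # placeholder destination
  host elasticsearch.example.com
  port 9200
  # raise the warning threshold (default is 20 seconds) if slow-but-successful flushes are expected
  slow_flush_log_threshold 40.0
  <buffer>
    @type file
    # hypothetical buffer path
    path /var/log/fluentd/buffer/es
    chunk_limit_size 8M
    flush_interval 5s
    flush_thread_count 4
  </buffer>
</match>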

gauravaroraoyo commented 5 years ago

Would appreciate some guidance here on how we can go about debugging this further.

savitharaghunathan commented 5 years ago

Even with the updated fluentd image, I get the same error. Any pointers on resolving this would be appreciated :)

#0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.4.2/lib/fluent/plugin/buffer.rb:298:in `write'" tag="kubernetes.var.log.containers.fluentd-lslhj_kube-logging_fluentd-3865402aacdaa7793473d31de0c6a9d604cfab3cbc39bbf3bba12b70e473137c.log"

epcim commented 5 years ago

I have the same issue; surprisingly, a restart of fluentd works for a while.

I would appreciate some guidance as well. It's not clear which buffer is full or how to set the sizes (buffer/chunk/queue limits) properly. In my case, fluentbit forwards to fluentd, which forwards to another fluentd (I see the buffer overflow errors mostly in the last fluentd in the chain).

[328] kube.var.log.containers.fluentd-79cc4cffbd-d9cdg_sre_fluentd-dccc4f286753b75a53c464446af44ffcbeba5ad3a21c9a947a11e94f4c6892b2.log: [1560431258.193260514, {"log"=>"2019-06-13 13:07:38 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:269:in `write'" tag="raw.kube.app.obelix"
[330] kube.var.log.containers.fluentd-79cc4cffbd-d9cdg_sre_fluentd-dccc4f286753b75a53c464446af44ffcbeba5ad3a21c9a947a11e94f4c6892b2.log: [1560431258.193283014, {"log"=>"2019-06-13 13:07:38 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:269:in `write'" tag="kube.var.log.containers.obelix-j6h2n_ves-system_obelix-74bc7f7ecbcb9981c5f39eab9d85b855c5145f299d71d68ad4bef8f223653327.log"

apacoco9861 commented 5 years ago

I also got this error:

2019-07-02 09:58:09 +0000 [warn]: #0 [out_es] failed to write data into buffer by buffer overflow action=:throw_exception
2019-07-02 09:58:09 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.4.2/lib/fluent/plugin/buffer.rb:298:in `write'" tag="kubernetes.var.log.containers.weave-net-6bltm_kube-system_weave-c86976ea8158588ae5d1f421f2c64de83facefaeb9bbd3a5667eda64b2ae1bd4.log"
2019-07-02 09:58:09 +0000 [warn]: #0 suppressed same stacktrace

notmaxx commented 5 years ago

Same here:

2019-07-23 16:51:46 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2019-07-23 16:51:46 +0000 [warn]: #0 send an error event stream to @ERROR: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:269:in `write'" tag="k.worker-7f5b967d75-7gfgd"
repeatedly commented 5 years ago

BufferOverflowError happens when output speed is slower than incoming traffic. So there are several approaches:
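
For reference, the knobs usually involved are all on the output's <buffer> section; a sketch with illustrative values only (not recommendations):

<buffer>
  @type file
  # hypothetical path
  path /var/log/fluentd/buffer/out
  # total_limit_size caps the whole buffer; BufferOverflowError is raised once it is full
  # (defaults: 512MB for memory buffers, 64GB for file buffers)
  total_limit_size 8GB
  chunk_limit_size 8M
  # flush more often and with more threads so output throughput can keep up with incoming traffic
  flush_interval 5s
  flush_thread_count 8
  # overflow_action decides what happens when the buffer is full:
  # throw_exception (the default seen in these logs), block (backpressure), or drop_oldest_chunk
  overflow_action throw_exception
</buffer>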

hlakshmi commented 4 years ago

@repeatedly We also see the same errors related to BufferOverflowError. And we got the plugin metrics from the monitor agent which is interesting:

{"plugins":[{"plugin_id":"object:dda9b4","plugin_category":"input","type":"monitor_agent","config":{"@type":"monitor_agent","bind":"0.0.0.0","port":"25220"},"output_plugin":false,"retry_count":null},{"plugin_id":"object:114f888","plugin_category":"input","type":"forward","config":{"@type":"forward","port":"25224"},"output_plugin":false,"retry_count":null},{"plugin_id":"object:e94a6c","plugin_category":"output","type":"null","config":{"@type":"null"},"output_plugin":true,"retry_count":0,"retry":{}},{"plugin_id":"object:e538a0","plugin_category":"output","type":"file","config":{"@type":"file","path":"/xx/xx/xxx/fluentd/${tag[1]}/${tag[0]}/%Y/%m/%d/%H","append":"false","compress":"gzip"},"output_plugin":true,"buffer_queue_length":0,"buffer_total_queued_size":68725542300,"retry_count":58672,"retry":{}}]}

The above shows that buffer_total_queued_size is > 64GB even though we are using a file buffer, yet the disk utilization of the entire fluentd buffer directory is much lower. Is there something we are missing, or is this a bug in fluentd?
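
For reference, those numbers come from the monitor_agent input already present in the config above, and ~64GB is also the default total_limit_size for file buffers, so the buffer appears to be sitting at its cap rather than reporting a bogus value. A sketch of the relevant source (port taken from the config above):

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 25220
</source>
# per-plugin metrics (buffer_total_queued_size, buffer_queue_length, retry_count, ...)
# can then be fetched with: curl http://localhost:25220/api/plugins.json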

ziaudin commented 4 years ago

Hoping to get some guidance on our setup. I am using Elasticsearch for the logs. Initially the fluentd pod was throwing the following error:

Worker 0 finished unexpectedly with signal SIGKILL

This was resolved after increasing the memory limit to 2Gi. Then we started getting a different fluentd error:

[_cluster-elasticsearch_cluster-elasticsearch_elasticsearch] failed to write data into buffer by buffer overflow action=:throw_exception

We attempted to resolve the error by tweaking the buffer settings; we now have the following:

    _buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
      chunk_limit_size: 16MB
      flush_mode: interval
      flush_interval: 5s
      flush_thread_count: 8

But I can still see that the buffer size on fluentd is 5.3G (not increasing for the last two days), and every so often I see the following error:

[_cluster-elasticsearch_cluster-elasticsearch_elasticsearch] failed to write data into buffer by buffer overflow action=:throw_exception

The buffer size seems to suggest that there are still logs waiting to be pushed to Elasticsearch, and that fluentd is struggling to cope with the logs coming from fluentbit. Note that I do see some recent logs in Elasticsearch, but not all of them. I'd appreciate any suggestions.
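
One hedged option for a pipeline like this (fluentbit forwarding to fluentd, which writes to Elasticsearch) is to make the buffer limit explicit and choose the overflow behaviour deliberately. Expressed as a raw fluentd <buffer> block (a sketch only; whether the operator or chart in use exposes these exact keys would need checking):

<buffer>
  @type file
  # hypothetical path
  path /buffers/elasticsearch
  chunk_limit_size 16MB
  # explicit cap instead of relying on the 64GB file-buffer default
  total_limit_size 8GB
  flush_mode interval
  flush_interval 5s
  flush_thread_count 8
  # block stalls the forward input when the buffer is full, pushing backpressure onto fluentbit
  # (trade-off: fluentbit must then buffer and retry); drop_oldest_chunk trades data loss for availability
  overflow_action block
</buffer>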

evalizhangli commented 4 years ago

Same here:

2019-07-23 16:51:46 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2019-07-23 16:51:46 +0000 [warn]: #0 send an error event stream to @ERROR: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:269:in `write'" tag="k.worker-7f5b967d75-7gfgd"

I ran a load test against my service, which produced a lot of logs and caused my fluentd plugin to return this error. How do you fix it? By restarting fluentd?

Adhira-Deogade commented 4 years ago

Any updates?

cforce commented 4 years ago

Is there any solution for this? If you continuously lose data, this can't be used in production.

srfaytkn commented 3 years ago

same issue

srfaytkn commented 3 years ago
...
    <buffer>
      flush_thread_count 8
      flush_interval 1s
      chunk_limit_size 10M
      queue_limit_length 16
      retry_max_interval 30
      retry_forever true
    </buffer>
...

This solution worked for me.

davinerd commented 3 years ago

Just wanted to say that I've struggled all night with this issue, and the only way to resolve it was to scale up the receiving end (Elasticsearch, in my case).

I was using 2 Elasticsearch data nodes, and scaling up by just one node solved the issue instantly.

Just for the sake of completeness, this is what I'm using as the buffer:

<buffer>
  @type file
  path /fluentd/log/elastic-buffer
  flush_thread_count 16
  flush_interval 1s
  chunk_limit_size 10M
  queue_limit_length 16
  flush_mode interval
  retry_max_interval 30
  retry_forever true
</buffer>

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

github-actions[bot] commented 3 years ago

This issue was automatically closed because of stale in 30 days

iamMrGaurav commented 5 months ago

Has anybody resolved this issue?

daipom commented 5 months ago

This could be the same phenomenon as

theduckling commented 2 months ago

Anybody? Somebody?? Help 😥

daipom commented 2 months ago

It appears to be a problem with the buffer settings, but given that there are so many reports, there may be something we can improve. It should be investigated.

lazzio7 commented 1 month ago

Good afternoon,

Same issue here.

I am collecting logs within Harvester as a cluster output for audit and logging data; the logs are then sent to a jump server running fluentd, which forwards them to OpenSearch.

It works for several hours until fluentd stops due to:

  2024-09-04 06:03:14 +0000 [warn]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/io.rb:186:in `on_readable'
2024-09-04 06:03:14.254450433 +0000 fluent.warn: {"error":"#<Fluent::Plugin::Buffer::BufferOverflowError: buffer space has too many data>","location":"/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:330:in `write'","tag":"kubernetes.var.log.containers.checkmk-cluster-collector-5d756b6fc-qnmdd_checkmk-monitoring_cluster-collector-26c4ec7eda148132d5c1d974fae19ef8d67cadb66918d53e6ac5a0db3a6fb245.log","message":"emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error=\"buffer space has too many data\" location=\"/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:330:in `write'\" tag=\"kubernetes.var.log.containers.checkmk-cluster-collector-5d756b6fc-qnmdd_checkmk-monitoring_cluster-collector-26c4ec7eda148132d5c1d974fae19ef8d67cadb66918d53e6ac5a0db3a6fb245.log\""}
  2024-09-04 06:03:14 +0000 [warn]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run_once'
  2024-09-04 06:03:14 +0000 [warn]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run'
  2024-09-04 06:03:14 +0000 [warn]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-09-04 06:03:14 +0000 [warn]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-09-04 06:03:14 +0000 [error]: #0 unexpected error on reading data host="10.16.19.98" port=64831 error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data"
2024-09-04 06:03:14.257656445 +0000 fluent.error: {"host":"10.16.19.98","port":64831,"error":"#<Fluent::Plugin::Buffer::BufferOverflowError: buffer space has too many data>","message":"unexpected error on reading data host=\"10.16.19.98\" port=64831 error_class=Fluent::Plugin::Buffer::BufferOverflowError error=\"buffer space has too many data\""}
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/buffer.rb:330:in `write'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1095:in `block in handle_stream_simple'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:977:in `write_guard'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:1094:in `handle_stream_simple'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:967:in `execute_chunking'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/output.rb:897:in `emit_buffered'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/event_router.rb:115:in `emit_stream'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/in_forward.rb:318:in `on_message'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/in_forward.rb:226:in `block in handle_connection'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/in_forward.rb:263:in `block (3 levels) in read_messages'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/in_forward.rb:262:in `feed_each'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/in_forward.rb:262:in `block (2 levels) in read_messages'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin/in_forward.rb:271:in `block in read_messages'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/server.rb:640:in `on_read_without_connection'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/io.rb:123:in `on_readable'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/io.rb:186:in `on_readable'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run_once'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-09-04 06:03:14 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create' 

I tried several buffer options in the Harvester output buffer configuration as well as on the jump-server fluentd side, but the buffer errors persist.

Also, when forwarding starts again after restarting the fluentd service, 400 errors fill fluentd.log (but I don't think this is the problem, since I can see data in OpenSearch; it is probably related to mapping or something else):

2024-09-04 09:27:55.389466767 +0000 fluent.warn: {"error":"#<Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError: 400 - Rejected by OpenSearch>","location":null,"tag":"kubernetes.var.log.containers.checkmk-node-collector-container-metrics-wd6wm_checkmk-monitoring_cadvisor-3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60.log","time":1725441524,"record":{"stream":"stderr","logtag":"F","message":"W0904 09:18:44.221599       1 machine_libipmctl.go:64] There are no NVM devices!","kubernetes":{"pod_name":"checkmk-node-collector-container-metrics-wd6wm","namespace_name":"checkmk-monitoring","pod_id":"f681dd37-9fc3-445d-b56a-b9294d3a3dd9","labels":{"app":"checkmk-node-collector-container-metrics","app.kubernetes.io/instance":"checkmk","app.kubernetes.io/name":"checkmk","component":"checkmk-node-collector","controller-revision-hash":"5b48d75884","pod-template-generation":"2"},"annotations":{"cni.projectcalico.org/containerID":"3a74fe8ce9ae0ae712212203d46738eac804258562f82ec04607b405d20e23cd","cni.projectcalico.org/podIP":"10.52.2.192/32","cni.projectcalico.org/podIPs":"10.52.2.192/32","k8s.v1.cni.cncf.io/network-status":"[{\n    \"name\": \"k8s-pod-network\",\n    \"ips\": [\n        \"10.52.2.192\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n}]","k8s.v1.cni.cncf.io/networks-status":"[{\n    \"name\": \"k8s-pod-network\",\n    \"ips\": [\n        \"10.52.2.192\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n}]"},"host":"sissach-harv3","container_name":"cadvisor","docker_id":"3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60","container_hash":"docker.io/checkmk/cadvisor-patched@sha256:b0fe7daf1ab6beeb28abef175bcce623244be6bf59237fcf72b6af3d62e437f1","container_image":"docker.io/checkmk/cadvisor-patched:1.5.1"}},"message":"dump an error event: error_class=Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError error=\"400 - Rejected by OpenSearch\" location=nil tag=\"kubernetes.var.log.containers.checkmk-node-collector-container-metrics-wd6wm_checkmk-monitoring_cadvisor-3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60.log\" time=2024-09-04 09:18:44.221675829 +0000 record={\"stream\"=>\"stderr\", \"logtag\"=>\"F\", \"message\"=>\"W0904 09:18:44.221599       1 machine_libipmctl.go:64] There are no NVM devices!\", \"kubernetes\"=>{\"pod_name\"=>\"checkmk-node-collector-container-metrics-wd6wm\", \"namespace_name\"=>\"checkmk-monitoring\", \"pod_id\"=>\"f681dd37-9fc3-445d-b56a-b9294d3a3dd9\", \"labels\"=>{\"app\"=>\"checkmk-node-collector-container-metrics\", \"app.kubernetes.io/instance\"=>\"checkmk\", \"app.kubernetes.io/name\"=>\"checkmk\", \"component\"=>\"checkmk-node-collector\", \"controller-revision-hash\"=>\"5b48d75884\", \"pod-template-generation\"=>\"2\"}, \"annotations\"=>{\"cni.projectcalico.org/containerID\"=>\"3a74fe8ce9ae0ae712212203d46738eac804258562f82ec04607b405d20e23cd\", \"cni.projectcalico.org/podIP\"=>\"10.52.2.192/32\", \"cni.projectcalico.org/podIPs\"=>\"10.52.2.192/32\", \"k8s.v1.cni.cncf.io/network-status\"=>\"[{\\n    \\\"name\\\": \\\"k8s-pod-network\\\",\\n    \\\"ips\\\": [\\n        \\\"10.52.2.192\\\"\\n    ],\\n    \\\"default\\\": true,\\n    \\\"dns\\\": {}\\n}]\", \"k8s.v1.cni.cncf.io/networks-status\"=>\"[{\\n    \\\"name\\\": \\\"k8s-pod-network\\\",\\n    \\\"ips\\\": [\\n        \\\"10.52.2.192\\\"\\n    ],\\n    \\\"default\\\": true,\\n    \\\"dns\\\": {}\\n}]\"}, \"host\"=>\"sissach-harv3\", \"container_name\"=>\"cadvisor\", 
\"docker_id\"=>\"3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60\", \"container_hash\"=>\"docker.io/checkmk/cadvisor-patched@sha256:b0fe7daf1ab6beeb28abef175bcce623244be6bf59237fcf72b6af3d62e437f1\", \"container_image\"=>\"docker.io/checkmk/cadvisor-patched:1.5.1\"}}"}
2024-09-04 09:27:55.389995243 +0000 fluent.warn: {"error":"#<Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError: 400 - Rejected by OpenSearch>","location":null,"tag":"kubernetes.var.log.containers.checkmk-node-collector-container-metrics-wd6wm_checkmk-monitoring_cadvisor-3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60.log","time":1725441525,"record":{"stream":"stderr","logtag":"F","message":"W0904 09:18:45.198464       1 info.go:53] Couldn't collect info from any of the files in \"/etc/machine-id,/var/lib/dbus/machine-id\"","kubernetes":{"pod_name":"checkmk-node-collector-container-metrics-wd6wm","namespace_name":"checkmk-monitoring","pod_id":"f681dd37-9fc3-445d-b56a-b9294d3a3dd9","labels":{"app":"checkmk-node-collector-container-metrics","app.kubernetes.io/instance":"checkmk","app.kubernetes.io/name":"checkmk","component":"checkmk-node-collector","controller-revision-hash":"5b48d75884","pod-template-generation":"2"},"annotations":{"cni.projectcalico.org/containerID":"3a74fe8ce9ae0ae712212203d46738eac804258562f82ec04607b405d20e23cd","cni.projectcalico.org/podIP":"10.52.2.192/32","cni.projectcalico.org/podIPs":"10.52.2.192/32","k8s.v1.cni.cncf.io/network-status":"[{\n    \"name\": \"k8s-pod-network\",\n    \"ips\": [\n        \"10.52.2.192\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n}]","k8s.v1.cni.cncf.io/networks-status":"[{\n    \"name\": \"k8s-pod-network\",\n    \"ips\": [\n        \"10.52.2.192\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n}]"},"host":"sissach-harv3","container_name":"cadvisor","docker_id":"3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60","container_hash":"docker.io/checkmk/cadvisor-patched@sha256:b0fe7daf1ab6beeb28abef175bcce623244be6bf59237fcf72b6af3d62e437f1","container_image":"docker.io/checkmk/cadvisor-patched:1.5.1"}},"message":"dump an error event: error_class=Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError error=\"400 - Rejected by OpenSearch\" location=nil tag=\"kubernetes.var.log.containers.checkmk-node-collector-container-metrics-wd6wm_checkmk-monitoring_cadvisor-3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60.log\" time=2024-09-04 09:18:45.198569000 +0000 record={\"stream\"=>\"stderr\", \"logtag\"=>\"F\", \"message\"=>\"W0904 09:18:45.198464       1 info.go:53] Couldn't collect info from any of the files in \\\"/etc/machine-id,/var/lib/dbus/machine-id\\\"\", \"kubernetes\"=>{\"pod_name\"=>\"checkmk-node-collector-container-metrics-wd6wm\", \"namespace_name\"=>\"checkmk-monitoring\", \"pod_id\"=>\"f681dd37-9fc3-445d-b56a-b9294d3a3dd9\", \"labels\"=>{\"app\"=>\"checkmk-node-collector-container-metrics\", \"app.kubernetes.io/instance\"=>\"checkmk\", \"app.kubernetes.io/name\"=>\"checkmk\", \"component\"=>\"checkmk-node-collector\", \"controller-revision-hash\"=>\"5b48d75884\", \"pod-template-generation\"=>\"2\"}, \"annotations\"=>{\"cni.projectcalico.org/containerID\"=>\"3a74fe8ce9ae0ae712212203d46738eac804258562f82ec04607b405d20e23cd\", \"cni.projectcalico.org/podIP\"=>\"10.52.2.192/32\", \"cni.projectcalico.org/podIPs\"=>\"10.52.2.192/32\", \"k8s.v1.cni.cncf.io/network-status\"=>\"[{\\n    \\\"name\\\": \\\"k8s-pod-network\\\",\\n    \\\"ips\\\": [\\n        \\\"10.52.2.192\\\"\\n    ],\\n    \\\"default\\\": true,\\n    \\\"dns\\\": {}\\n}]\", \"k8s.v1.cni.cncf.io/networks-status\"=>\"[{\\n    \\\"name\\\": \\\"k8s-pod-network\\\",\\n    \\\"ips\\\": [\\n        \\\"10.52.2.192\\\"\\n    ],\\n    \\\"default\\\": true,\\n    \\\"dns\\\": {}\\n}]\"}, \"host\"=>\"sissach-harv3\", 
\"container_name\"=>\"cadvisor\", \"docker_id\"=>\"3597333cf11b7dd09210b34a3d97b7dac294ae2a017f4cc130b1a86982cb2f60\", \"container_hash\"=>\"docker.io/checkmk/cadvisor-patched@sha256:b0fe7daf1ab6beeb28abef175bcce623244be6bf59237fcf72b6af3d62e437f1\", \"container_image\"=>\"docker.io/checkmk/cadvisor-patched:1.5.1\"}}"}

Thanks for any advice.

Update:

As a workaround, restarting fluentd on the jump server every 12 hours has helped so far.

sam-hieken commented 1 month ago

This issue is still open; is there no official solution or fix yet? Like cforce said, getting an error like this is incredibly concerning in production...