fluent / helm-charts

Helm Charts for Fluentd and Fluent Bit
Apache License 2.0

Fluentd Liveness/Readiness Probes failing helm chart version 0.2.12 #176

Open

jessectl commented 3 years ago
### Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  42m                   default-scheduler  Successfully assigned kube-system/fluentd-wd5fg to ip-xx-xx-xx-xx.ec2.internal
  Normal   Pulled     41m                   kubelet            Successfully pulled image "xxx.dkr.ecr.us-east-1.amazonaws.com/nite/fluentd:1.0.1" in 145.419191ms
  Normal   Created    41m (x2 over 41m)     kubelet            Created container fluentd
  Normal   Started    41m (x2 over 41m)     kubelet            Started container fluentd
  Normal   Pulled     41m                   kubelet            Successfully pulled image "xxx.dkr.ecr.us-east.xxx/fluentd:1.0.1" in 309.279381ms
  Normal   Killing    40m (x2 over 41m)     kubelet            Container fluentd failed liveness probe, will be restarted
  Warning  Unhealthy  40m (x8 over 41m)     kubelet            Readiness probe failed: Get "http://xx.xx.xx.xxx:24231/metrics": dial tcp xx.xx.xxx.xx:24231: connect: connection refused
  Normal   Pulling    40m (x3 over 41m)     kubelet            Pulling image "xxxx.dkr.ecr.us-east-1.xxx/fluentd:1.0.1"
  Warning  Unhealthy  21m (x31 over 41m)    kubelet            Liveness probe failed: Get "http://xx.xx.xx.xx:24231/metrics": dial tcp xx.xx.xxx.xxx:24231: connect: connection refused
  Warning  BackOff    117s (x140 over 38m)  kubelet            Back-off restarting failed container

EKS Version:

v1.19.6-eks
patrick-stephens commented 2 years ago

This seems to be an issue with the default install, can reproduce on KIND as well:

kind create cluster
helm install fluentd fluent/fluentd
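
For context, the probes that fail come from the chart's defaults, which target the Prometheus metrics endpoint on port 24231. Roughly (paraphrased, an assumption — check the chart's `values.yaml` for the exact defaults):

```yaml
# Approximate chart defaults: both probes hit the metrics endpoint,
# so if Fluentd stops serving metrics, both probes fail.
livenessProbe:
  httpGet:
    path: /metrics
    port: metrics
readinessProbe:
  httpGet:
    path: /metrics
    port: metrics
```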
patrick-stephens commented 2 years ago

The issue seems to be that without a valid Elasticsearch deployment to talk to, Fluentd fails the probes. With the default values, the chart expects an elasticsearch-master service in the current namespace. I can get it to function by just deploying the Helm chart for Elasticsearch as per: https://github.com/calyptia/fluent-bit-devtools/blob/main/deploy-elastic.sh

kind create cluster
git clone https://github.com/calyptia/fluent-bit-devtools.git ./github/calyptia/fluent-bit-devtools/
ES_NAMESPACE=default ./github/calyptia/fluent-bit-devtools/deploy-elastic.sh
helm install fluentd fluent/fluentd

The logs without Elasticsearch deployed show errors trying to connect:

2022-02-04 12:10:18 +0000 [info]: adding match in @FLUENT_LOG pattern="**" type="null"
2022-02-04 12:10:18 +0000 [info]: adding match in @KUBERNETES pattern="kubernetes.var.log.containers.fluentd**" type="relabel"
2022-02-04 12:10:18 +0000 [info]: adding filter in @KUBERNETES pattern="kubernetes.**" type="kubernetes_metadata"
2022-02-04 12:10:18 +0000 [info]: adding match in @KUBERNETES pattern="**" type="relabel"
2022-02-04 12:10:18 +0000 [info]: adding filter in @DISPATCH pattern="**" type="prometheus"
2022-02-04 12:10:18 +0000 [info]: adding match in @DISPATCH pattern="**" type="relabel"
2022-02-04 12:10:18 +0000 [info]: adding match in @OUTPUT pattern="**" type="elasticsearch"
2022-02-04 12:10:20 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. no address for elasticsearch-master (Resolv::ResolvError)
2022-02-04 12:10:20 +0000 [warn]: #0 Remaining retry: 14. Retry to communicate after 2 second(s).
2022-02-04 12:10:24 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. no address for elasticsearch-master (Resolv::ResolvError)
2022-02-04 12:10:24 +0000 [warn]: #0 Remaining retry: 13. Retry to communicate after 4 second(s).
2022-02-04 12:10:32 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. no address for elasticsearch-master (Resolv::ResolvError)
2022-02-04 12:10:32 +0000 [warn]: #0 Remaining retry: 12. Retry to communicate after 8 second(s).
2022-02-04 12:10:47 +0000 [info]: Received graceful stop
2022-02-04 12:10:48 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. no address for elasticsearch-master (Resolv::ResolvError)
2022-02-04 12:10:48 +0000 [warn]: #0 Remaining retry: 11. Retry to communicate after 16 second(s).

With Elasticsearch, it shows more work going on and is responding to probes:

2022-02-04 13:25:01 +0000 [info]: starting fluentd-1.12.4 pid=7 ruby="2.6.7"
2022-02-04 13:25:01 +0000 [info]: spawn command to main:  cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd", "-c", "/fluentd/etc/../../../etc/fluent/fluent.conf", "-p", "/fluentd/plugins", "--gemfile", "/fluentd/Gemfile", "-r", "/fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb", "--under-supervisor"]
2022-02-04 13:25:02 +0000 [info]: adding match in @FLUENT_LOG pattern="**" type="null"
2022-02-04 13:25:02 +0000 [info]: adding match in @KUBERNETES pattern="kubernetes.var.log.containers.fluentd**" type="relabel"
2022-02-04 13:25:02 +0000 [info]: adding filter in @KUBERNETES pattern="kubernetes.**" type="kubernetes_metadata"
2022-02-04 13:25:02 +0000 [info]: adding match in @KUBERNETES pattern="**" type="relabel"
2022-02-04 13:25:02 +0000 [info]: adding filter in @DISPATCH pattern="**" type="prometheus"
2022-02-04 13:25:02 +0000 [info]: adding match in @DISPATCH pattern="**" type="relabel"
2022-02-04 13:25:02 +0000 [info]: adding match in @OUTPUT pattern="**" type="elasticsearch"
2022-02-04 13:25:02 +0000 [warn]: #0 Detected ES 7.x: `_doc` will be used as the document `_type`.
warning: 299 Elasticsearch-7.16.3-4e6e4eab2297e949ec994e688dad46290d018022 "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html to enable security."
2022-02-04 13:25:02 +0000 [info]: adding source type="tail"
2022-02-04 13:25:02 +0000 [info]: adding source type="prometheus"
2022-02-04 13:25:02 +0000 [info]: adding source type="prometheus_monitor"
2022-02-04 13:25:02 +0000 [info]: adding source type="prometheus_output_monitor"
2022-02-04 13:25:02 +0000 [info]: #0 starting fluentd worker pid=18 ppid=7 worker=0
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/kube-proxy-xcqpb_kube-system_kube-proxy-b99becf3fc759b549d60c05ee0af95b9f0da208b080bdf2617f2ef85a97b514b.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/kube-apiserver-kind-control-plane_kube-system_kube-apiserver-4bf5c88cc9439552051a58b5392f7a5693a2b4b4abf0670d910aca01934ae7af.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/etcd-kind-control-plane_kube-system_etcd-cec408087c25acfbf2b11f95cd577df56dbb1ef348939a40341113524889082f.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/coredns-558bd4d5db-44ph8_kube-system_coredns-87926b7ea1f0f58635be68d3c86680c5d35dd1289d871a46742b90f4deecc53e.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/coredns-558bd4d5db-g6cl5_kube-system_coredns-44e2f07ffb61a34b05f91d3a1ccd8ccc8526c511cb64c7c8a2b693babac99b75.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/elasticsearch-master-0_default_configure-sysctl-beccfb1c4781a40d7a08afb835995dfcc7eb56a7883d34bf4548802ae06f8cc2.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/kindnet-722m8_kube-system_kindnet-cni-647d41b1cd2890532ed93a7c5a4a2dab3fddadfce0e2d0fd69e0f29982793c9f.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/fluentd-lff9b_default_fluentd-a0a83822759937595f118c9dd8bde77ecc825dbe2a4320a465d8751001fc41c7.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/elasticsearch-master-0_default_elasticsearch-4b897045b535ce0301f1565e39837ea24c12e6586373ed1378476f1a4c77cb35.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-571ba02726d6857a7f4426ef241d54dfa312df202f07023fa186dd23ae52095e.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/kube-scheduler-kind-control-plane_kube-system_kube-scheduler-ef36c44dfc76041dfa7e992153d37b651003dcc0ddb709b2348553f709463406.log
2022-02-04 13:25:02 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/local-path-provisioner-547f784dff-sjzzp_local-path-storage_local-path-provisioner-0636e525ac4ecab7c53d99c9f049fb66e900b9f4a1ccbf1fb56e0aa141c821e4.log
2022-02-04 13:25:02 +0000 [info]: #0 fluentd worker is now running worker=0
2022-02-04 13:25:33 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 5, pod_cache_size: 12, namespace_cache_miss: 8, pod_cache_watch_updates: 1, pod_cache_api_updates: 11, id_cache_miss: 11, pod_cache_host_updates: 12, namespace_cache_host_updates: 5
warning: 299 Elasticsearch-7.16.3-4e6e4eab2297e949ec994e688dad46290d018022 "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html to enable security.", 299 Elasticsearch-7.16.3-4e6e4eab2297e949ec994e688dad46290d018022 "[types removal] Specifying types in bulk requests is deprecated."
mickeypash commented 2 years ago

I can confirm that @patrick-stephens's code worked for me 👍

patrick-stephens commented 2 years ago

It seems to be a more general Fluentd problem: when it cannot connect to its output, it also stops servicing the metrics port. It just shows up here because that port is what the probes use.
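
That observation explains the `connection refused` lines in the events: the kubelet's httpGet probe must first open a TCP connection to the metrics port, and if Fluentd is not servicing that port, nothing is listening and the connect is refused outright. A small illustrative sketch of that first step (plain Python, not Fluentd or kubelet code):

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Mimic the first step of a kubelet httpGet probe: open a TCP
    connection. With no listener on the port, this fails with
    'connection refused', just like the probe errors in the events above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError, timeout, unreachable, ...
        return False
```

A Fluentd that is stuck retrying its output and not serving the metrics port behaves like the closed-port case, so both liveness and readiness fail even though the process itself is alive.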

dioguerra commented 2 years ago

Hello, I found the same problem, but with IPv6-enabled clusters.

Can anyone reproduce? I'm thinking of creating an MR to use https://docs.fluentd.org/monitoring-fluentd instead of /metrics

patrick-stephens commented 2 years ago

Yes, see above. You need the Elasticsearch output to be reachable, otherwise it fails the metrics probe.
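
If you already have an Elasticsearch cluster elsewhere, another way out is overriding the output host in the values rather than deploying a default elasticsearch-master. A minimal sketch (the service name is a placeholder and the config key is an assumption — check the chart's default `values.yaml` for the exact key):

```yaml
fileConfigs:
  04_outputs.conf: |-
    <label @OUTPUT>
      <match **>
        @type elasticsearch
        # point at an existing cluster instead of the default elasticsearch-master
        host my-elasticsearch.logging.svc.cluster.local
        port 9200
      </match>
    </label>
```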

gangefors commented 1 year ago

Edit: Ignore this comment, it's not valid. I was using a custom image with additional plugins that was based on Bitnami's fluentd image rather than fluentd's own image; I just didn't notice it at first. I should be using Bitnami's Helm chart. Sorry for the noise.


Since I don't want to deploy ES just to get Fluentd working in my cluster, I looked into the monitoring-agent solution suggested by @dioguerra.

I managed to get the Helm chart deployment working by pointing the liveness/readiness probes at the monitor_agent, using these values:

livenessProbe:
  httpGet:
    path: /api/plugins.json
    port: 24220
readinessProbe:
  httpGet:
    path: /api/plugins.json
    port: 24220
fileConfigs:
  monitor_agent.conf: |-
    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>

I have no clue whether these endpoints are as representative as /metrics, but at least the pods are considered ready with this config.

dioguerra commented 1 year ago

Is this still happening? It should be fixed by default now...

Or was the fix dropped? If so, a note should be added.

gangefors commented 1 year ago

@dioguerra

> Is this still happening? It should be fixed by default now...
>
> Or was the fix dropped? If so, a note should be added.

See my edited comment. Sorry for the noise.

patrickshan commented 5 days ago

I've just tried fluent/fluentd-kubernetes-daemonset:v1.16-debian-opensearch-amd64-2, which seems to still have this problem: the pod ends up in a crash loop. After switching both the liveness and readiness probes to the /api/plugins.json endpoint, it stabilized.