fluent / fluent-plugin-prometheus

A fluentd plugin that collects metrics and exposes them for Prometheus.
Apache License 2.0

Input plugin does not expose port on arbitrary node #92

Open tetramin opened 5 years ago

tetramin commented 5 years ago

I'm starting fluentd with the prometheus plugin as a DaemonSet in GKE. Often, when the container starts, port 24231 is not opened, although on other nodes the port is exposed normally. When I run ss -lnt inside the container, I do not see the port listening.

There are no suspicious messages in the logs even in trace mode.

How can I debug this?
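For context, the prometheus input source involved here would look roughly like the following minimal sketch, assuming the plugin's default bind address, port (24231), and metrics path; the actual config may differ:

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>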

SergeyD commented 5 years ago

We have the same issue on k8s clusters running on bare-metal instances: 7 of 54 containers with identical configs are not listening on the port, with no log messages to explain it (debug level enabled).

SergeyD commented 5 years ago

I did some experiments by editing the fluentd configuration file directly in running containers and applying the changes by sending a HUP signal to the fluentd process to reload the config.

Two observations:

  1. Disabling the in_tail input (which tails k8s container logs) resolves the binding issue after a config reload.
  2. With in_tail enabled and the elasticsearch output replaced with

    <match kube.**>
      @type null
    </match>

    the plugin was able to bind after a short time. After that I reverted the output back to the elasticsearch plugin, reloaded the config, and the plugin was still able to bind.

It looks like heavy input queue pressure prevents the plugin from binding at startup.

SergeyD commented 5 years ago

I've found the issue and a solution for my case. We have two inputs configured for our fluentd instances. One of them is in_tail for k8s logs:

<source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      tag kubernetes.*
      format none
      refresh_interval 60
</source>

From the startup logs in debug mode it can be seen that in_tail found around 80 tailing paths, separated by commas:

2019-05-25 16:24:34 +0000 [debug]: #0 tailing paths: target = /var/log/containers/XXX.log,/var/log/containers/YYY.log,...,....

The next message is:

2019-05-25 16:24:34 +0000 [info]: #0 following tail of /var/log/containers/XXX.log

And there are no more "following tail of ..." messages, so it is apparently stuck on file XXX.log, which is around 7 GB in my case.

I've added the skip_refresh_on_startup true setting to the in_tail plugin, and now the monitoring HTTP source is able to bind on startup.
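For reference, this is the in_tail source from above with that setting added:

<source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      tag kubernetes.*
      format none
      refresh_interval 60
      skip_refresh_on_startup true
</source>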

repeatedly commented 5 years ago

Yes. in_tail launches and starts file watchers during the startup phase by default. This is no problem in most cases, but if you have large files, it takes a long time to process them. skip_refresh_on_startup avoids this problem by disabling the watcher launch at startup.