No error or msg at all when tail input not readable

brsolomon-deloitte commented 2 years ago

Bug Report

Describe the bug

fluent-bit fails to emit any error, warning, or message at all if the file(s) matched by the tail input are not readable by the invoking user.

To Reproduce

Use IronBank fluent-bit Docker image and official fluent-bit helm chart. Helm chart parameter for image:

  image:
    repository: registry1.dso.mil/ironbank/opensource/fluent/fluent-bit
    tag: "1.9.2"
  imagePullSecrets:
    - name: <redacted>

for Kubernetes daemonset:

sh-4.4$ grep INPUT -A 10 /fluent-bit/etc/fluent-bit.conf
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    multiline.parser docker, cri
    Tag kube.*
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
    Buffer_Chunk_Size 64KB
    Buffer_Max_Size 128KB

Expected behavior

Give me any indication at all that there is a problem.

Your Environment

Version used: fluent-bit 1.9.2
Configuration: see above and below
Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.21
Helm chart version 0.19.20

Full Helm config from charts/fluent-bit/values.yaml:

fluent-bit:
  env:
    - name: ELASTICSEARCH_PASSWORD
      valueFrom:
        secretKeyRef:
          name: REDACTED
          key: REDACTED
  # We copy/modify several of the upstream default configs to increase buffer sizes
  # https://github.com/fluent/helm-charts/blob/main/charts/fluent-bit/values.yaml
  config:
    inputs: |
      [INPUT]
          Name tail
          Path /var/log/containers/*.log
          multiline.parser docker, cri
          Tag kube.*
          Mem_Buf_Limit 5MB
          Skip_Long_Lines On
          Buffer_Chunk_Size 64KB
          Buffer_Max_Size 128KB
      [INPUT]
          Name systemd
          Tag host.*
          Systemd_Filter _SYSTEMD_UNIT=kubelet.service
          Read_From_Tail On
    filters: |
      [FILTER]
          Name kubernetes
          Match kube.*
          Merge_Log On
          Keep_Log Off
          K8S-Logging.Parser On
          K8S-Logging.Exclude On
          Buffer_Size 256KB
    outputs: |
      [OUTPUT]
          Name es
          Match kube.*
          Host monitoring-es-http
          Retry_Limit False
          Index fluent-bit-kube
          HTTP_User REDACTED
          HTTP_Passwd ${ELASTICSEARCH_PASSWORD}
          tls On
          tls.verify On
          tls.ca_file /usr/local/share/ca-certificates/elasticsearch.crt
          Buffer_Size 128KB
          net.keepalive false
          Replace_Dots On
      [OUTPUT]
          Name es
          Match host.*
          Host monitoring-es-http
          Retry_Limit False
          Index fluent-bit-host
          HTTP_User REDACTED
          HTTP_Passwd ${ELASTICSEARCH_PASSWORD}
          tls On
          tls.verify On
          tls.ca_file /usr/local/share/ca-certificates/elasticsearch.crt
          Buffer_Size 128KB
          net.keepalive false
          Replace_Dots On
  daemonSetVolumes:
    - hostPath:
        path: /var/log
        type: ''
      name: varlog
    - hostPath:
        path: /var/lib/docker/containers
        type: ''
      name: varlibdockercontainers
    - hostPath:
        path: /etc/machine-id
        type: File
      name: etcmachineid
    - name: es-certs
      secret:
        secretName: REDACTED
        items:
          - key: ca.crt
            path: elasticsearch.crt
  daemonSetVolumeMounts:
    - mountPath: /var/log
      name: varlog
    - mountPath: /var/lib/docker/containers
      name: varlibdockercontainers
      readOnly: true
    - mountPath: /etc/machine-id
      name: etcmachineid
      readOnly: true
    - name: es-certs
      mountPath: /usr/local/share/ca-certificates
      readOnly: true

Now exec into container:

kubectl exec -it -n redacted ds/fluent-bit-logging -- sh

Then:

sh-4.4$ id
uid=1000(fluent) gid=1000(fluent) groups=1000(fluent)
sh-4.4$ ls -ld /var/log
drwxr-xr-x 10 root root 4096 Apr 22 08:07 /var/log
sh-4.4$ ls -la /var/log | head  
total 26676
drwxr-xr-x  10 root             root                 4096 Apr 22 08:07 .
drwxr-xr-x   1 root             root                   17 Nov  3 13:50 ..
-rw-r--r--   1 root             root                  591 Oct 25 21:13 alternatives.log
drwxr-xr-x   3 root             root                   17 Oct 25 21:13 amazon
drwx------   2 root             root                   42 Mar 12 21:38 audit
drwxr-xr-x   2 root             root                  129 Apr  4 02:29 aws-routed-eni
-rw-r--r--   1 root             root                   28 Mar 19 14:16 bigfix-install.log
-rw-------   1 root             root                  512 Oct 25 21:13 boot.log
-rw-------   1 root             utmp                    0 Apr  1 07:47 btmp
sh-4.4$ pwd
/fluent-bit

As shown here, the IronBank image runs as the user fluent, while files under /var/log such as boot.log are readable only by root. Sure, one fix is to correct the read permissions or the ownership. However, the purpose of this issue is to show that fluent-bit doesn't produce any useful logs to point out this error. The pods are healthy and this is the only log I get:

[2022/04/22 13:38:09] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2022/04/22 13:38:09] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2022/04/22 13:38:09] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2022/04/22 13:38:09] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2022/04/22 13:38:09] [ info] [output:es:es.0] worker #0 started
[2022/04/22 13:38:09] [ info] [output:es:es.0] worker #1 started
[2022/04/22 13:38:09] [ info] [output:es:es.1] worker #0 started
[2022/04/22 13:38:09] [ info] [output:es:es.1] worker #1 started
[2022/04/22 13:38:09] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/04/22 13:38:09] [ info] [sp] stream processor started

patrick-stephens commented 2 years ago

What if you up the log_level in the service configuration? This is how I generally debug file permission or mount point errors (e.g. symlinks to a non-existent mount point). Similarly for issues in other plugins, first step is to up the logging level to see what is happening in more detail.
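For reference, a minimal sketch of raising the log level in the classic configuration format used in the report above. Log_Level is a standard key of the [SERVICE] section; the value debug is only a suggestion, and how this section gets injected depends on the Helm chart values in use, which is an assumption left open here:

```
[SERVICE]
    # Default is info; debug (or trace) prints more detail from each plugin.
    Log_Level debug
```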

Fluent Bit asks for the list of files it is allowed to see and gets those it is allowed to see. How would it know there are other files to see? What if there was a mix of permissions so it got some files but not others, how would it know it did not have all? I don't think the input plugin could or should be able to tell you this, it can merely tell you what it is told/allowed to see. It does not know there is an issue so it cannot report an issue.

The overall log level can give you that extra information though - we don't want the logs to be full of extra debug either when it is running with a correct configuration. Some log files may be deliberately excluded as well from selection (e.g. by time of last update) so I'm not sure there is a general check that could be done to pick up missed files, particularly when you should not even be able to know of their existence from a security perspective.

However, if you have a suggestion for how we could detect this specific failure generally then that would be ace! :+1:

brsolomon-deloitte commented 2 years ago

In the example above, fluent-bit is given a tail input for /var/log/containers/*.log. (This is ultimately a symlink to /var/log/.) It seems to me that a reasonable behavior would be to warn about unreadable files that match the input glob or exact file path, at the default log level.

In this case that seems like it would be quite possible to enable. /var/log is mode 0755, owned by root:root, and the various /var/log/*.log are mode 0600, owned by root:root. A non-root user can detect that those files exist because of the directory's x bit, but can also discern that it is prohibited from reading them, e.g. stat /var/log/foo.log succeeds while cat /var/log/foo.log or test -r /var/log/foo.log fails. Having to turn on a logging level more verbose than the default doesn't seem like it should be necessary. Not being able to read a file that fluent-bit has been specifically told to read, whether through a glob or an exact path, seems like something that should be logged at the INFO level.
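A minimal shell sketch of the check described above, assuming a hypothetical helper named report_unreadable. It is not how the tail plugin is implemented; it only illustrates that an unprivileged process can tell "matched but unreadable" apart from "no match":

```shell
#!/bin/sh
# Hypothetical helper (not part of fluent-bit): print a warning for every
# given path that exists (or is a symlink) but cannot be read by the
# current user. A symlink whose target is missing or unreachable is
# reported too, since it cannot be read either.
report_unreadable() {
    for f in "$@"; do
        # An unmatched glob is passed through literally and names nothing
        # on disk; skip it silently.
        if [ ! -e "$f" ] && [ ! -L "$f" ]; then
            continue
        fi
        if [ ! -r "$f" ]; then
            echo "warn: $f matches but is not readable by uid $(id -u)"
        fi
    done
}
```

Calling it as report_unreadable /var/log/containers/*.log inside the container lets the shell expand the glob first, so the helper only ever sees concrete paths. Note that test -r always succeeds for root on existing files, so the check is only meaningful for an unprivileged user such as fluent.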

patrick-stephens commented 2 years ago

Ah right, in this specific case there is a test that could be made for it. I guess there may be additional performance concerns with these checks at scale (e.g. with thousands of rapidly rotating log files or other pathological cases), but that is something we can test and/or potentially have a configuration option for.

Please submit a PR to cover the changes. At the moment the guidance is to use the existing additional logging to detect it, which is not perfect, as you say.

brsolomon-deloitte commented 2 years ago

"Ah right, in this specific case there is a test that could be made for it."

Per your original response, isn't there already a check happening, just one that is logged only at a more verbose log level and not at the default one? My proposal here would be to emit a warning for this check at the default log level.

patrick-stephens commented 2 years ago

Possibly related to #2526 although the use cases may be the opposite: that one wants to reduce log noise when folders are empty and this wants to trigger more logs on misconfiguration.

fluent / fluent-bit

No error or msg at all when tail input not readable #5346
