fluent / helm-charts

Helm Charts for Fluentd and Fluent Bit
Apache License 2.0

RFE: provide realistic runAsNonRoot security context values for fluent-bit #330

Open joebowbeer opened 1 year ago

joebowbeer commented 1 year ago

Provide realistic values for running fluent-bit as a non-root user.

The security context comments in values.yaml are not usable:

podSecurityContext: {}
#   fsGroup: 2000

securityContext: {}
#   capabilities:
#     drop:
#     - ALL
#   readOnlyRootFilesystem: true
#   runAsNonRoot: true
#   runAsUser: 1000

Issues:

  1. The user and group ids do not exist in the fluent-bit image. AFAICT the image is based on distroless/cc-debian11 which runs as root - though it does define a nonroot user id (65532:65532).
  2. All the files in the image are owned by 0:0 (root), so runAsNonRoot probably won't suffice, at least not without some additional capabilities, such as FOWNER.
  3. Typical deployments will enable storage.path (e.g., /var/fluent-bit/state/flb-storage/), which appears to need a hostPath.

razorsk8jz commented 1 year ago

I was able to get aws-for-fluent-bit running with the following permissions. I have not seen any issues yet but will let you know if I do. I was also unable to get it running as non-root; it does not appear that fluent-bit can run unless it runs as user 0.

podSecurityContext:
  runAsUser: 0
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
    - ALL

maurosls commented 11 months ago

So running as a non-root user isn't an option at the moment? Can we confirm this?

pentago commented 6 months ago

I'd love to be able to tune securityContext to run the process as non-root, most importantly as the non-root/nobody user present in the distroless image.

Would this potential feature prevent fluent-bit from reading log files, or is there additional complexity I'm not aware of?

gsmith-sas commented 5 months ago

I think I have been able to get Fluent Bit running as a non-root user AND still use a hostPath volume for the tail database and buffering. But I'd like some feedback on my approach in case I'm missing something.

Implementing this required 3 sets of changes to the Fluent Bit Helm chart.

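  • Added a securityContext
securityContext:
  runAsUser: 3301
  fsGroup: 3301
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop: ["ALL"]
    add: ["FOWNER"]
  • Added an extra volume/mount
extraVolumeMounts:
##Existing volume mounts for parsers, etc. omitted
- mountPath: /var/log/fb-storage
  name: fb-storage
  readOnly: false

extraVolumes:
- hostPath:
    path: /var/log/fb-storage
    type: DirectoryOrCreate
  name: fb-storage
  • Added an initContainer to change owner/group on mounted volume
initContainers:
- name: chowner-fb-storage
  image: registry.hub.docker.com/library/alpine:3.12.0
  command: ["chown", "3301:3301", "/var/log/fb-storage"]
  securityContext:
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["all"]
      add: ["CHOWN"]
    runAsUser: 0
    runAsNonRoot: false
  volumeMounts:
  - name: fb-storage
    mountPath: /var/log/fb-storage

In my Fluent Bit configuration, I just pointed to the mounted volume in the storage.path parameter in the [SERVICE] section and in the DB parameter of the [INPUT] definitions for the 'tail' inputs.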

Fluent Bit has been running in this configuration for the last few hours without any problems as far as I can tell. Log messages are being collected and forwarded onto their destination (OpenSearch) with no obvious regression in the number of log messages processed. The Fluent Bit pod logs don't show any new ERROR or WARNING messages.

I've SSH'ed onto the Kubernetes nodes and things look "right":

[root@k8s-n1 /]# ps -ef|grep fluent
3301     1960622 1960497  2 17:46 ?        00:00:34 /fluent-bit/bin/fluent-bit --workdir=/fluent-bit/etc --config=/fluent-bit/etc/conf/fluent-bit.conf

[root@k8s-n1 /]# ls -l /var/log
{snip}
drwxr-xr-x   4   3301   3301        93 Apr  5 17:46 fb-storage

[root@k8s-n1 /]# ls -l /var/log/fb-storage/
total 4148
drwxr-xr-x 2 3301 root       6 Apr  5 18:11 tail.1
drwxr-xr-x 2 3301 root       6 Apr  5 18:11 tail.2
-rw-r--r-- 1 3301 root   20480 Apr  5 18:08 fb.db
-rw-r--r-- 1 3301 root   32768 Apr  5 18:11 fb.db-shm
-rw-r--r-- 1 3301 root 4120032 Apr  5 18:11 fb.db-wal

Hmmm, just noticed that the files within the FB storage directory are owned by user '3301' but the group is 'root'. I thought the fsGroup in the securityContext would have forced that to set the group to '3301'. But I think I can live with that.

Anyone see something wrong about this approach? Any hidden things I may be missing?

NOTE: I'm working with Fluent Bit 2.2.2 and Fluent Bit Helm chart version 0.43.0.

@joebowbeer If you get some time, please give this a try and see if it works for you. @PettitWesley Not sure if this would work with the AWS version of Fluent Bit and Helm chart. Let us know if you get a chance to try it out.

onap4105 commented 4 months ago

I think I have been able to get Fluent Bit running as a non-root user AND still use a hostPath volume for the tail database and buffering. But I'd like some feedback on my approach in case I'm missing something.

Implementing this required 3 sets of changes to the Fluent Bit Helm chart.

  • Added a securityContext
securityContext:
  runAsUser: 3301
  fsGroup: 3301
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop: ["ALL"]
    add: ["FOWNER"]
  • Added an extra volume/mount
extraVolumeMounts:
##Existing volume mounts for parsers, etc. omitted
- mountPath: /var/log/fb-storage
  name: fb-storage
  readOnly: false

extraVolumes:
- hostPath:
    path: /var/log/fb-storage
    type: DirectoryOrCreate
  name: fb-storage
  • Added an initContainer to change owner/group on mounted volume
initContainers:
- name: chowner-fb-storage
  image: registry.hub.docker.com/library/alpine:3.12.0
  command: ["chown", "3301:3301", "/var/log/fb-storage"]
  securityContext:
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["all"]
      add: ["CHOWN"]
    runAsUser: 0
    runAsNonRoot: false
  volumeMounts:
  - name: fb-storage
    mountPath: /var/log/fb-storage

In my Fluent Bit configuration, I just pointed to the mounted volume in the storage.path parameter in the [SERVICE] section and in the DB parameter of the [INPUT] definitions for the 'tail' inputs.
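
A minimal sketch of those two settings as Helm values, assuming the chart's config.service and config.inputs keys. The Path, Tag, and storage.type values are illustrative assumptions, not taken from the setup described above, and only the storage-related lines are shown:

```yaml
config:
  # storage.path points at the hostPath mount from extraVolumeMounts.
  service: |
    [SERVICE]
        storage.path /var/log/fb-storage/
  # DB keeps the tail position database on the same mount.
  # Path, Tag, and storage.type are assumptions for illustration.
  inputs: |
    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Tag          kube.*
        DB           /var/log/fb-storage/fb.db
        storage.type filesystem
```

Setting config.service or config.inputs most likely replaces the chart's default section rather than merging with it, so these lines would need to be folded into a complete [SERVICE] and [INPUT] definition.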

Hello,

Could you please confirm whether this solution has been tested and validated? Or are there any other solutions for this issue?

Thank you.

PettitWesley commented 4 months ago

@onap4105 I think I've tried something equivalent to this before, except I ran the chown command via ssh/exec and it did not work.

onap4105 commented 4 months ago

Thank you @PettitWesley

gsmith-sas commented 4 months ago

@PettitWesley I wonder if you ran into a timing issue: the pod has to be up and running before you can ssh/exec into it; wouldn't Fluent Bit have already come up and failed (due to file permissions) before you ssh'ed in and had a chance to change the file permissions? Or, is it possible that the issue was caused by differences between the AWS version of Fluent Bit and (non-AWS) Fluent Bit?

I continued to play around with my approach after posting this, and Fluent Bit continued to work as expected for several days. I believe I was even able to remove the FOWNER capability that the securityContext adds back. So, from my week or two of testing, this approach seems to work. I've held off on moving this to a production-like environment, hoping to get some feedback, preferably validation (or clear evidence of problems), from the wider Fluent Bit community. It's always helpful to have someone completely new try things out.
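
For anyone trying this, a sketch of the values without FOWNER, untested here. It assumes the chart's podSecurityContext and securityContext values map to the pod-level and container-level security contexts, as the commented defaults in values.yaml suggest. fsGroup is a pod-level field in Kubernetes, so it sits under podSecurityContext; runAsGroup is an assumption added for illustration, on the theory that new files get the process's primary group (gid 0 when it is unset):

```yaml
# Pod-level settings.
podSecurityContext:
  fsGroup: 3301

# Container-level settings for the fluent-bit container.
securityContext:
  runAsNonRoot: true
  runAsUser: 3301
  # Assumption, not from the thread: set the primary group so that
  # newly created files are not group-owned by root.
  runAsGroup: 3301
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop: ["ALL"]
```

As far as I know, Kubernetes does not change ownership of hostPath volumes based on fsGroup, so the initContainer chown would still be needed with these values.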

@onap4105 I'm just a Fluent Bit user so I can't offer official support or validation. Give it a try and let us know whether it works in your use-case. Thanks.

onap4105 commented 4 months ago

@gsmith-sas Below are my changes and the initial results based on your suggestions. I am still verifying and understanding the outcomes. Please let me know if you have any advice.

I used https://github.com/fluent/fluent-operator/releases/tag/v2.8.0

  # initContainers test run as non root user
  initContainers:
    - name: chowner-fb-storage
      image: registry.hub.docker.com/library/alpine:3.12.0
      command: ["chown", "3301:3301", "/fluent-bit"]
      securityContext:
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["all"]
          add: ["CHOWN"]
        runAsUser: 0
        runAsNonRoot: false
      volumeMounts:
      - name: positions
        mountPath: /fluent-bit

# Note: I think this is hardcoded in the fluent-bit image, I use it instead of creating a new fb-storage.
Volumes:
  positions:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/fluent-bit/
    HostPathType:

$ helm list -n fluentbit
NAME             NAMESPACE  REVISION  UPDATED                                STATUS  CHART                  APP VERSION
fluent-operator  fluentbit  1         2024-04-30 21:57:43.0906769 -0400 EDT  failed  fluent-operator-2.8.0  2.8.0


- fluent-operator and the fluent-bit deployment/daemonset are up and running.
```powershell
$ kubectl get all -n fluentbit
NAME                                             READY   STATUS    RESTARTS   AGE
pod/fluent-bit-8sdnh                             1/1     Running   0          9h
pod/fluent-bit-9xgm2                             1/1     Running   0          9h
pod/fluent-bit-dtqw9                             1/1     Running   0          9h
pod/fluent-bit-fdm9f                             1/1     Running   0          9h
pod/fluent-bit-g54tw                             1/1     Running   0          9h
pod/fluent-bit-t7dw9                             1/1     Running   0          9h
pod/fluent-bit-vk27g                             1/1     Running   0          9h
pod/fluent-bit-wlhvz                             1/1     Running   0          9h
pod/fluent-bit-xx5g4                             1/1     Running   0          9h
pod/fluent-operator-5d466549cb-s8cn6             1/1     Running   0          9h

NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/fluent-bit   ClusterIP   x.x.x.x          <none>        2020/TCP   9h

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/fluent-bit   9         9         9       9            9           <none>          9h

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/fluent-operator             1/1     1            1           9h

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/fluent-operator-5d466549cb             1         1         1       9h
```

Fluent Bit v2.2.2
{snip}

[2024/05/01 01:58:00] [ info] [fluent bit] version=2.2.2, commit=eeea396e88, pid=13
[2024/05/01 01:58:00] [ info] [storage] ver=1.5.1, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/01 01:58:00] [ info] [cmetrics] version=0.6.6
[2024/05/01 01:58:00] [ info] [ctraces ] version=0.4.0
[2024/05/01 01:58:00] [ info] [input:systemd:systemd.0] initializing
[2024/05/01 01:58:00] [ info] [input:systemd:systemd.0] storage_strategy='memory' (memory only)
[2024/05/01 01:58:00] [ info] [input:tail:tail.1] initializing
[2024/05/01 01:58:00] [ info] [input:tail:tail.1] storage_strategy='memory' (memory only)
[2024/05/01 01:58:00] [error] [input:tail:tail.1] parser 'cri' is not registered
[2024/05/01 01:58:00] [ info] [filter:kubernetes:kubernetes.1] https=1 host=kubernetes.default.svc port=443
[2024/05/01 01:58:00] [ info] [filter:kubernetes:kubernetes.1] token updated
[2024/05/01 01:58:00] [ info] [filter:kubernetes:kubernetes.1] local POD info OK
[2024/05/01 01:58:00] [ info] [filter:kubernetes:kubernetes.1] testing connectivity with API server...
[2024/05/01 01:58:00] [ info] [filter:kubernetes:kubernetes.1] connectivity OK
[2024/05/01 01:58:00] [ info] [output:stdout:stdout.0] worker #0 started
[2024/05/01 01:58:00] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2024/05/01 01:58:00] [ info] [sp] stream processor started


- inside fluent-bit pod
```powershell
$ id
uid=3301 gid=0(root) groups=0(root),3301

$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
3301           1  0.0  0.0 711144 11944 ?        Ssl  01:58   0:00 /fluent-bit/bin/fluent-bit-watcher
3301          13  0.2  0.0 120000 45676 ?        Sl   01:58   1:24 /fluent-bit/bin/fluent-bit --enable-hot-reload -c /fluent-bit/etc/f3301 

$ ls -lrt / | grep fluent
drwxr-xr-x   1 root root 4096 May  1 01:57 fluent-bit

$ ls -lrt /fluent-bit
total 16
drwxr-xr-x 2 root root 4096 Jan 14 16:22 log
drwxr-xr-x 1 root root 4096 Feb 18 07:53 etc
drwxr-xr-x 1 root root 4096 Feb 18 07:53 bin
drwxrwsrwt 3 root 3301  180 May  1 01:57 config
drwxr-xr-x 2 3301 3301 4096 May  1 01:57 tail

$ ls -lrt ./tail
total 4084
-rw-r--r-- 1 3301 root    8192 May  1 01:58 systemd.db
-rw-r--r-- 1 3301 root   16384 May  1 11:22 pos.db
-rw-r--r-- 1 3301 root   32768 May  1 12:21 pos.db-shm
-rw-r--r-- 1 3301 root 4120032 May  1 12:21 pos.db-wal

/var/lib/fluent-bit# ls -lrt
total 4088
-rw-r--r-- 1 3301 root    8192 May  1 01:57 systemd.db
-rw-r--r-- 1 3301 root   24576 May  1 02:04 pos.db
-rw-r--r-- 1 3301 root   32768 May  1 02:05 pos.db-shm
-rw-r--r-- 1 3301 root 4120032 May  1 02:05 pos.db-wal
```