bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Cannot execute binaries stored in an NFS Server running on a Bottlerocket node #4116

Open liam-mackie opened 1 month ago

liam-mackie commented 1 month ago

Image I'm using: Bottlerocket OS 1.20.4 (aws-k8s-1.30)

Context: We have software that runs multiple pods for the stages of a pipeline. To do this dynamically and allow retries of specific steps, we spawn short-lived pods that connect to an NFS server running in-cluster for their ephemeral data. A typical installation starts with just the orchestrator and the NFS server; when the orchestrator receives a piece of work, it spawns the short-lived pods for each stage and points them at the NFS share.

The NFS server is a simple variant of this alpine server.

What I expected to happen: When running an NFS server in a container on Bottlerocket, you should be able to execute files on the share from a mount in a different container.

What actually happened: The nfsd process is denied execute access, as shown in this AVC denial:

Jul 26 01:20:06 ip-10-0-19-55.ap-southeast-2.compute.internal audit[2830356]: AVC avc:  denied  { execute } for  pid=2830356 comm="nfsd" name="bootstrapRunner" dev="nvme1n1p1" ino=151427631 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:data_t:s0:c432,c649 tclass=file permissive=0

From what I can tell, this is because the process is running as a kernel task, even though it's actually exposing data from a share from a container. My current line of thinking is that this is because it's a privileged container and actually hooking into the kernel-level support. The nfsd processes have the system_u:system_r:kernel_t:s0 SELinux context, and are not children of the NFS server pod.
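For reference, you can see the context of those threads from a root shell on the host (for example via the admin container); the kernel nfsd threads all carry the kernel_t label regardless of which pod exports the share:

# ps -eZ | grep nfsd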

What I've tried to do to work around the problem: I've attempted to work around this problem by using EFS rather than locally hosting, but when using access points and dynamically provisioned volumes, chmod commands get permission denied, which fails many scripts (and even tar in some cases).

How to reproduce the problem: To reproduce the problem, you can create the resources I've added below in a Kubernetes cluster that is running Bottlerocket OS 1.20.4. I have been doing this in an AWS EKS cluster.

You will be able to see the logs by running logdog from the admin container on the node running the NFS server (not the node running the nfs-client pod). To run this reproduction, you will also need the NFS CSI driver, which you can install using helm:

helm upgrade --install --atomic \
--repo https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts \
--namespace kube-system \
--version v4.6.0 \
csi-driver-nfs \
csi-driver-nfs
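As an alternative to collecting a full logdog bundle, you can also watch for the denial live from the node's admin container (this assumes the standard Bottlerocket admin container, which provides sheltie for a host root shell):

sudo sheltie
journalctl -f | grep -i 'avc.*denied'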

If you deploy this outside of the default namespace, please adjust the server URL to instead point to the namespace you're deploying to - replace nfs.default.svc.cluster.local with nfs.<your-namespace>.svc.cluster.local.
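For example, assuming the PV manifest below is saved as nfs-pv.yaml (an illustrative filename) and you are deploying to a namespace called my-namespace, the substitution could be done with:

sed -i 's/nfs\.default\.svc\.cluster\.local/nfs.my-namespace.svc.cluster.local/' nfs-pv.yaml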

Resources:

NFS Server

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: nfs
  serviceName: "nfs"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nfs
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
      containers:
      - env:
        - name: SHARED_DIRECTORY
          value: /octopus
        - name: SYNC
          value: "true"
        image: octopusdeploy/nfs-server:1.0.1
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          protocol: TCP
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /octopus
          name: octopus-volume
      restartPolicy: Always
      volumes:
      - emptyDir:
          sizeLimit: 10Gi
        name: octopus-volume
  updateStrategy:
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: nfs
spec:
  clusterIP: None
  ports:
  - name: nfs
    port: 2049
    protocol: TCP
    targetPort: 2049
  selector:
    app.kubernetes.io/name: nfs
  sessionAffinity: None
  type: ClusterIP

PV/PVC

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-10gi
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: nfs.csi.k8s.io
    volumeAttributes:
      server: nfs.default.svc.cluster.local
      share: /
    volumeHandle: nfs.default.svc.cluster.local/octopus##
  mountOptions:
  - nfsvers=4.1
  - lookupcache=none
  - soft
  - timeo=50
  - retrans=4
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-10gi
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-csi

Client Pod


apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client
spec:
  selector:
    matchLabels:
      app: nfs-client
  template:
    metadata:
      labels:
        app: nfs-client
    spec:
      containers:
      - name: nfs-client
        image: alpine
        command: ["sh"]
        args: 
        - -c
        - 'echo "echo \"hello world\"" > /octopus/runme.sh && chmod +x /octopus/runme.sh && sh -c "/octopus/runme.sh"'
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        volumeMounts:
        - mountPath: /octopus
          name: mount
      volumes:
      - name: mount
        persistentVolumeClaim:
          claimName: nfs-pvc-10gi
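After applying the resources, the failure is visible in the client pod's logs: the pod writes /octopus/runme.sh and marks it executable, but on an affected Bottlerocket node the final sh -c step fails instead of printing "hello world". For example:

kubectl logs deploy/nfs-client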
larvacea commented 1 month ago

Thanks for the report; I am investigating, and I will let you know what I find out. In the meantime, I can offer some other persistent storage options, in case any of them would be helpful. You mention both self-hosted NFS and EFS; a few other RWX-capable volume types may be worth considering as well.

bcressey commented 1 month ago

You may be running into a variation of the behavior discussed here:

For overlayfs, the mounting process credentials are saved and used for subsequent access checks from other processes, so those credentials need to grant a superset of permissions.

nfsd isn't actually trying to execute the binary itself (it's a kernel thread, it can't really do that); it's just having its permissions checked (because of overlayfs), and it doesn't have the execute permission, so the action is blocked.

One way to work around this might be to mount a directory from the host's /local as a hostPath volume and use that as the NFS server root. That will avoid the overlayfs permission check that I suspect is causing this denial. (Other volume types should work too.)
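As a rough sketch (the /local/octopus path is just an example), the emptyDir in the StatefulSet above could be swapped for a hostPath volume:

      volumes:
      - name: octopus-volume
        hostPath:
          # example directory under the host's /local
          path: /local/octopus
          type: DirectoryOrCreate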

larvacea commented 1 month ago

If you can, we'd love to hear back how these suggestions are working (or not working) for you. Thanks!

liam-mackie commented 1 month ago

Hi! Sorry for the late reply; for some reason, GitHub decided that I did not want to receive emails about this issue 🤦. Thanks for the excellent suggestions about different RWX volume types, though since we need to support many other node types and environments, I'm uncertain whether they're suitable. The most promising so far is simply using hostPath, which I'll test now and get back to you with results. I did assume that nfsd wasn't actually attempting to execute the file, but was just performing an access check - thanks for linking me to the behaviour with overlayfs, this connects many of the dots for me!

liam-mackie commented 1 month ago

Unfortunately, we still get the same issue when mounting from /local. The AVC denial:

Aug 05 23:30:45 ip-10-0-42-10.ap-southeast-2.compute.internal audit[45476]: AVC avc:  denied  { execute } for  pid=45476 comm="nfsd" name="exec.sh" dev="nvme1n1p1" ino=18270915 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0

The file:

bash-5.1# ls -laZi  ./test/
total 4
18270913 drwxr-xr-x. 2 root root system_u:object_r:local_t:s0 21 Aug  5 23:30 .
  457164 drwxr-xr-x. 5 root root system_u:object_r:local_t:s0 50 Aug  5 23:30 ..
18270915 -rwxr-xr-x. 1 root root system_u:object_r:local_t:s0 13 Aug  5 23:30 exec.sh

It seems that nfsd is still attempting the permission check - I'm not sure if this is something I've done wrong in the mount. Any ideas?

bcressey commented 3 weeks ago

It still seems that nfsd is still attempting to check the permissions [...]

I need to set up a repro case locally to try to understand what's going on with SELinux, but I expect it'll need a policy fix on the Bottlerocket side.

liam-mackie commented 1 week ago

It still seems that nfsd is still attempting to check the permissions [...]

I need to set up a repro case locally to try to understand what's going on with SELinux, but I expect it'll need a policy fix on the Bottlerocket side.

Hi Ben! I was wondering if there's anything I could do to help repro this issue locally, or if I can help with my existing repro at all?

bcressey commented 3 days ago

Hey Liam - I've been able to repro the issue using the steps you provided. Thanks for the detailed instructions.

Despite what I wrote earlier, there doesn't seem to be any overlayfs involvement here. octopus-volume is just a directory under /var/lib/kubelet/pods labeled with the pod's SELinux pair and bind-mounted in:

# grep octopus /proc/$(pgrep nfsd.sh)/mountinfo
4327 4319 259:17 /var/lib/kubelet/pods/d98aa2fb-12c2-4e16-a5d0-e829c60a490f/volumes/kubernetes.io~empty-dir/octopus-volume /octopus rw,nosuid,nodev,noatime - xfs /dev/nvme1n1p1 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=8,swidth=8,noquota
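The label on that directory matches the tcontext from the original denial (data_t plus the pod's category pair), which can be confirmed with:

# ls -dZ /var/lib/kubelet/pods/d98aa2fb-12c2-4e16-a5d0-e829c60a490f/volumes/kubernetes.io~empty-dir/octopus-volume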

I prodded at it with ftrace:

# cd /sys/kernel/tracing
# echo -n 10 > max_graph_depth                # limit the call graph depth
# echo nfsd_permission > set_graph_function   # only graph calls under nfsd_permission
# echo -n function_graph > current_tracer     # enable the function_graph tracer
# cat trace

... and it just looks like a straightforward SELinux permission check failure, where nfsd checks the inode permission, which checks the SELinux permission, which says that kernel_t can't execute a data_t file:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   2.200 us    |  nfsd_permission [nfsd]();
 1)   0.610 us    |  nfsd_permission [nfsd]();
 ------------------------------------------
 1)   nfsd-7795    =>   nfsd-7794
 ------------------------------------------

 1)   1.120 us    |  nfsd_permission [nfsd]();
 1)               |  nfsd_permission [nfsd]() {
 1)               |    inode_permission() {
 1)   0.550 us    |      generic_permission();
 1)               |      security_inode_permission() {
 1)               |        selinux_inode_permission() {
 1)   0.780 us    |          __inode_security_revalidate();
 1)   0.530 us    |          __rcu_read_lock();
 1)   0.540 us    |          avc_lookup();
 1)   0.540 us    |          __rcu_read_unlock();
 1)   4.650 us    |        }
 1)   0.560 us    |        bpf_lsm_inode_permission();
 1)   6.720 us    |      }
 1)   8.740 us    |    }
 1)   9.850 us    |  }
 1)   0.630 us    |  nfsd_permission [nfsd]();
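(To undo the tracing setup afterwards, something like the following restores the defaults:)

# echo nop > current_tracer
# echo > set_graph_function
# echo 0 > max_graph_depth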

Unfortunately I'm still not sure on what the best way to fix this is.

liam-mackie commented 3 days ago

Thanks for the update, Ben! I've been able to get this working by using a userspace NFS implementation (ganesha-nfs) instead of the kernel implementation, since the inode checks then seem to happen in the context of the container instead of the kernel.
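For anyone else who hits this, one off-the-shelf way to run a Ganesha-based userspace NFS server in-cluster (as an illustration - chart coordinates from memory, so double-check them against the kubernetes-sigs project's README) is roughly:

helm repo add nfs-ganesha https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install nfs-server nfs-ganesha/nfs-server-provisioner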

At this point, I think the only way this would work is if nfsd ran in a different context (preferably the container exporting the mount).

I don't know enough about SELinux to tell if that's a terrible idea or not, or if that's even possible. I think we can probably close this for now, with the understanding that userspace NFS implementations are preferred.

bcressey commented 2 days ago

I don't know enough about SELinux to tell if that's a terrible idea or not, or if that's even possible. I think we can probably close this for now, with the understanding that userspace NFS implementations are preferred.

I have a couple ideas that I'd like to explore, so I'm happy to keep it open until there's some kind of resolution.

For the first idea: the /opt/csi directory on the host is special-cased so that privileged containers can write to it, and some host programs can execute files there. This was added in #3779 to support the S3 CSI driver. Right now only init_t can execute the files, but we could potentially allow kernel_t to execute them as well. The catch would be that the NFS shares would all have to use a hostPath volume from under that directory, which would be annoying.

My other idea is to allow kernel_t the "execute" permission, but to have it trigger a transition to a different type, and then block that transition to prevent execution. Roughly:

; always transition from "kernel_t" to "forbidden_t" when executing a "data_t" file
(typetransition kernel_t container_exec_o process forbidden_t)

; but don't actually allow this transition to take place
(neverallow kernel_t forbidden_t (processes (transform)))

That would have the property that nfsd (which must run as kernel_t) would pass these inode permission checks, while still preventing the kernel from actually executing untrusted binaries (which is the policy objective, and which nfsd doesn't need to do). However, I need to write some test cases to be sure that it's doing the right thing, and still blocking what it's meant to block.
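For test cases, one quick sanity check against a policy build with this change would be setools' sesearch (run wherever setools is available, optionally pointing it at the compiled policy file):

# does kernel_t now pass the execute check on data_t files?
sesearch --allow -s kernel_t -t data_t -c file -p execute

# is the forced transition to forbidden_t in place?
sesearch --type_trans -s kernel_t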

liam-mackie commented 2 days ago

My other idea is to allow kernel_t the "execute" permission, but to have it trigger a transition to a different type, and then block that transition to prevent execution.

That's an ingenious way to solve the problem! Hopefully it works - I think it's a better fix than forcing NFS to use hostPath volumes.

Thanks for your help with this, by the way. Investigating this problem has opened my eyes a lot to how SELinux and Bottlerocket work in general, and Bottlerocket is definitely becoming my distro of choice for EKS!