bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Mkfs fails even from privileged containers #2319

Open Arau opened 2 years ago

Arau commented 2 years ago

Hi,

we want to run Ondat on Bottlerocket because a customer requires Bottlerocket support. Ondat is a CSI driver for Kubernetes that, among other things, implements a storage engine. It orchestrates data, volume attachments, PVC mounts, etc. The Ondat storage engine runs as a DaemonSet in Kubernetes and mounts the host filesystem at /var/lib/storageos, where the data from PVCs is persisted. That means the DaemonSet Pods need to be able to create special device files and create filesystems on top of those devices. We found that even though the DaemonSet Pod is running with:

    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - SYS_ADMIN
      privileged: true

Ondat cannot mkfs the device files.

We used the admin container to reproduce what the storage engine attempts, and it looks like SELinux is denying the required permissions.

bash-5.1#
bash-5.1# mkfs.ext4 ./v.00000000-0000-0000-0000-000000001000
mke2fs 1.46.5 (30-Dec-2021)
mkfs.ext4: Permission denied while trying to determine filesystem size

The device file cannot be opened, thus the filesystem cannot be created.

[root@admin]# dd if=/.bottlerocket/rootfs/var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000 of=/dev/null bs=4k count=1
dd: failed to open '/.bottlerocket/rootfs/var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000': Permission denied

Context

What I expected to happen

I expect to be able to create devices and create filesystems. We think we have narrowed the issue to SELinux, so it would be great to know if what we need to do is possible using Bottlerocket.

Misc

We understand that this must be possible, since EBS volumes, for instance, can be attached to Bottlerocket and formatted.

Could you please help us understand what needs to be done?

stmcginnis commented 2 years ago

Hello @Arau - thanks for filing an issue. Unfortunately, this will not work the way you are expecting.

One of Bottlerocket's security features is its read-only root filesystem. You can read more about the design here: https://aws.amazon.com/blogs/opensource/security-features-of-bottlerocket-an-open-source-linux-based-operating-system/

This means you will not be able to write to anything under the /.bottlerocket/rootfs path.

Edit: Sorry, I was just looking at the root path. /.bottlerocket/rootfs/var apparently should work.

bcressey commented 2 years ago

Can you check dmesg on an affected node for avc messages? Those correspond to SELinux denials and will help narrow down the issue.
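A minimal sketch of that check, assuming a shell on the affected node with permission to read the kernel log. The function name `filter_avc_denials` is hypothetical; it only filters whatever log text it is given on stdin.

```shell
# Print only SELinux AVC denial lines from a kernel log read on stdin.
# An empty result is not treated as an error, since "no denials" is a valid answer.
filter_avc_denials() {
    grep -E 'avc:[[:space:]]+denied' || true
}

# Typical use on a node (reading the kernel log may require root):
#   dmesg | filter_avc_denials
```

If this prints nothing, SELinux is probably not the component rejecting the operation, at least not for the process that was tested.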

Generally I'd expect - and want! - CSI drivers to work on Bottlerocket, and as you say the EBS CSI driver does, so it should be possible here.

bcressey commented 2 years ago

Also typically from the admin container you would never see SELinux denials - the processes run with a highly privileged label for break-glass troubleshooting - so it's possible or even likely you won't have any avc denials.

I'd guess it's something else like the device cgroup allowlist blocking these device nodes.

chris-milsted commented 2 years ago

Hi,

I am working with @Arau on this and have a single node cluster which is easier to gather dmesg entries from.

The only entries in dmesg are from tcmu:

[  681.422346] SCSI subsystem initialized
[  702.504179] scsi host0: TCM_Loopback
[  702.546156] tcmu daemon: command reply support 1.
[  702.556775] scsi host0: TCM_Loopback
[  702.557557] scsi 0:0:1:0: Direct-Access     LIO-ORG  TCMU device      0002 PQ: 0 ANSI: 5
[  702.562526] sd 0:0:1:0: [sda] 2048 512-byte logical blocks: (1.05 MB/1.00 MiB)
[  702.562529] sd 0:0:1:0: [sda] 4096-byte physical blocks
[  702.562570] sd 0:0:1:0: [sda] Write Protect is off
[  702.562572] sd 0:0:1:0: [sda] Mode Sense: 2f 00 00 00
[  702.562636] sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  702.562699] sd 0:0:1:0: [sda] Optimal transfer size 131072 bytes
[  702.609528] sd 0:0:1:0: [sda] Attached SCSI disk
[  702.659288] sd 0:0:1:0: [sda] Synchronizing SCSI cache
[  702.818381] tcmu daemon: command reply support 1.
[  703.110925] tcmu daemon: command reply support 1.
[  817.011593] scsi host0: TCM_Loopback
[  817.012524] scsi 0:0:1:0: Direct-Access     LIO-ORG  TCMU device      0002 PQ: 0 ANSI: 5
[  817.013231] sd 0:0:1:0: [sda] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[  817.013233] sd 0:0:1:0: [sda] 4096-byte physical blocks
[  817.013258] sd 0:0:1:0: [sda] Write Protect is off
[  817.013259] sd 0:0:1:0: [sda] Mode Sense: 2f 00 00 00
[  817.013313] sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  817.013359] sd 0:0:1:0: [sda] Optimal transfer size 131072 bytes
[  817.049109] sd 0:0:1:0: [sda] Attached SCSI disk

If I run mkfs -t ext4 -b 4096 -D -F -E lazy_journal_init=1,lazy_itable_init=1 /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000 in the Ondat container I do not see any more entries in dmesg from selinux.

As an experiment I tried to do this directly to the /dev/sda device and this worked:

# mkfs -t ext4 -b 4096 -D -F -E lazy_journal_init=1,lazy_itable_init=1 /dev/sda
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done                            
Creating filesystem with 5242880 4k blocks and 1310720 inodes
Filesystem UUID: 17aa4119-f0f9-4734-b922-a002c43df710
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done   

The selinux contexts do seem to be different though:

[root@ip-192-168-76-215 volumes]# ls -Zal /dev/sda
brw-rw----. 1 root 993 system_u:object_r:any_t:s0 8, 0 Aug  3 14:52 /dev/sda
[root@ip-192-168-76-215 volumes]# ls -Zal /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000
brw-------. 1 root root system_u:object_r:local_t:s0 8, 0 Aug  3 14:40 /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000

These devices should be mapped to each other (major, minor device numbers).
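One way to confirm that mapping without reading the `ls` output by eye is to compare the device numbers directly. This is a sketch that assumes GNU coreutils `stat`, where `%t` and `%T` print the major and minor numbers in hex. The helper names `dev_numbers` and `same_device` are made up for illustration, and the comparison does not check whether both nodes are block devices.

```shell
# Print "major:minor" (in hex) for a block or character device node.
dev_numbers() {
    stat -c '%t:%T' "$1"
}

# Succeed only if both paths carry the same major and minor numbers.
same_device() {
    [ "$(dev_numbers "$1")" = "$(dev_numbers "$2")" ]
}

# Example with the paths from this thread:
#   same_device /dev/sda /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000 \
#       && echo "same underlying device"
```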

I did try to relabel the device with chcon, and this did log an SELinux denial, so SELinux audit logging does seem to be working:

[ 2233.389200] audit: type=1400 audit(1659539054.983:7): avc:  denied  { relabelfrom } for  pid=20357 comm="chcon" name="v.00000000-0000-0000-0000-000000001000" dev="nvme2n1p1" ino=133655 scontext=system_u:system_r:control_t:s0-s0:c0.c1023 tcontext=system_u:object_r:local_t:s0 tclass=blk_file permissive=0

Are there additional logs we can enable to debug this?

Chris

Arau commented 2 years ago

Hi @bcressey,

I think we have found why Ondat cannot write. The DaemonSet pod mounts /var/lib/storageos with Bidirectional mount propagation from the container to the host filesystem. When a device needs to be created, Ondat creates it under /var/lib/storageos/volumes; however, the initial mount used nodev. As I understand it, the nodev flag prevents device files on that mounted filesystem from being used as devices.

➜  bottlerocket k -n storageos exec -it storageos-node-ns9bt -- grep lib/storageos /proc/mounts
Defaulted container "storageos" out of: storageos, csi-driver-registrar, csi-liveness-probe, init (init)
/dev/nvme1n1p1 /var/lib/storageos ext4 rw,seclabel,nosuid,nodev,noatime 0 0

Is there a way for us to be able to execute that mount without that flag? I am not sure if it is containerd configuration that can be changed, or if we can apply some configuration to the kubelet.
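For anyone wanting to script the same check, here is a small sketch that reads a mounts table and reports whether a given mount point carries nodev. The function name `has_nodev` and its optional second argument are hypothetical. It assumes the standard /proc/mounts column layout shown above, with the mount point in field 2 and the options in field 4.

```shell
# Succeed if the mount at "$1" carries the nodev option.
# "$2" is the mounts table to read (defaults to /proc/mounts).
has_nodev() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }   # keep the last matching entry (the topmost mount)
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "nodev") exit 0
            exit 1
        }
    ' "${2:-/proc/mounts}"
}

# Example, run inside the container:
#   has_nodev /var/lib/storageos && echo "device nodes here will not be usable"
```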

bcressey commented 2 years ago

If the volumes directory doesn't contain other state, just device nodes, it might work to mount an additional emptyDir volume with the medium: Memory option set, so that a new tmpfs without the nodev option is placed there.

Otherwise (with CAP_SYS_ADMIN) you should be able to remount the bind mount for the directory with the dev option:

mount -o remount,dev /var/lib/storageos

Arau commented 2 years ago

Hi,

I would like to give an update on the issue. We have been working on different ways to get the mount options set correctly. The idea of the memory medium is quite interesting, but it opened a can of worms with other dependencies. We also tried symlinks to circumvent the mount inheritance, but since '/' is read-only, that would not work. The emptyDir couldn't work because the bind mount that the Ondat container uses to store data needs to be set to a specific location. It cannot be any random dir or tmp dir on the host, because, among other things, both the devices and the data reside at specific locations in the filesystem.

Running the remount from inside the Ondat container works because, as you mentioned, we run with CAP_SYS_ADMIN. To productionise this, we tried to run it in an init container that shares most of the same volume mounts as the main container. However, that didn't work. I thought an init container would share the mount table of the same namespaced filesystem, but for a reason I don't fully understand yet, it doesn't. We might be missing some system bind mounts in the init container that are needed to persist the change on the host.

Finally, we decided to mitigate the issue by running the remount from the main container code at bootstrap. That is effective and successful even though it is not ideal IMO. It would be best to be able to tune or tweak those flags from a configuration and declarative point of view.
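As an illustration only (the actual Ondat bootstrap code is not shown in this thread), a guarded remount could look roughly like the sketch below. It issues the remount only when nodev is currently set, so repeated container restarts stay harmless. `ensure_dev_mount`, `MOUNTS_FILE` and `MOUNT_CMD` are hypothetical names, the latter two being overrides for dry runs. The real remount needs CAP_SYS_ADMIN.

```shell
# Remount "$1" with the dev option, but only when nodev is currently set.
# MOUNTS_FILE and MOUNT_CMD are hypothetical overrides for dry runs and tests.
ensure_dev_mount() {
    target=$1
    # Options (field 4) of the last mounts-table entry for this mount point.
    opts=$(awk -v mp="$target" '$2 == mp { o = $4 } END { print o }' \
        "${MOUNTS_FILE:-/proc/mounts}")
    case ",$opts," in
        *,nodev,*) ${MOUNT_CMD:-mount} -o remount,dev "$target" ;;
        *)         return 0 ;;   # already usable for device nodes, or not mounted
    esac
}

# Example at container start:
#   ensure_dev_mount /var/lib/storageos
# Dry run that only prints the arguments mount would receive:
#   MOUNT_CMD=echo ensure_dev_mount /var/lib/storageos
```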

stmcginnis commented 1 year ago

It looks like things are working now, based on the last message. Though maybe not as smoothly as originally hoped.

@bcressey any thoughts on the last comment and if there is anything that could be done from the Bottlerocket side to improve this? If not, I think we can close this issue if no further action can be taken on it.