bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Need to allow block device mounting from non-root containers #3681

Open · blame19 opened 10 months ago

blame19 commented 10 months ago

Image I'm using: bottlerocket-aws-k8s-1.27-nvidia-x86_64-v1.17.0-53f322c2 supplied by AWS

What I expected to happen: Following this guide, I was trying to set:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = true

in order to be able to mount a block device inside a non-privileged container.

I can see this as part of the userdata (screenshot attached).

What actually happened: Still getting the superuser error:

nonroot@manual-fs-1b-78ccb57c5f-27kmk:/workspace$ mount /dev/block /data
mount: /data: must be superuser to use mount.

I realize that the linked blog post may be out of date. I've looked at your docs in security_guidance, which mention block devices, but I didn't find a solution for how to set this up.

How to reproduce the problem: Set the user data. Use a pod to mount a block device. Try to mount the block device without root privileges in the container.

arnaldo2792 commented 10 months ago

Hey @blame19, I'm curious about how you set device_ownership_from_security_context in the containerd configuration, since that particular setting isn't rendered in the containerd configuration created by the API server.

blame19 commented 10 months ago

> Hey @blame19, I'm curious about how you set device_ownership_from_security_context in the containerd configuration, since that particular setting isn't rendered in the containerd configuration created by the API server.

This is exactly the point - I was trying to set it through AWS UserData, but seeing no effect. Since the API doesn't expose it, I assume there is no way to have it?

rpkelly commented 10 months ago

Hi @blame19, we're looking into this, but our response will most likely be a little delayed, as a lot of folks are not around at the moment.

In the meantime, any additional context on your usecase that you feel comfortable sharing may be useful. Thanks!

blame19 commented 10 months ago

> Hi @blame19, we're looking into this, but our response will most likely be a little delayed, as a lot of folks are not around at the moment.
>
> In the meantime, any additional context on your usecase that you feel comfortable sharing may be useful. Thanks!

No worries - I totally understand the timing is a little peculiar this time around.

In my use case, I have an EKS cluster with several containers that each need around 15Gi of data. I'm storing the data in ready-made EBS volumes. EBS supports multi-attach with ReadWriteMany access in Block volume mode. I'm attaching the PVC to my containers as follows:

apiVersion: apps/v1
kind: Deployment
...
spec:
  replicas: 1
...
  containers:
  - name: somepod
    ...
    command: ["sh","-c", "--"]
    args: ["mount /dev/sdx /data && while true; do sleep 1000000; done"]
    securityContext:
      privileged: true
    volumeDevices:
    - devicePath: /dev/sdx
      name: volume
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: persistent-volume-claim

While this is fine for non-production environments, the issue is that I don't want to run privileged containers. While looking for a solution and experimenting with securityContext capabilities, I stumbled upon the aforementioned guide. From my understanding, it should be possible to tell containerd to "relax" the superuser requirement for attaching and detaching volumes, which would satisfy my use case. I haven't had time to test it on a regular node yet, but due to the GPU-specific needs of my containers I would need this on a Bottlerocket image.

A note: I looked into other options, e.g. EFS or storage solutions like GlusterFS, but I would still prefer to test this capability out.

bcressey commented 10 months ago

Specifically for device_ownership_from_security_context - my understanding from the linked blog post is that this only applies to device nodes added through the device plugin API.

While allowing this option to be enabled via the settings API could be useful for something like the NVIDIA device plugin, it wouldn't help you change the permissions for a block device node unless there was a corresponding device plugin installed that would arrange for /dev/sdx to be added to a pod that requested a more generally-named resource, like this fictional example:

resources:
  requests:
    ebs.aws/multi-attach-block-device: '1'

However, if the end goal is to mount a block device containing a filesystem, the container will likely need to run with CAP_SYS_ADMIN. The Linux kernel treats mounting a filesystem as a privileged operation. Running in a user namespace (which is not currently supported end-to-end on Bottlerocket) would allow mounting tmpfs or overlayfs but not most other filesystems. This may depend on the specific clustered filesystem you're using, but I'm pretty sure it will be a blocker.

If you end up adding CAP_SYS_ADMIN to the security context (instead of privileged: true), you wouldn't need the device node to have different uid/gid ownership.
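As a sketch, dropping privileged: true in favor of that single capability might look like the following in the Deployment from earlier (container and volume names reused from that example; whether the mount then succeeds still depends on the device node's permissions and the kernel's checks):

```yaml
  containers:
  - name: somepod
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]   # allows mount(2) without full privileged mode
    volumeDevices:
    - devicePath: /dev/sdx
      name: volume
```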

blame19 commented 9 months ago

> However, if the end goal is to mount a block device containing a filesystem, the container will likely need to run with CAP_SYS_ADMIN. The Linux kernel treats mounting a filesystem as a privileged operation. Running in a user namespace (which is not currently supported end-to-end on Bottlerocket) would allow mounting tmpfs or overlayfs but not most other filesystems. This may depend on the specific clustered filesystem you're using, but I'm pretty sure it will be a blocker.

I had already tried with CAP_SYS_ADMIN, but as you mentioned, I didn't manage to get it working (since it's not supported end-to-end).

An alternative solution I would like to run by you would be adding another EBS volume to the EC2 machine, say /dev/sdx, containing my data. This is a non-issue since I can provision it easily. Then the block device would need to be mounted on the node's filesystem, and the data would be accessible through hostPath or similar.
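If the node-level mount were in place, consuming it from a pod could look roughly like this (the /mnt/data path is a hypothetical mount point chosen for illustration, not something Bottlerocket sets up for you):

```yaml
  containers:
  - name: somepod
    volumeMounts:
    - mountPath: /data
      name: data
  volumes:
  - name: data
    hostPath:
      path: /mnt/data   # hypothetical: wherever the startup step mounted the EBS volume
      type: Directory
```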

AWS suggests mounting block devices by logging into the machine, but I would need to do this at startup with some user data script. Any idea how I could achieve this with Bottlerocket?

bcressey commented 9 months ago

Bootstrap containers are the supported mechanism for running scripts at startup. stefansundin/bottlerocket-bootstrap-exec-user-data is one example.
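A minimal sketch of the user-data settings that enable a bootstrap container (the container name and image URI are placeholders; the referenced image would carry your mount script):

```toml
[settings.bootstrap-containers.mount-data]
source = "123456789012.dkr.ecr.us-west-2.amazonaws.com/mount-data:v1"  # placeholder image URI
mode = "always"       # run on every boot; "once" runs only on the first boot
essential = false     # a failure won't block the rest of the boot
```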

The EBS CSI driver supports block volumes, so you could potentially also model this as a DaemonSet that automatically provisions an EBS volume via a PVC.
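For reference, a block-mode claim against the EBS CSI driver might look like this sketch (the ebs-sc StorageClass name is an assumption, and ReadWriteMany multi-attach only works with io1/io2 Provisioned IOPS volume types):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-block-data
spec:
  accessModes: ["ReadWriteMany"]   # multi-attach; requires an io1/io2 EBS volume type
  volumeMode: Block                # expose the raw device; the CSI driver creates no filesystem
  storageClassName: ebs-sc         # assumed StorageClass backed by the EBS CSI driver
  resources:
    requests:
      storage: 15Gi
```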