blame19 opened 10 months ago
Hey @blame19, I'm curious how you set device_ownership_from_security_context in the containerd configuration, since that particular setting isn't rendered in the containerd configuration created by the API server.
This is exactly the point - I was trying to set it through AWS UserData, but saw no effect. Since the API doesn't expose it, I assume there is no way to have it?
Hi @blame19, we're looking into this, but our response will most likely be a little delayed, as a lot of folks are not around at the moment.
In the meantime, any additional context on your use case that you feel comfortable sharing may be useful. Thanks!
No worries - I totally understand the timing is a little peculiar this time around.
In my use case, I have an EKS cluster with several containers that require around 15Gi of data. I'm storing the data on ready-made EBS volumes. EBS supports multi-attach with RWX access in Block mode. I'm attaching the PVC to my containers as follows:
```yaml
apiVersion: apps/v1
kind: Deployment
...
spec:
  replicas: 1
  ...
  containers:
    - name: somepod
      ...
      command: ["sh", "-c", "--"]
      args: ["mount /dev/sdx /data && while true; do sleep 1000000; done"]
      securityContext:
        privileged: true
      volumeDevices:
        - devicePath: /dev/sdx
          name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: persistent-volume-claim
```
While this is fine for non-production environments, the issue is that I don't want to run privileged containers. While looking for a solution and experimenting with securityContext capabilities, I stumbled upon the aforementioned guide. From my understanding, it should be possible to tell containerd to "relax" the superuser requirement for attaching and detaching volumes, which would satisfy my use case. I haven't had time to test it out on a regular node yet, but due to the GPU-specific needs of my containers I would need this on a Bottlerocket image.
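For reference, this is a minimal sketch of what the setting from the guide looks like in containerd's CRI plugin config - as noted above, Bottlerocket's API doesn't currently render this, so this is illustrative only:

```toml
# containerd CRI plugin config (sketch); Bottlerocket does not
# currently expose this setting through its settings API.
[plugins."io.containerd.grpc.v1.cri"]
  # Take device node ownership from the pod's securityContext
  # (runAsUser/runAsGroup) instead of defaulting to root.
  device_ownership_from_security_context = true
```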
A note: I looked into other options, e.g. EFS or storage solutions like GlusterFS, but I would still prefer to test this capability out.
Specifically for device_ownership_from_security_context - my understanding from the linked blog post is that this only applies to device nodes added through the device plugin API. While allowing this option to be enabled via the settings API could be useful for something like the NVIDIA device plugin, it wouldn't help you change the permissions for a block device node unless there was a corresponding device plugin installed that would arrange for /dev/sdx to be added to a pod that requested a more generally-named resource, like this fictional example:
```yaml
resources:
  requests:
    ebs.aws/multi-attach-block-device: '1'
```
However, if the end goal is to mount a block device containing a filesystem, the container will likely need to run with CAP_SYS_ADMIN. The Linux kernel treats mounting a filesystem as a privileged operation. Running in a user namespace (which is not currently supported end-to-end on Bottlerocket) would allow mounting tmpfs or overlayfs, but not most other filesystems. This may depend on the specific clustered filesystem you're using, but I'm pretty sure it will be a blocker.

If you end up adding CAP_SYS_ADMIN to the security context (instead of privileged: true), you wouldn't need the device node to have different uid/gid ownership.
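For illustration, a minimal sketch of that security context - note that Kubernetes capability names drop the CAP_ prefix:

```yaml
securityContext:
  privileged: false
  capabilities:
    # SYS_ADMIN (i.e. CAP_SYS_ADMIN) lets the container call mount(2)
    # without granting full privileged mode.
    add: ["SYS_ADMIN"]
```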
I had already tried with CAP_SYS_ADMIN, but as you mentioned, I didn't manage to get it working (since it's not supported end-to-end).
An alternative solution I would like to run by you would be adding another EBS volume to the EC2 machine, let's say /dev/sdx, containing my data. This is a non-issue since I can provision it easily.
Then the block device would need to be mounted on the node's filesystem, and the data would be accessible through hostPath or similar.
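Roughly, assuming the node mounted the volume at a hypothetical /mnt/data, the pod side would look like this sketch:

```yaml
volumes:
  - name: data
    hostPath:
      # Hypothetical path where a startup script on the node
      # mounted the extra EBS volume.
      path: /mnt/data
      type: Directory
```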
AWS suggests mounting block devices by logging into the machine, but I would need to do this at startup with some user data script. Any idea how I could achieve this with Bottlerocket?
Bootstrap containers are the supported mechanism for running scripts at startup. stefansundin/bottlerocket-bootstrap-exec-user-data is one example.
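A minimal sketch of the Bottlerocket user data (TOML) that enables a bootstrap container - the container name, image URI, and embedded script here are assumptions, not a tested setup:

```toml
# Bottlerocket user data (sketch); the ECR image URI and the
# base64-encoded script it receives are hypothetical.
[settings.bootstrap-containers.mount-data]
source = "<account>.dkr.ecr.<region>.amazonaws.com/mount-data:latest"
mode = "once"        # run once at boot; "always" reruns on every boot
essential = false    # don't block boot if the container fails
# Optional base64-encoded data passed to the container, e.g. a script;
# this example decodes to "mount /dev/sdx /mnt/data":
user-data = "bW91bnQgL2Rldi9zZHggL21udC9kYXRh"
```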
The EBS CSI driver supports block volumes so you could potentially also model this as a daemonset that automatically provisions an EBS volume via PVC.
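For example, a raw block PVC against the EBS CSI driver might look like this sketch (the ebs-sc storage class name is an assumption):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-block-claim
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block          # raw block device, no filesystem imposed
  storageClassName: ebs-sc   # assumed EBS CSI driver storage class
  resources:
    requests:
      storage: 15Gi
```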
Image I'm using: bottlerocket-aws-k8s-1.27-nvidia-x86_64-v1.17.0-53f322c2 supplied by AWS
What I expected to happen: Following this guide, I was trying to set device_ownership_from_security_context in order to be able to mount a block device inside a non-privileged container. I can see the setting as part of the user data.
What actually happened: I'm still getting the superuser error when trying to mount.
I realize that the linked blog post may be out of date. I've looked at your docs in security_guidance, which mention block devices, but I didn't find a solution for how to set them up.
How to reproduce the problem: Set the user data, attach a block device to a pod, and try to mount it without root privileges in the container.