ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

CephFS keyring requires nonsensically enormous and insecure privileges to work #4677

Open benapetr opened 2 weeks ago

benapetr commented 2 weeks ago

Describe the bug

Right now (I have tried many combinations) the smallest caps that work with the CephFS storage class are these:

      - mon: 'allow r'
      - osd: 'allow rw tag cephfs metadata=fs_k8s, allow rw tag cephfs data=fs_k8s'
      - mds: 'allow r fsname=fs_k8s path=/volumes, allow rws fsname=fs_k8s path=/volumes/k8s_pb'
      - mgr: 'allow rw'

That is with a dedicated filesystem "fs_k8s" that is, however, intended to be shared by multiple separate Kubernetes clusters.
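
For reference, a keyring with exactly these caps can be created with ceph auth get-or-create; the client name client.k8s-csi below is only an example:

  ceph auth get-or-create client.k8s-csi \
    mon 'allow r' \
    osd 'allow rw tag cephfs metadata=fs_k8s, allow rw tag cephfs data=fs_k8s' \
    mds 'allow r fsname=fs_k8s path=/volumes, allow rws fsname=fs_k8s path=/volumes/k8s_pb' \
    mgr 'allow rw'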

Removing or reducing any of these caps in any way results in errors like:

  Warning  ProvisioningFailed    2m2s (x13 over 16m)   cephfs.csi.ceph.com_ceph-csi-cephfs-provisioner-756d7bb54f-z5pr7_eda038b5-5ace-4e38-940b-68bfe6c76e31  failed to provision volume with StorageClass "ceph-cephfs-sc": rpc error: code = Internal desc = rados: ret=-1, Operation not permitted

At a low level, those permissions grant:

This is an enormous security hole that makes isolation within the same FS namespace impossible. The only way to work around this is to install a dedicated Ceph cluster for each CephFS CSI consumer.

You can also create a dedicated FS namespace with its own MDS, but that still doesn't prevent the CSI keyring from abusing the MGR rw caps.

Why are such enormous privileges needed? It is perfectly possible to work with CephFS with no access to the metadata pool at all (not even read-only), since only the MDS is supposed to touch it. RW OSD access is only needed for the data pools used by the folders that the cluster's subvolume group is mapped to; there is no need to grant access to all of them. A sketch of such a restricted keyring is shown below.
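
For comparison, a keyring of roughly this shape can be generated with ceph fs authorize, which produces a read-only mon cap, an MDS cap scoped to a single path, and rw only on the data pools tagged for fs_k8s, with no MGR cap and no metadata-pool access (the client name is again just an example):

  # Grants mon read, a path-scoped mds cap, and rw on the fs_k8s data pools;
  # no mgr cap and no access to the metadata pool.
  ceph fs authorize fs_k8s client.k8s-restricted /volumes/k8s_pb rw

A keyring like this is enough to mount and use CephFS directly, but, as described in this issue, it is not enough for the CSI provisioner.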

The MGR rw cap is probably needed to access the MGR API for subvolume management, but most of those operations can be handled in alternative ways, for example creating snapshots via .snap folders.
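
For instance, on a mounted subvolume a snapshot can be created and removed without any MGR access at all (the paths below are purely illustrative):

  # Create a CephFS snapshot by making a directory inside the special .snap folder
  mkdir /mnt/cephfs/volumes/k8s_pb/my-subvolume/.snap/before-upgrade
  # Remove the snapshot again
  rmdir /mnt/cephfs/volumes/k8s_pb/my-subvolume/.snap/before-upgrade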

Basically, the list of unnecessary permissions:

This is a big security obstacle if you want to create a secure environment.

Environment details

Steps to reproduce

Steps to reproduce the behavior:

Try to create a keyring that is restricted to a specific data pool only, with no access to the metadata pool or the MGR. CephFS will be mountable and usable just fine with such a keyring (see the example below), but the CephFS storage class will be unusable (permission denied for everything).
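
For example, a plain kernel mount with such a restricted keyring works; the monitor address, client name and secret file below are placeholders:

  mount -t ceph 192.0.2.10:6789:/volumes/k8s_pb /mnt/k8s_pb \
    -o name=k8s-restricted,secretfile=/etc/ceph/k8s-restricted.secret,mds_namespace=fs_k8s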

Actual results

Getting permission denied unless the keyring has almost admin-like caps

Expected behavior

The storage class should not require admin-like caps to work with CephFS. Regular restricted caps should be enough.

Logs

  Normal   Provisioning          2m2s (x13 over 16m)   cephfs.csi.ceph.com_ceph-csi-cephfs-provisioner-756d7bb54f-z5pr7_eda038b5-5ace-4e38-940b-68bfe6c76e31  External provisioner is provisioning volume for claim "monitoring-system/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0"
  Warning  ProvisioningFailed    2m2s (x13 over 16m)   cephfs.csi.ceph.com_ceph-csi-cephfs-provisioner-756d7bb54f-z5pr7_eda038b5-5ace-4e38-940b-68bfe6c76e31  failed to provision volume with StorageClass "ceph-cephfs-sc": rpc error: code = Internal desc = rados: ret=-1, Operation not permitted

Additional context

This was already discussed in https://github.com/ceph/ceph-csi/issues/1818#issuecomment-1057467489

nixpanic commented 2 weeks ago

Hi @benapetr,

In addition to the permissions needed to work with CephFS, Ceph-CSI needs to store additional metadata mapping (CSI) volume handles to CephFS details. This metadata is stored directly in RADOS OMAPs, which should explain the need for the extra permissions.
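
For illustration, that bookkeeping can be inspected with the rados CLI. The pool name and the csi namespace below are assumptions for a typical deployment where Ceph-CSI keeps its OMAP objects in the filesystem's metadata pool:

  # List the objects Ceph-CSI uses for its volume bookkeeping (pool name assumed)
  rados -p cephfs.fs_k8s.meta --namespace csi ls

  # Dump the OMAP entries that map PV names to internal volume UUIDs
  rados -p cephfs.fs_k8s.meta --namespace csi listomapvals csi.volumes.default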

If there is a reduced permission set that allows Ceph-CSI to work with CephFS and RADOS, we would obviously appreciate guidance on dropping the unneeded capabilities.

Details about the required capabilities are documented in docs/capabilities.md.

benapetr commented 2 weeks ago

So does that mean the only safe and truly isolated way to allow multiple k8s clusters to use CephFS is to build a dedicated Ceph cluster for each k8s cluster? That is indeed not very efficient.