ceph / ceph-nvmeof

Service to provide Ceph storage over NVMe-oF/TCP protocol
GNU Lesser General Public License v3.0

Restrict creation of namespaces from RBD images that are already used as namespaces in other NVMe-oF GW groups, to avoid data corruption #833


rahullepakshi commented 2 weeks ago

This might be tough to implement, since RBD image metadata is spread across the different OMAP state files, but we need to find a way to restrict creation of namespaces from RBD images that are already part of other GW groups as namespaces, because such volumes can be accessed by multiple initiators, causing data inconsistency. At large scale, say 1K to 4K RBD images, it is difficult for the user to keep track of used/unused images when creating namespaces. Please let me know your thoughts.

idryomov commented 2 weeks ago

At the RBD level, this can be done with the rbd_lock_acquire() API, passing RBD_LOCK_MODE_EXCLUSIVE for lock_mode. Internally this is implemented by disabling automatic exclusive lock transitions, so it is an option only for images with the exclusive-lock feature enabled.
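For illustration, a minimal sketch of this approach through the Python rbd bindings (which wrap rbd_lock_acquire()); the pool and image names below are hypothetical:

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    with cluster.open_ioctx("rbd") as ioctx:        # hypothetical pool name
        image = rbd.Image(ioctx, "ns-image-01")     # hypothetical image name
        try:
            if not (image.features() & rbd.RBD_FEATURE_EXCLUSIVE_LOCK):
                raise RuntimeError("exclusive-lock feature not enabled")
            # Take the exclusive lock and disable automatic transitions:
            # other clients can no longer transparently steal the lock.
            image.lock_acquire(rbd.RBD_LOCK_MODE_EXCLUSIVE)
            # ... the lock is held only while this image handle stays open ...
        finally:
            image.close()
finally:
    cluster.shutdown()
```

Note that the lock is tied to the open image handle, so something would have to keep the image open for as long as the namespace exists.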

Another option is to employ advisory locking at the RADOS level, placing a lock on the image's rbd_header.XYZ object using the rados_lock_exclusive() API.
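Again as a hedged sketch via the Python rados bindings; the lock name, cookie, and description are invented, and the header object name is derived from the image id:

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    with cluster.open_ioctx("rbd") as ioctx:
        with rbd.Image(ioctx, "ns-image-01") as image:   # hypothetical name
            header_obj = "rbd_header." + image.id()
        # Advisory: this only restrains clients that cooperate by taking
        # (or checking for) the same lock name. duration=None means no expiry.
        ioctx.lock_exclusive(header_obj, "nvmeof-ns", "group1",
                             desc="exported by GW group1", duration=None)
finally:
    cluster.shutdown()
```

A second group attempting the same lock_exclusive() call would fail until the holder releases it with unlock().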

Yet another option might be to use RBD per-image metadata, but unlike an approach that involves locks, it wouldn't be atomic.
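A sketch of the metadata variant (the nvmeof/owner_group key is invented for illustration), which also shows where the atomicity problem lives:

```python
import rbd

def claim_image(ioctx, image_name, group):
    """Record group ownership in per-image metadata. The get/set pair
    below is two separate operations, so two gateways racing through
    this function can both see "no owner" and both claim the image --
    the non-atomicity mentioned above."""
    with rbd.Image(ioctx, image_name) as image:
        try:
            owner = image.metadata_get("nvmeof/owner_group")
            raise RuntimeError(f"image already owned by {owner}")
        except KeyError:
            image.metadata_set("nvmeof/owner_group", group)
```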

How any of these approaches would interact with HA and images being potentially moved between groups would need to be investigated.

caroav commented 2 weeks ago

@idryomov I think the request is more about the configuration, not about which GW is using the image to do I/O. Even if all GWs in the group are in maintenance, for example, the request is to not allow the user to use the same images for other nvmeof namespaces in another group. Given that, do you still think the rbd_lock_acquire() API addresses it?

Does it make sense in this case to use RBD namespaces, i.e. a namespace per GW group?

idryomov commented 2 weeks ago

> Even if all GWs in the group are in maintenance, for example, the request is to not allow the user to use the same images for other nvmeof namespaces in another group. Given that, do you still think the rbd_lock_acquire() API addresses it?

Likely not. If so, I think it should be enforced through OMAP state files^Wobjects, even if that is not entirely trivial.
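One possible shape for such a check, sketched against the Python rados bindings; the state object names and the way image names appear in OMAP values here are hypothetical, not the actual ceph-nvmeof schema:

```python
import rados

def image_used_elsewhere(ioctx, other_group_state_objects, image_name):
    """Scan the OMAP of every other group's state object for image_name.
    Note: a scan followed by a write is not atomic across objects, so
    real enforcement would need more care (e.g. a single registry object
    updated conditionally)."""
    for obj in other_group_state_objects:
        with rados.ReadOpCtx() as read_op:
            it, ret = ioctx.get_omap_vals(read_op, "", "", 10000)
            ioctx.operate_read_op(read_op, obj)
            for key, val in it:
                if image_name in val.decode(errors="replace"):
                    return True
    return False
```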

idryomov commented 2 weeks ago

(Rahul reached out to me asking specifically for input from the RBD perspective on "setting a flag/lock on a image", so I may have misinterpreted this.)

> Does it make sense in this case to use RBD namespaces, i.e. a namespace per GW group?

An RBD namespace can be thought of as a directory within a pool. If the concern is an operator exporting an image that they shouldn't be exporting at that moment (because it's already exported), but in general they should be able to (meaning that it's not a matter of access control), I don't see how placing images in namespaces would make a difference.

caroav commented 2 weeks ago

I'm not familiar with RBD namespaces, but my thinking was that if each group's state file (OMAP) accessed a different RBD namespace (i.e. we only look within that namespace), then maybe we could avoid the mix. But then it probably means that nvmeof users would need to create the images in the right RBD namespace.
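For reference, a minimal sketch of what per-group RBD namespaces could look like (group and image names invented):

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    with cluster.open_ioctx("rbd") as ioctx:
        # One RBD namespace per GW group; "group1" is a made-up name.
        if not rbd.RBD().namespace_exists(ioctx, "group1"):
            rbd.RBD().namespace_create(ioctx, "group1")
        # Scope the ioctx to the namespace: images created or listed
        # through it are invisible to an ioctx scoped to another group.
        ioctx.set_namespace("group1")
        rbd.RBD().create(ioctx, "ns-image-01", 10 * 1024**3)
finally:
    cluster.shutdown()
```

As noted above, this would indeed require users to create (or move) images into the right RBD namespace up front.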

idryomov commented 2 weeks ago

This might be too restrictive. Would a scenario of images being "moved" between groups be common? An operator wanting to e.g. export an image through group1 today and through group2 tomorrow seems reasonable to me.