OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

Implement Ceph RBD fencing #6640

Open · hydro-b opened 3 months ago

hydro-b commented 3 months ago

**Description**
At any given time only one virtual machine should be able to write to a given RBD image. No other VM should be able to acquire an exclusive lock and use the RBD image without first blocklisting the other client (at least when the disk is not marked "shareable", and leaving aside the handover during a live migration). This is to avoid data corruption when two VMs (i.e. two instances of the same VM) are running and overwriting each other's data. This can happen when a VM is started while an instance of that VM is already running (a zombie). One example of such a case is a VM live migration that "failed" (but actually succeeded): the VM goes into poweroff state because it is not in the running state, and is then resumed again. See this issue for an example. OpenNebula, as the orchestrator (Cloud Management Platform, CMP), should prevent this situation from happening.

Should Ceph handle this issue? No, and this thread, in which OpenNebula is also used as the CMP, explains why. It is the responsibility of the CMP to make sure these situations are handled properly. See this documentation on how blocklisting can be used to prevent data corruption from happening (fencing).
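For reference, the blocklist-based fence described in the Ceph docs looks roughly like the sketch below. The image name, client address, lock ID, and locker are illustrative placeholders, not values from this issue:

```bash
# List the current lockers of the image (image name is illustrative)
rbd lock ls rbd/vm-disk-0

# Blocklist the stale client by the address reported above; once the OSD
# map update propagates, that client can no longer write to the cluster
# (on pre-Octopus releases the command is "ceph osd blacklist add")
ceph osd blocklist add 192.168.1.10:0/123456789

# Remove the now-fenced client's lock (lock ID and locker from the listing)
rbd lock rm rbd/vm-disk-0 "auto 18446462598732840961" client.4485
```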

Implementation considerations:

Ideally the implementation avoids any potential race conditions as much as possible (i.e. makes the fence an atomic operation; see the sketch below)
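One way to keep the fence close to atomic is to blocklist first and start the VM only afterwards, since a blocklist entry is a single OSD-map update per client. A minimal sketch, assuming a hypothetical `pool/image` argument and the JSON output of `rbd status`:

```bash
#!/usr/bin/env bash
# Hedged sketch of a fence-before-start step: blocklist every client that
# still has the image open, so the old instance cannot write once the new
# one is started. The image naming below is an assumption.
set -euo pipefail
IMAGE="$1"   # e.g. "one/one-123-0-0" (hypothetical OpenNebula image name)

# "rbd status" reports the active watchers (clients with the image open)
rbd status "$IMAGE" --format json |
  jq -r '.watchers[].address' |
  while read -r addr; do
      # One OSD-map update per client; once it propagates, writes fail
      ceph osd blocklist add "$addr"
  done
```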

**Use case**
This should be part of OpenNebula's Ceph implementation by default, for all deployments. It prevents data corruption when multiple VM (instances) are running at the same time, all trying to write to the same disks.

**Interface Changes**
No external interface changes are needed; this should all be handled by OpenNebula internally (i.e. drivers / libvirt hooks).
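To illustrate the libvirt hook option: a hedged sketch of an `/etc/libvirt/hooks/qemu` hook that fences all RBD disks before a domain starts. libvirt passes the domain XML on stdin; `fence_rbd_image` is a hypothetical helper (e.g. the blocklist sketch above):

```bash
#!/usr/bin/env bash
# Hedged sketch, not OpenNebula code. libvirt calls this hook as:
#   qemu <domain> <operation> <sub-operation> <extra>
# with the domain XML on stdin.
DOMAIN="$1" OPERATION="$2"

if [ "$OPERATION" = "prepare" ]; then
    # Pull the RBD image names (pool/image) out of the domain XML
    xmllint --xpath '//disk/source[@protocol="rbd"]/@name' - 2>/dev/null |
        grep -o 'name="[^"]*"' | cut -d'"' -f2 |
        while read -r image; do
            fence_rbd_image "$image"   # hypothetical helper
        done
fi
```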

**Additional Context**
This is not just a hypothetical case: it has happened to multiple OpenNebula / Ceph users in production environments.

**Progress Status**

atodorov-storpool commented 3 months ago

Wow! This issue is like déjà vu for me, because it was the first issue reported when I started my journey in the OpenNebula universe ~10 years ago. The upstream script implemented back then (`tm_mad/failmigrate`, and the `migrate_other()` function implemented later) has been part of the StorPool driver since that time. The issue is not Ceph-related. It is a generic one, as there are other cases leading to data corruption that need to be covered too.

So a general/generic "fencing" as suggested here is a good thing 👍

hydro-b commented 3 months ago

> Wow! This issue is like déjà vu for me, because it was the first issue reported when I started my journey in the OpenNebula universe ~10 years ago. The upstream script implemented back then (`tm_mad/failmigrate`, and the `migrate_other()` function implemented later) has been part of the StorPool driver since that time.

Interesting. The `tm_mad/failmigrate` script in Ceph exits with 0, so no magic there. There are indeed some leftovers in `/var/lib/one/datastores/<system_ds>/<vm_id>` on the hypervisor. Let's see if that can be improved upon.
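A minimal sketch of what a non-trivial `tm_mad/ceph/failmigrate` could do for the genuinely-failed case; the argument layout is an assumption borrowed from the pre/postmigrate scripts, so verify it against the TM driver API:

```bash
#!/usr/bin/env bash
# Hedged sketch: clean up the destination host after a genuinely failed
# live migration instead of just exiting 0. The argument order mirrors
# premigrate/postmigrate (an assumption, check the TM driver docs).
SRC_HOST="$1" DST_HOST="$2" REMOTE_DIR="$3" VM_ID="$4" DS_ID="$5"

# Remove the leftover VM directory in the system datastore on the
# destination, e.g. /var/lib/one/datastores/<system_ds>/<vm_id>
ssh "$DST_HOST" "rm -rf '$REMOTE_DIR'"
```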

> The issue is not Ceph-related. It is a generic one, as there are other cases leading to data corruption that need to be covered too.

If you have any examples you can share or think of, that would be great. Or maybe we can add those to a "generic storage fencing" issue (see below) so we can catch both the general case and the corner cases.

> So a general/generic "fencing" as suggested here is a good thing 👍

I expect each (shared) storage system has its own way of dealing with this issue. I created this issue to get a tailor-made solution for Ceph. That could of course be based on a more generic storage fencing solution with hooks into the specific storage system used. Ideally as much boilerplate as possible is shared across the specific implementations.

Do you think we should create a generic "storage fencing" solution and make references to individual storage-specific issues (like this one)?

atodorov-storpool commented 3 months ago

> Interesting. The `tm_mad/failmigrate` script in Ceph exits with 0, so no magic there. There are indeed some leftovers in `/var/lib/one/datastores/<system_ds>/<vm_id>` on the hypervisor. Let's see if that can be improved upon.

You could definitely take a look at how much work is done in this script in addon-storpool (well, not much: it is like the postmigrate script run on the destination host instead of the source host).

> If you have any examples you can share or think of, that would be great. Or maybe we can add those to a "generic storage fencing" issue (see below) so we can catch both the general case and the corner cases.

> So a general/generic "fencing" as suggested here is a good thing 👍

> I expect each (shared) storage system has its own way of dealing with this issue. I created this issue to get a tailor-made solution for Ceph. That could of course be based on a more generic storage fencing solution with hooks into the specific storage system used. Ideally as much boilerplate as possible is shared across the specific implementations.

> Do you think we should create a generic "storage fencing" solution and make references to individual storage-specific issues (like this one)?

I believe that OpenNebula should generally be aware of the risks that shared storage introduces and give proper "hints" about what is expected of the storage drivers. The storage drivers should do their best to resolve/mitigate the issue, since they have the context and the related capabilities. In brief: the drivers should be notified before a VM instance is started on a host, to ensure that the disks are not attached elsewhere. In the case of migration, the driver should ensure that the disks are attached to only one instance at the end, no matter whether the migration succeeded or failed. I mean any kind of migration (don't look at the `tm/storpool/migrate` script, because it is a nightmare there: it is the only script called when a cold migration is executed, and it runs in two contexts, first for the image datastore and then for the system datastore...).
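To make the "notified before a VM instance is started" idea concrete, here is a hedged sketch of such a pre-start check; the image naming, the argument handling, and the watcher-based check are all assumptions:

```bash
#!/usr/bin/env bash
# Hedged sketch of a pre-start check: refuse to start the VM while any
# other client still has the RBD image open. All names are illustrative.
set -euo pipefail
IMAGE="$1"         # e.g. "one/one-123-0-0"
ALLOWED_HOST="$2"  # IP of the host that is about to run the VM

rbd status "$IMAGE" --format json |
  jq -r '.watchers[].address' |
  while read -r addr; do
      # Watcher addresses look like "10.0.0.5:0/3710940337"
      if [ "${addr%%:*}" != "$ALLOWED_HOST" ]; then
          echo "ERROR: $IMAGE is still open by $addr; fence it first" >&2
          exit 1
      fi
  done
```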