canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Disconnecting cluster member from network and then moving ceph based instance doesn't stop the instance on disconnected member #12526

Open tomponline opened 11 months ago

tomponline commented 11 months ago
lxc launch ubuntu:22.04 c1 --storage ceph --target=foo
# Disconnect foo from network
lxc move c1 --target=bar
lxc start c1
# Reconnect foo to network
# ps aux on foo still shows c1 running

We should investigate whether this causes two concurrent mappings of the same Ceph volume on two members, which could cause data corruption, and also when/how the instance on the foo member should be forcefully stopped.
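One way to check for a double mapping (a sketch; it assumes direct access to the Ceph cluster, and the pool/image names are placeholders that depend on the storage pool configuration):

# On each cluster member, list the RBD images currently mapped locally:
rbd showmapped
# From any host with Ceph client access, list the clients watching the volume's image:
rbd status <pool>/<image>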

tomponline commented 6 months ago

We should investigate whether this causes two concurrent mappings of the same Ceph volume on two members, which could cause data corruption, and also when/how the instance on the foo member should be forcefully stopped.

After some investigation it appears that Ceph itself is preventing concurrent access to the VM's disk.

By way of its exclusive locks feature:

https://docs.ceph.com/en/latest/rbd/rbd-exclusive-locks/
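This can be confirmed on a given image (a sketch; the pool is the Ceph OSD pool backing the LXD storage pool and the image name is deployment-specific, both shown as placeholders):

rbd ls <pool>                  # find the image backing the instance's volume
rbd info <pool>/<image>        # the features line should include exclusive-lock
rbd lock ls <pool>/<image>     # shows which client currently holds the exclusive lock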

Using a MicroCloud deployment:

root@micro01:~# lxc launch ubuntu:22.04 v1 --storage remote --target=micro01 --vm
root@micro01:~# lxc shell v1   # from micro01, stay logged in

In a separate terminal inside micro01, cause a network partition:

root@micro01:~# iptables -A INPUT -j DROP

In a separate terminal inside micro02, confirm that micro01 is considered offline:

root@micro02:~# lxc cluster list
+---------+---------------------------+-----------------+--------------+----------------+-------------+---------+---------------------------------------------------------------------------+
|  NAME   |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |  STATE  |                                  MESSAGE                                  |
+---------+---------------------------+-----------------+--------------+----------------+-------------+---------+---------------------------------------------------------------------------+
| micro01 | https://10.5.208.61:8443  | database        | x86_64       | default        |             | OFFLINE | No heartbeat for 1m10.571276054s (2024-04-02 08:47:38.40554962 +0000 UTC) |
+---------+---------------------------+-----------------+--------------+----------------+-------------+---------+---------------------------------------------------------------------------+
| micro02 | https://10.5.208.151:8443 | database-leader | x86_64       | default        |             | ONLINE  | Fully operational                                                         |
|         |                           | database        |              |                |             |         |                                                                           |
+---------+---------------------------+-----------------+--------------+----------------+-------------+---------+---------------------------------------------------------------------------+
| micro03 | https://10.5.208.207:8443 | database        | x86_64       | default        |             | ONLINE  | Fully operational                                                         |
+---------+---------------------------+-----------------+--------------+----------------+-------------+---------+---------------------------------------------------------------------------+

Now recover the VM onto micro02:

root@micro02:~# lxc move v1 --target=micro02
root@micro02:~# lxc start v1
# Confirm the VM boots and is reachable.

Now, on micro01, let's restore network connectivity:

root@micro01:~# iptables -F
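
The member state can be checked again from micro02 while waiting:

root@micro02:~# lxc cluster list   # repeat until micro01 shows ONLINE again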

Wait for micro01 to be considered back online. In the shell for v1 on micro01, access to the disk should now be actively blocked:

root@v1:~# ls
-bash: /usr/bin/ls: Input/output error
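
To confirm from the Ceph side that the stale client on micro01 has been fenced (a sketch; pool/image placeholders as above, and note that older Ceph releases call the blocklist a blacklist):

rbd status <pool>/<image>      # should now list only the client on micro02 as a watcher
rbd lock ls <pool>/<image>     # the exclusive lock should be held by the micro02 client
ceph osd blocklist ls          # shows whether the old client address has been blocklisted
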
tomponline commented 6 months ago

However, we should go further, because according to the Ceph docs:

By default, the exclusive-lock feature does not prevent two or more concurrently running clients from opening the same RBD image and writing to it in turns (whether on the same node or not). In effect, their writes just get linearized as the lock is automatically transitioned back and forth in a cooperative fashion.

So we should ensure LXD maps the disks in exclusive mode:

To disable automatic lock transitions between clients, the RBD_LOCK_MODE_EXCLUSIVE flag may be specified when acquiring the exclusive lock. This is exposed by the --exclusive option for rbd device map command.
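At the rbd CLI level that looks roughly like the following (a sketch only; LXD maps volumes internally, so this just illustrates the flag rather than how LXD would invoke it):

rbd device map --exclusive <pool>/<image>   # refuse cooperative lock transitions to other clients
rbd device unmap <pool>/<image>             # remove the mapping when done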

However, we will need LXD to relax the exclusive lock mode when live migrating instances, as live migration requires the disk to be active on both hosts at the same time.

We also need to check whether this still allows instance recovery in the scenario where a cluster member goes offline or is partitioned unexpectedly.
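
If the exclusive mapping does turn out to block recovery while the failed member's client is still registered, one option to evaluate (a sketch only; whether LXD should ever do this automatically is part of the question) is breaking the stale lock from a healthy member:

rbd lock ls <pool>/<image>                      # note the lock id and the locker (client.<id>)
rbd lock rm <pool>/<image> <lock-id> <locker>    # force-remove the stale lock held by the offline member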

Similar to https://github.com/lxc/incus/pull/515#event-11883484924

tomponline commented 6 months ago

Additionally, we should investigate how LXD can decide to forcefully terminate instances on remote storage running on a cluster member that is isolated from the rest of the cluster, so that instances are not left running after a network partition.