canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Support live migration of VMs with attached volumes #12694

Closed benoitjpnet closed 1 month ago

benoitjpnet commented 8 months ago

I have the following cluster:

root@mc10:~# lxc cluster ls
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc10 | https://192.168.1.10:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc11 | https://192.168.1.11:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc12 | https://192.168.1.12:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
root@mc10:~# 

I start one VM:

lxc launch ubuntu:22.04 v1 --vm --target mc10

I move it:

root@mc10:~# lxc exec v1 -- uptime
 11:45:17 up 0 min,  0 users,  load average: 0.59, 0.13, 0.04
root@mc10:~# 

root@mc10:~# lxc move v1 --target mc11
Error: Instance move to destination failed: Error transferring instance data: Failed migration on target: Failed getting migration target filesystem connection: websocket: bad handshake
root@mc10:~# 

roosterfish commented 8 months ago

Hi @benoitjpnet, it looks like live migration isn't yet enabled on your cluster. You can confirm this by checking the LXD daemon's error logs using journalctl -u snap.lxd.daemon.
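
For reference, a quick way to check is sketched below (assuming the instance is named v1 and the snap deployment shown in this thread):

# Search the daemon log on the source member for migration-related errors
journalctl -u snap.lxd.daemon --since "10 minutes ago" | grep -i migration

# Show the instance's stateful migration setting (empty or false means live migration is disabled)
lxc config get v1 migration.stateful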

benoitjpnet commented 8 months ago

The only error I see is:

Jan 02 08:44:47 mc10 lxd.daemon[2134]: time="2024-01-02T08:44:47Z" level=error msg="Failed migration on target" clusterMoveSourceName=builder err="Failed getting migration target filesystem connection: websocket: bad handshake" instance=builder live=true project=default push=false

The error message could be more explicit.

But thank you, I re-read the documentation and I had missed this step:

Set migration.stateful to true on the instance.

Then I ran lxc move v1 --target mc10, but it is stuck. I guess it is not related to MicroCloud though.
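
As a minimal sketch of that step (assuming an instance named v1; the setting typically only takes effect after the VM is restarted, and the root disk's size.state may also need to be at least the memory limit, as in the working example further down this thread):

lxc config set v1 migration.stateful=true   # enable stateful (live) migration for this instance
lxc restart v1                              # restart so the VM picks up the new setting
lxc move v1 --target mc11                   # retry the live move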

roosterfish commented 8 months ago

Can you check the logs on both ends (source and target host)? One of them should indicate that stateful migration has to be enabled in the config.

benoitjpnet commented 8 months ago

Concerning the stuck part:

Jan 02 08:50:50 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:50Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=builder_var_lib_laminar driver=disk err="Stateful migration unsupported" instance=builder project=default
Jan 02 08:50:50 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:50Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=builder instanceType=virtual-machine project=default
Jan 02 08:50:51 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:51Z" level=warning msg="Failed reading from state connection" err="read tcp 192.168.1.10:57884->192.168.1.11:8443: use of closed network connection" instance=builder instanceType=virtual-machine project=default

I use Ceph RBD + CephFS and it seems CephFS is not supported for live migration :(
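
In the meantime, a possible workaround sketch (assuming the CephFS custom volume lives in a pool named cephfs-pool, is called vol, and is mounted at /mnt/vol; the detach may require stopping the VM first if the device cannot be hot-unplugged):

lxc storage volume detach cephfs-pool vol v1              # remove the CephFS volume from the VM
lxc move v1 --target mc11                                 # live-migrate without the attached volume
lxc storage volume attach cephfs-pool vol v1 vol /mnt/vol # re-attach it on the new cluster member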

benoitjpnet commented 8 months ago

> Can you check the logs on both ends (source and target host)? One of them should indicate that stateful migration has to be enabled in the config.

I was not able to find such logs/messages.

roosterfish commented 8 months ago

I was able to reproduce the warnings including the hanging migration. I guess you have added a new CephFS storage pool to the MicroCloud cluster and attached one of its volumes to the v1 instance which you are trying to migrate?
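
A rough sketch of that reproduction (names are illustrative, an existing CephFS filesystem called lxd_cephfs is assumed, and on a cluster the pool would normally be created per member with --target first, or by MicroCloud itself):

lxc storage create cephfs-pool cephfs source=lxd_cephfs                       # CephFS-backed storage pool
lxc storage volume create cephfs-pool vol                                     # custom filesystem volume
lxc config device add v1 vol disk pool=cephfs-pool source=vol path=/mnt/vol   # attach it to the VM
lxc move v1 --target mc11                                                     # live move then hangs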

@tomponline this looks to be an error on the LXD side when migrating VMs that have a CephFS volume attached. Should we block migration of VMs with attached volumes? At least the error from QEMU below kind of indicates that this is not supported. Is that the reason why the DiskVMVirtiofsdStart function returns Stateful migration unsupported?

On the source host you can see the following log messages:

Jan 02 13:11:21 m2 lxd.daemon[7034]: time="2024-01-02T13:11:21Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=vol driver=disk err="Stateful migration unsupported" instance=v1 project=default
Jan 02 13:11:21 m2 lxd.daemon[7034]: time="2024-01-02T13:11:21Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=v1 instanceType=virtual-machine project=default
...
Jan 02 13:11:50 m2 lxd.daemon[7034]: time="2024-01-02T13:11:50Z" level=error msg="Failed migration on source" clusterMoveSourceName=v1 err="Failed starting state transfer to target: Migration is disabled when VirtFS export path 'NULL' is mounted in the guest using mount_tag 'lxd_vol'" instance=v1 live=true project=default push=false

On the target side:

Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=vol driver=disk err="Stateful migration unsupported" instance=v1 project=default
Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=v1 instanceType=virtual-machine project=default
Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Failed reading from state connection" err="read tcp 10.171.103.8:38154->10.171.103.138:8443: use of closed network connection" instance=v1 instanceType=virtual-machine project=default

benoitjpnet commented 8 months ago

> I was able to reproduce the warnings including the hanging migration. I guess you have added a new CephFS storage pool to the MicroCloud cluster and attached one of its volumes to the v1 instance which you are trying to migrate?

Correct.

tomponline commented 8 months ago

Thanks @roosterfish @benoitjpnet, I have moved this to LXD for triaging.

@benoitjpnet can you confirm that live migration works if there is no volume attached?

benoitjpnet commented 8 months ago

Yes it works.

root@mc10:~# lxc launch ubuntu:22.04 v1 --vm --target mc10 -d root,size=10GiB -d root,size.state=4GiB -c limits.memory=4GiB -c limits.cpu=4 -c migration.stateful=true
Creating v1
Starting v1
root@mc10:~# lxc exec v1 -- uptime
 13:10:21 up 0 min,  0 users,  load average: 0.74, 0.19, 0.06
root@mc10:~# lxc move v1 --target mc11
root@mc10:~# lxc exec v1 -- uptime
 13:10:47 up 0 min,  0 users,  load average: 0.49, 0.17, 0.06
root@mc10:~# 

tomponline commented 8 months ago

@MusicDin can you please evaluate what happens when trying to migrate (in both live and non-live modes) a VM with custom volumes attached (filesystem and block types) and identify what does and doesn't work.

I suspect we will need quite a bit of work to add support for live-migrating custom block volumes in remote storage, and that live-migrating VMs with custom local volumes isn't going to work either.

So we are likely going to need to land an improvement to detect incompatible scenarios and return a clear error message, and then potentially add a work item for a future roadmap to improve migration support of custom volumes.
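
A sketch of the scenarios to cover (pool names are placeholders, and the commands assume the v1 test VM from earlier in this thread):

# Custom filesystem volume (e.g. on a CephFS pool)
lxc storage volume create <pool> fsvol
lxc config device add v1 fsvol disk pool=<pool> source=fsvol path=/mnt/fsvol

# Custom block volume (e.g. on a Ceph RBD or local pool)
lxc storage volume create <pool> blockvol --type=block
lxc config device add v1 blockvol disk pool=<pool> source=blockvol

# Non-live move (VM stopped) and live move (VM running with migration.stateful=true)
lxc stop v1 && lxc move v1 --target mc11
lxc start v1 && lxc move v1 --target mc10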

tomponline commented 7 months ago

https://github.com/canonical/lxd/pull/12733 improves the error the user sees in this situation.

tomponline commented 2 months ago

Seems relevant: https://github.com/lxc/incus/pull/686

tomponline commented 2 months ago

Hi @boltmark, as you're working on some migration work related to https://github.com/canonical/lxd/pull/13695, I thought it would also be a good opportunity for you to take a look at fixing this issue, considering https://github.com/lxc/incus/pull/686.