canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

live migration fails: error: Migration failed on target host: Error transferring container data: x509: certificate is valid for Target_VM, not Source_VM #3948

Closed psinha01 closed 6 years ago

psinha01 commented 6 years ago

About the source and destination VMs (VM1 and VM2):

- Distribution: Ubuntu
- Distribution version: Ubuntu 17.04 (zesty)
- The output of "lxc info" (detailed output at: )
- Kernel version: 4.10.0-19-generic
- LXC version: 2.12
- LXD version: 2.12
- CRIU version: 3.5 (installed from source)

About the host (where the source and destination VMs are running):

- Distribution: Ubuntu
- Distribution version: Ubuntu 14.04.5 LTS
- Kernel version: 4.8.0

Issue description

On a fresh VM (Ubuntu 17.04) running on the host (Ubuntu 14.04.5 LTS), I created a container 'cnt1' from the ubuntu:14.04 image and started a user-space process (P1) inside it. I then ran the command to migrate cnt1 from the source VM (VM1) to the destination VM (VM2), and it failed with the following error: Migration failed on target host: Error transferring container data: x509: certificate is valid for VM2, not VM1

Steps to reproduce

  1. On VM1: lxc launch ubuntu:14.04 cnt1
  2. On VM1: lxc exec cnt1 -- bash
  3. On VM1: lxc move cnt1 goo: (goo is the remote name for VM2 on VM1)

Information to attach

Log:

        lxc 20171016185401.263 WARN     lxc_start - start.c:signal_handler:322 - Invalid pid for SIGCHLD. Received pid 7817, expected pid 7824.
        lxc 20171016185734.778 ERROR    lxc_criu - criu.c:do_dump:1124 - dump failed with 1
        lxc 20171016185734.778 ERROR    lxc_criu - criu.c:do_dump:1138 - criu output: Will skip in-flight TCP connections

psinha01 commented 6 years ago

If I don't run any process inside the container, there is no error. However, I have seen some of the system processes inside the container get new process IDs after migration.

More interesting: the live migration described above (with a user-space process running inside the container) sometimes works, but after migration that user-space process is no longer running. I think live migration is killing the process. Since some of the kernel processes also get new process IDs, my conclusion is that live migration is recreating the whole process tree, excluding any user-space process.

I have also tried an alpine/edge container instead of ubuntu 14.04 and hit the same issue. Kindly help.
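One way to check the "new process IDs" observation is to snapshot the process list before and after the move and compare the two. The following is a minimal sketch, not something taken from LXD or CRIU: `snapshot_procs` and `surviving_procs` are hypothetical helper names, and it assumes the container ships a procps-style `ps` (as the ubuntu:14.04 image does; BusyBox `ps` on Alpine takes different options).

```shell
# Hypothetical helper: save "PID COMMAND" pairs from a container to a file.
# Assumes a procps-style ps inside the container ("=" suppresses the headers).
snapshot_procs() {  # usage: snapshot_procs <container> <outfile>
    lxc exec "$1" -- ps -eo pid=,comm= > "$2"
}

# Hypothetical helper: print the "PID COMMAND" pairs present in both snapshots.
# "$1 = $1" makes awk rebuild each line with single spaces, so padding in the
# ps output does not affect the comparison.
surviving_procs() {  # usage: surviving_procs <before-file> <after-file>
    awk '{ $1 = $1 }
         NR == FNR { seen[$0] = 1; next }
         ($0 in seen)' "$1" "$2"
}

# Example session (not executed here), using the names from this issue:
#   snapshot_procs cnt1 before.txt
#   lxc move cnt1 goo:
#   snapshot_procs goo:cnt1 after.txt
#   surviving_procs before.txt after.txt
```

PID 1 will normally match in both snapshots regardless of how the container came up, so it says little on its own. If only PID 1 survives, that would suggest the container was restarted on the target rather than restored from a checkpoint, since a checkpoint/restore is expected to keep the PIDs seen inside the container.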

stgraber commented 6 years ago

Newer LXD should get you a better error message. You can upgrade with:

apt install -t zesty-backports lxd lxd-client

On both your systems. That should get you LXD 2.18.

I still expect things to fail because of CRIU, but that may get you a slightly better error.

stgraber commented 6 years ago

A few things to note with live migration:

psinha01 commented 6 years ago

Thanks for the reply. I had been struggling with live migration for a while, but it works now. Adding some notes to help other users:

To avoid this error (this may just be a workaround):

  1. Kill all the processes you started inside the container, leaving only the system processes.
  2. Do the live migration. It will work, but the IPv6 address will be empty on the destination.

Once the IPv6 column is empty for your migrated container, you can start processes again and retry live migration. It will work every time after that.

Question for @stgraber: is there any way to measure the total migration time and the downtime? Thanks

stgraber commented 6 years ago

Sounds like you're hitting a few CRIU issues around IPv6 handling. I remember reporting a number of those (disappearing address) in the past, but not much progress has been made on that.

As for migration time and downtime, you can time the actual "lxc move" which would be the entire process, including initial fs sync, container stop, state sync and container start.

The downtime should only be the time needed for container stop, state sync and container start, but this can be made much worse depending on your network infrastructure, especially how long it takes for the path to the container address to be learned (ARP and potentially STP at play there).
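The two measurements above could be sketched roughly as follows. This is only an illustration under stated assumptions: `timed`, `probe` and `longest_outage` are hypothetical helper names, the probe uses Linux iputils `ping` (`-W` is the reply timeout in seconds), and the one-second probe interval means the downtime estimate is coarse. It measures reachability from wherever the probe runs, so it includes the network learning delay mentioned above.

```shell
# Hypothetical helper: run a command and report its wall-clock time on stderr.
timed() {  # usage: timed lxc move cnt1 goo:
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s" >&2
    return "$status"
}

# Hypothetical helper: append "<epoch_seconds> <ok|fail>" to a log once per
# second until interrupted. Assumes iputils ping.
probe() {  # usage: probe <container-ip> <logfile>
    while :; do
        if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then s=ok; else s=fail; fi
        echo "$(date +%s) $s" >> "$2"
        sleep 1
    done
}

# Hypothetical helper: longest outage in seconds, measured from the first
# failed probe of a run to the next successful probe (or to the last failed
# probe if the log ends while the container is still unreachable).
longest_outage() {  # usage: longest_outage <logfile>
    awk '$2 == "fail" { if (!down) { start = $1; down = 1 } last = $1 }
         $2 == "ok" && down { gap = $1 - start; if (gap > max) max = gap; down = 0 }
         END { if (down && last - start > max) max = last - start; print max + 0 }' "$1"
}
```

A possible session: start `probe <container-ip> probe.log &` from a third machine, run `timed lxc move cnt1 goo:` on VM1, stop the probe, then run `longest_outage probe.log`. The first number approximates the total migration time and the second the downtime as seen from the probing machine.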

stgraber commented 6 years ago

Going to close this issue since there's no apparent issue with the way LXD calls into CRIU. Anything after that is usually a CRIU issue. We're happy to chat about those though :)