psinha01 commented 6 years ago

About the source and destination VMs (VM1 and VM2):

Distribution: Ubuntu
Distribution version: Ubuntu 17.04 (zesty)
The output of "lxc info" (detailed output at: )
Kernel version: 4.10.0-19-generic
LXC version: 2.12
LXD version: 2.12
CRIU version: 3.5 (installed from source)

About the host (where the source and destination VMs are running):

Distribution: Ubuntu
Distribution version: Ubuntu 14.04.5 LTS
Kernel version: 4.8.0

Issue description

On a fresh VM (Ubuntu 17.04) running on the host (Ubuntu 14.04.5 LTS), I created a container 'cnt1' from the ubuntu:14.04 image and started a user-space process (P1) inside it. I then ran the command to migrate cnt1 from the source VM (VM1) to the destination VM (VM2), and it failed with the following error: Migration failed on target host: Error transferring container data: x509: certificate is valid for VM2, not VM1

Steps to reproduce

1. Step one: On VM1: lxc launch ubuntu:14.04 cnt1
2. Step two: On VM1: lxc exec cnt1 -- bash
3. Step three: On VM1: lxc move cnt1 goo: (goo is the remote name for VM2 on VM1)

Information to attach

[ ] dmesg on VM1:

[ 1249.953128] kauditd_printk_skb: 84 callbacks suppressed
[ 1249.953129] audit: type=1400 audit(1508180040.876:40): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-cnt1_</var/lib/lxd>" pid=7816 comm="apparmor_parser"
[ 1249.995968] lxdbr0: port 3(vethI4YEQX) entered blocking state
[ 1249.995970] lxdbr0: port 3(vethI4YEQX) entered disabled state
[ 1249.996154] device vethI4YEQX entered promiscuous mode
[ 1249.996271] IPv6: ADDRCONF(NETDEV_UP): vethI4YEQX: link is not ready
[ 1250.030349] eth0: renamed from vethHNCEBX
[ 1250.051908] IPv6: ADDRCONF(NETDEV_CHANGE): vethI4YEQX: link becomes ready
[ 1250.051982] lxdbr0: port 3(vethI4YEQX) entered blocking state
[ 1250.051984] lxdbr0: port 3(vethI4YEQX) entered forwarding state
[ 1250.537294] audit: type=1400 audit(1508180041.459:41): apparmor="STATUS" operation="profile_load" label="lxd-cnt1_</var/lib/lxd>//&:lxd-cnt1_<var-lib-lxd>://unconfined" name="/sbin/dhclient" pid=8388 comm="apparmor_parser"
[ 1250.537755] audit: type=1400 audit(1508180041.463:42): apparmor="STATUS" operation="profile_load" label="lxd-cnt1_</var/lib/lxd>//&:lxd-cnt1_<var-lib-lxd>://unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=8388 comm="apparmor_parser"
[ 1250.538169] audit: type=1400 audit(1508180041.463:43): apparmor="STATUS" operation="profile_load" label="lxd-cnt1_</var/lib/lxd>//&:lxd-cnt1_<var-lib-lxd>://unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=8388 comm="apparmor_parser"
[ 1250.563253] audit: type=1400 audit(1508180041.487:44): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-cnt1_<var-lib-lxd>" profile="/sbin/dhclient" name="/dev/pts/1" pid=8576 comm="dhclient" requested_mask="wr" denied_mask="wr" fsuid=100000 ouid=100000
[ 1250.563258] audit: type=1400 audit(1508180041.487:45): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-cnt1_<var-lib-lxd>" profile="/sbin/dhclient" name="/dev/pts/1" pid=8576 comm="dhclient" requested_mask="wr" denied_mask="wr" fsuid=100000 ouid=100000
[ 1250.564634] audit: type=1400 audit(1508180041.487:46): apparmor="STATUS" operation="profile_replace" label="lxd-cnt1_</var/lib/lxd>//&:lxd-cnt1_<var-lib-lxd>://unconfined" name="/sbin/dhclient" pid=8410 comm="apparmor_parser"
[ 1250.565088] audit: type=1400 audit(1508180041.487:47): apparmor="STATUS" operation="profile_replace" label="lxd-cnt1_</var/lib/lxd>//&:lxd-cnt1_<var-lib-lxd>://unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=8410 comm="apparmor_parser"
[ 1250.566067] audit: type=1400 audit(1508180041.487:48): apparmor="STATUS" operation="profile_replace" label="lxd-cnt1_</var/lib/lxd>//&:lxd-cnt1_<var-lib-lxd>://unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=8410 comm="apparmor_parser"
[ 1463.351662] cgroup: new mount options do not match the existing superblock, will be ignored
[ 1463.575375] cgroup: new mount options do not match the existing superblock, will be ignored
[ 1463.775327] cgroup: new mount options do not match the existing superblock, will be ignored

[ ] dmesg on VM2: empty

[ ] /var/log/lxd/lxd.log on VM1:

ephemeral=false lvl=info msg="Creating container" name=cnt1 t=2017-10-16T14:53:56-0400
ephemeral=false lvl=info msg="Created container" name=cnt1 t=2017-10-16T14:53:56-0400
action=start created=2017-10-16T14:53:56-0400 ephemeral=false lvl=info msg="Starting container" name=cnt1 stateful=false t=2017-10-16T14:54:00-0400 used=1969-12-31T19:00:00-0500
action=start created=2017-10-16T14:53:56-0400 ephemeral=false lvl=info msg="Started container" name=cnt1 stateful=false t=2017-10-16T14:54:01-0400 used=1969-12-31T19:00:00-0500
actionscript=true created=2017-10-16T14:53:56-0400 ephemeral=false lvl=info msg="Migrating container" name=cnt1 statedir=/tmp/lxd_checkpoint_254505011 stop=true t=2017-10-16T14:57:33-0400 used=2017-10-16T18:54:00+0000
actionscript=true created=2017-10-16T14:53:56-0400 ephemeral=false lvl=info msg="Failed migrating container" name=cnt1 statedir=/tmp/lxd_checkpoint_254505011 stop=true t=2017-10-16T14:57:34-0400 used=2017-10-16T18:54:00+0000

[ ] /var/log/lxd/lxd.log on VM2:

ephemeral=false lvl=info msg="Creating container" name=cnt1 t=2017-10-16T14:57:14-0400
ephemeral=false lvl=info msg="Created container" name=cnt1 t=2017-10-16T14:57:14-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:14-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:14-0400
err="migration dump failed\n(00.807978) Error (criu/sk-netlink.c:73): The socket has data to read\n(00.808005) Error (criu/cr-dump.c:1347): Dump files (pid: 9424) failed with -1\n(00.820961) Error (criu/cr-dump.c:1697): Dumping FAILED." lvl=eror msg="Error during migration sink" t=2017-10-16T14:57:34-0400
created=2017-10-16T14:57:14-0400 ephemeral=false lvl=info msg="Deleting container" name=cnt1 t=2017-10-16T14:57:34-0400 used=1969-12-31T19:00:00-0500
lvl=eror msg="Rsync receive failed: /tmp/lxd_restore_442957554/: exit status 12: rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]\nrsync error: error in rsync protocol data stream (code 12) at io.c(235) [Receiver=3.1.2]\n" t=2017-10-16T14:57:34-0400
created=2017-10-16T14:57:14-0400 ephemeral=false lvl=info msg="Deleted container" name=cnt1 t=2017-10-16T14:57:35-0400 used=1969-12-31T19:00:00-0500
ephemeral=false lvl=info msg="Creating container" name=cnt1 t=2017-10-16T14:57:35-0400
ephemeral=false lvl=info msg="Created container" name=cnt1 t=2017-10-16T14:57:35-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:35-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:35-0400
err="Unable to connect to: 10.229.88.1:8443" lvl=eror msg="Error during migration sink" t=2017-10-16T14:57:45-0400
created=2017-10-16T14:57:35-0400 ephemeral=false lvl=info msg="Deleting container" name=cnt1 t=2017-10-16T14:57:45-0400 used=1969-12-31T19:00:00-0500
created=2017-10-16T14:57:35-0400 ephemeral=false lvl=info msg="Deleted container" name=cnt1 t=2017-10-16T14:57:45-0400 used=1969-12-31T19:00:00-0500
ephemeral=false lvl=info msg="Creating container" name=cnt1 t=2017-10-16T14:57:45-0400
ephemeral=false lvl=info msg="Created container" name=cnt1 t=2017-10-16T14:57:45-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:46-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:46-0400
err="Unable to connect to: [fd42:f54:6d6a:3674::1]:8443" lvl=eror msg="Error during migration sink" t=2017-10-16T14:57:46-0400
created=2017-10-16T14:57:45-0400 ephemeral=false lvl=info msg="Deleting container" name=cnt1 t=2017-10-16T14:57:46-0400 used=1969-12-31T19:00:00-0500
created=2017-10-16T14:57:45-0400 ephemeral=false lvl=info msg="Deleted container" name=cnt1 t=2017-10-16T14:57:46-0400 used=1969-12-31T19:00:00-0500
ephemeral=false lvl=info msg="Creating container" name=cnt1 t=2017-10-16T14:57:46-0400
ephemeral=false lvl=info msg="Created container" name=cnt1 t=2017-10-16T14:57:46-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:46-0400
lvl=warn msg="Unable to update backup.yaml at this time." name=cnt1 t=2017-10-16T14:57:46-0400
err="x509: certificate is valid for o197, not o196" lvl=eror msg="Error during migration sink" t=2017-10-16T14:57:46-0400
created=2017-10-16T14:57:46-0400 ephemeral=false lvl=info msg="Deleting container" name=cnt1 t=2017-10-16T14:57:46-0400 used=1969-12-31T19:00:00-0500
created=2017-10-16T14:57:46-0400 ephemeral=false lvl=info msg="Deleted container" name=cnt1 t=2017-10-16T14:57:46-0400 used=1969-12-31T19:00:00-0500

[ ] lxc info on VM1: https://pastebin.com/qbd5fGvM
[ ] lxc info on VM2: https://pastebin.com/Z8TtNt7Z

[ ] lxc info cnt1 --show-log


Name: cnt1
Remote: unix:/var/lib/lxd/unix.socket
Architecture: x86_64
Created: 2017/10/16 18:53 UTC
Status: Running
Type: persistent
Profiles: default
Pid: 7824
Ips:
  eth0: inet    10.229.88.44    vethI4YEQX
  eth0: inet6   fd42:f54:6d6a:3674:216:3eff:febd:4c5b   vethI4YEQX
  eth0: inet6   fe80::216:3eff:febd:4c5b    vethI4YEQX
  lo:   inet    127.0.0.1
  lo:   inet6   ::1
Resources:
  Processes: 17
  Disk usage:
    root: 96.46MB
  CPU usage:
    CPU usage (in seconds): 356
  Memory usage:
    Memory (current): 87.87MB
    Memory (peak): 115.96MB
  Network usage:
    eth0:
      Bytes received: 20.12MB
      Bytes sent: 305.64kB
      Packets received: 14261
      Packets sent: 4559
    lo:
      Bytes received: 1.27kB
      Bytes sent: 1.27kB
      Packets received: 16
      Packets sent: 16

Log:

        lxc 20171016185401.263 WARN     lxc_start - start.c:signal_handler:322 - Invalid pid for SIGCHLD. Received pid 7817, expected pid 7824.
        lxc 20171016185734.778 ERROR    lxc_criu - criu.c:do_dump:1124 - dump failed with 1
        lxc 20171016185734.778 ERROR    lxc_criu - criu.c:do_dump:1138 - criu output: Will skip in-flight TCP connections

psinha01 commented 6 years ago

If I don't run any process inside the container, there is no error, although some of the system processes inside the container get new process IDs after migration. More interesting: sometimes the live migration described above (with a user-space process running inside the container) works, but after migration I don't see that user-space process running. I think live migration is killing the process. Since some of the system processes get new process IDs, my conclusion is that live migration recreates the whole process tree, excluding any user-space process. I have also tried an alpine/edge container instead of ubuntu 14.04 and hit the same issue. Kindly help.

stgraber commented 6 years ago

Newer LXD should get you a better error message. You can upgrade with:

apt install -t zesty-backports lxd lxd-client

On both your systems. That should get you LXD 2.18.

I still expect things to fail because of CRIU, but that may get you a slightly better error.

stgraber commented 6 years ago

A few things to note with live migration:

There are a LOT of things which CRIU cannot serialize. If you hit any of those, your container will fail to checkpoint and the live migration will fail.
The exact set of what's supported depends on the version of CRIU and kernel. We used to contribute to both to try to make live migration more reliable, but short of commercial engagements we can't justify the rather large amount of time this takes to keep on track.
One thing that may hit you above is that CRIU can only see processes which were spawned from inside the container. If something is spawned as the result of a "lxc exec", it will not count as being in the container and so will be missing upon restore.

psinha01 commented 6 years ago

Thanks for the reply, I have been struggling with live migration for a while, but it works now. Adding some notes to help other users:

As suggested by @stgraber : if you want to restore the process on the destination after live migration, then start that process from inside the container instead of using "lxc exec". You can do that by using ssh login to your container. Follow these steps given by @stgraber to do ssh login to your container.
The configuration and the versions of CRIU, LXD, kernel and distribution that worked for me are given here
Note: you will still get an error on live migration (this may be a bug), but there is a way to avoid it. To reproduce the error:
Step one: configure your VMs, create a container with SSH access, and start a process inside the container as described in the links above.
Step two: try to live-migrate this container. It fails with: error: Migration failed on target host: Error transferring container data: Unable to connect to: [fd42:4b60:10a6:c273::1]:8443

To avoid this error (possibly just a workaround):
Step one: kill all the processes you started inside the container, leaving only the system processes.
Step two: do the live migration. It will work, but the IPv6 address will be empty on the destination. Once the IPv6 column is empty for your migrated container, you can start processes again and retry live migration. It will work every time after that.

Question to @stgraber : is there any way to check total migration time and downtime? Thanks

stgraber commented 6 years ago

Sounds like you're hitting a few CRIU issues around IPv6 handling. I remember reporting a number of those (disappearing address) in the past, but not much progress has been made on that.

As for migration time and downtime, you can time the actual "lxc move" which would be the entire process, including initial fs sync, container stop, state sync and container start.
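
A rough sketch of timing the whole move on VM1 is below. The helper name `time_cmd` is hypothetical and not part of LXD; `cnt1` and `goo:` are the container and remote names from the reproduction steps. It uses `date +%s`, so the result is only accurate to the second.

```shell
# Hypothetical helper: run any command and report its wall-clock time
# in whole seconds, preserving the command's exit status.
time_cmd() {
    tc_start=$(date +%s)
    "$@"
    tc_status=$?
    tc_end=$(date +%s)
    echo "elapsed: $((tc_end - tc_start))s (exit status $tc_status)"
    return "$tc_status"
}

# Usage on VM1 (not executed here):
#   time_cmd lxc move cnt1 goo:
```

The shell's own `time` keyword (`time lxc move cnt1 goo:`) gives the same total with finer resolution; the helper is just easier to reuse in a script.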

The downtime should only be the time needed for container stop, state sync and container start, but this can be made much worse depending on your network infrastructure, especially how long it takes for the path to the container address to be learned (ARP and potentially STP at play there).
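
One coarse way to estimate downtime is to probe the container at a fixed interval from a machine that can reach it while the move runs, then count the failed probes. This is only a sketch under assumptions: `count_failed_probes` and `PROBE_INTERVAL` are hypothetical names, `<container-ip>` is a placeholder, and the estimate is no finer than the probe interval.

```shell
# Hypothetical helper: run a probe command a fixed number of times,
# sleeping PROBE_INTERVAL seconds (default 1) between runs, and report
# how many probes failed.
count_failed_probes() {
    cfp_total=$1
    shift
    cfp_failed=0
    cfp_i=0
    while [ "$cfp_i" -lt "$cfp_total" ]; do
        if ! "$@" >/dev/null 2>&1; then
            cfp_failed=$((cfp_failed + 1))
        fi
        cfp_i=$((cfp_i + 1))
        sleep "${PROBE_INTERVAL:-1}"
    done
    echo "failed probes: $cfp_failed of $cfp_total"
}

# Usage while "lxc move" is running elsewhere (not executed here):
#   count_failed_probes 60 ping -c 1 -W 1 <container-ip>
```

Approximate downtime is then the number of failed probes multiplied by the interval. Because it is measured over the network, it includes the path-learning delay mentioned above.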

stgraber commented 6 years ago

Going to close this issue since there's no apparent issue with the way LXD calls into CRIU. Anything after that is usually a CRIU issue. We're happy to chat about those though :)

canonical / lxd

live migration fails: error: Migration failed on target host: Error transferring container data: x509: certificate is valid for Target_VM, not Source_VM #3948
