The error in your case appears to be "error: device or resource busy". I don't know if @brauner has an idea of what may be causing this on LVM and if we can work around that issue too.
The reason you have that container listed but not the others is simply that it's the first in alphabetical order. It may be that container which failed to get converted or the next one; I'm not sure.
@brauner did you do LVM upgrade tests from 2.8 to 2.10.1 with running containers? I wonder if it's just an issue of a mountpoint needing some convincing.
I'm pretty sure that this might again be solved by using lazy umounts.
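For reference, in shell terms the difference looks like this (the path is only an example of a busy container mountpoint, not necessarily the one LXD uses internally):
umount /var/lib/lxd/containers/apollo3      # fails with "device or resource busy" while something still uses it
umount -l /var/lib/lxd/containers/apollo3   # lazy umount: detach now, clean up once the last user lets go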
There is a very good chance that rebooting your system will fix the issue as all containers would then get stopped, all mounts undone and so give LXD a clean environment to upgrade on reboot. But if you're willing to wait a bit we can try to fix this for running containers and give you an updated lxd binary to try.
And yes, I've run upgrade tests from 2.8 to 2.10.1 with LVM: https://asciinema.org/a/104652
@sbworth
https://dl.stgraber.org/lxd-3026 sha256: 036428ed214b786c771cfa64dabf888be492a44ec3763607c9656bdc46d850f0
To run:
systemctl stop lxd.service lxd.socket
./lxd --debug --group lxd
Will try as soon as I can get to a console.
Is lxd-3026 intended to fix the problem or just provide diagnostics?
It should fix the problem, or at least get past the error we saw earlier.
What's the output?
I suspect that the point of failure is that the umount didn't succeed because something is holding the mountpoint busy. If that's the case @stgraber's binary should fix this problem.
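If you want to see for yourself what is keeping a mountpoint busy, something like this usually does it (the path is only an example; point it at whatever fails to unmount):
fuser -vm /var/lib/lxd/containers/apollo3                    # processes with files open on that filesystem
lsof +D /var/lib/lxd/containers/apollo3 2>/dev/null | head   # open files under the mountpoint, first few lines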
Sorry, but I will have to put off testing until I have the chance to warn some users, which will likely mean putting off the actual test until tomorrow morning (EST). Thank you for your very rapid response.
@sbworth Note that the upgrade will not stop any running containers, so assuming that your LXD daemon is currently offline, it really can't make things any worse.
OK. Will try momentarily.
# ./lxd --debug --group lxd
INFO[03-06|16:48:08] LXD 2.10.1 is starting in normal mode path=/var/lib/lxd
WARN[03-06|16:48:08] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[03-06|16:48:08] Kernel uid/gid map:
INFO[03-06|16:48:08] - u 0 0 4294967295
INFO[03-06|16:48:08] - g 0 0 4294967295
INFO[03-06|16:48:08] Configured LXD uid/gid map:
INFO[03-06|16:48:08] - u 0 100000 65536
INFO[03-06|16:48:08] - g 0 100000 65536
DBUG[03-06|16:48:08] Initializing and checking storage pool "lxd".
DBUG[03-06|16:48:08] Initializing an LVM driver.
DBUG[03-06|16:48:08] Checking LVM storage pool "lxd".
DBUG[03-06|16:48:08] Checked LVM storage pool "lxd".
DBUG[03-06|16:48:08] Applying patch: storage_api
WARN[03-06|16:48:08] Database already contains a valid entry for the storage pool: lxd.
WARN[03-06|16:48:08] Storage volumes database already contains an entry for the container.
error: exit status 5
Exit status 5 means that either the physical volume, volume group or logical volume is not found.
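You can reproduce that kind of exit status by hand if you want to double-check (hypothetical LV names; the exact code may vary between LVM2 versions):
sudo lvs lxd/containers_apollo3; echo $?   # 0 when the LV exists
sudo lvs lxd/does_not_exist; echo $?       # non-zero, typically 5, when the LV cannot be found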
/var/lib/lxd/storage-pools/lxd/containers now contains:
drwx------+ 5 root root 4096 Jan 2 15:59 apollo3
drwxr-xr-x+ 5 100000 100000 4096 Jan 3 13:28 apps4
drwxr-xr-x+ 4 100000 100000 4096 Nov 24 11:33 apps4-old
I have no container apps4-old, so I am rather confused. Investigating contents.
.
├── apollo3
│ ├── backup.yaml
│ ├── lost+found
│ ├── metadata.yaml
│ ├── rootfs
│ └── templates
├── apps4
│ ├── backup.yaml
│ ├── lost+found
│ ├── metadata.yaml
│ ├── rootfs
│ └── templates
└── apps4-old
├── metadata.yaml
├── rootfs
└── templates
Could it be picking up old data from a previous migration to LVM from dir storage?
Hmm, that's confusing. I wonder if you have that "apps4-old" container in the database but it wasn't showing up due to broken storage.
What does this give you:
sqlite3 /var/lib/lxd/lxd.db "SELECT * FROM containers;"
(You'll need to install the "sqlite3" package if you don't already have it installed)
First line in the table:
3|apps4-old|2|0|0|0|1480005211|2016-11-30 14:52:15.442158778+00:00
No evidence of it in /var/lib/lxd/containers. Looking for other strays.
Ok, so that explains the migration problem. Are you seeing any other such containers in the database list?
If not, the following should do the trick to get rid of it from the database, unblocking the migration:
sqlite3 /var/lib/lxd/lxd.db
PRAGMA foreign_keys=ON;
DELETE FROM containers WHERE name="apps4-old";
<ctrl+d> to exit the interactive DB session
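If you prefer to avoid the interactive session, the same thing works as a one-shot command; the PRAGMA has to be in the same invocation since it only applies to that connection:
sqlite3 /var/lib/lxd/lxd.db 'PRAGMA foreign_keys=ON; DELETE FROM containers WHERE name="apps4-old";'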
Removed apps4-old and another stray, both dating back to days after the original installation and probably before the migration to the LVM backing store.
Ran your lxd again and it died in the same way.
DBUG[03-06|17:12:39] Checking LVM storage pool "lxd".
DBUG[03-06|17:12:39] Checked LVM storage pool "lxd".
DBUG[03-06|17:12:39] Applying patch: storage_api
WARN[03-06|17:12:39] Database already contains a valid entry for the storage pool: lxd.
WARN[03-06|17:12:39] Storage volumes database already contains an entry for the container.
error: exit status 5
Looking for more strays.
Some of my containers mount data from other LVM volumes not managed by LXD. Could that be a problem?
No, container devices don't impact the migration process.
Does the migration occur in alphabetical order?
Yeah, I believe the containers list we fetch from the database is sorted in alphabetical order.
I'm looking at the code now to see where else we're spawning a subprocess without better logging things...
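In the meantime, if it helps to see which container the upgrade should be hitting when it dies, you can list them in that same order straight from the database (assuming the sort really is plain alphabetical):
sqlite3 /var/lib/lxd/lxd.db 'SELECT name FROM containers ORDER BY name;'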
This could be lvrename failing because the container has already been successfully moved. If that's the case I can send a patch.
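In shell terms the guard would be roughly this, i.e. skip the rename when the target LV already exists from an earlier partial run (hypothetical names, just a sketch of the idea, not the actual patch):
vg="lxd"; old="apollo3"; new="containers_apollo3"
if ! sudo lvs "$vg/$new" >/dev/null 2>&1; then
    sudo lvrename "$vg" "$old" "$new"
fi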
@sbworth What does "lvs" show you? I'd expect the containers that show up in /var/lib/lxd/storage-pools/containers to have a "containers" prefix to their name while the others don't. Is your first container renamed and not the others?
What are we lvrename-ing?
The two successfully moved containers are now:
lrwxrwxrwx 1 root root 8 Mar 6 16:48 containers_apollo3 -> ../dm-18
lrwxrwxrwx 1 root root 7 Mar 6 16:48 containers_apps4 -> ../dm-8
The others are all untouched.
@sbworth, @stgraber means the output of lvs on your system, so for your LXD vg: sudo lvs lxd
Ok, so two of them did get renamed, which is causing the issue. That makes sense. @brauner is working on a fix for that; I'll give you a new binary build once I have it.
OK. That is just another way of viewing what I showed you:
LV VG Attr LSize Pool Origin Data% Meta%
containers_apollo3 lxd Vwi-aotz-- 10.00g LXDPool 20.69
containers_apps4 lxd Vwi-aotz-- 10.00g LXDPool 53.16
All the others still have no containers_ prefix.
I just wanted to make sure that we're seeing the right volume group.
I have to catch a train soon; so I will be out of contact for at least 80 minutes. (What TZ are you guys in?) Thanks again for your efforts.
@brauner is in Germany, I'm in Canada (east coast).
https://dl.stgraber.org/lxd-3026 sha256: ab1874f7f33761196969a80c398c2ce542b271499179973770624e038738e37f
Updated version for that last patch.
https://dl.stgraber.org/lxd-3026 sha256: e22e69ea1dae2f113677b89e13d8f571367ec5fde90675e3b3e00c239d7ec44b
And another one to include a few extra checks that @brauner added.
https://dl.stgraber.org/lxd-3026 sha256: 41c7c751ce952989144224ffcc855b0f1883ef123727f35518c3620c02fd04af
@sbworth let us know when you're back. We're hoping we can include this fix in LXD 2.11 (to be released tomorrow).
And again with improved cleanup on failure.
https://dl.stgraber.org/lxd-3026 sha256: d8b098bc0177a8ab10b6c4e86b1fbbec93b43983267f9f314f3153838f7edab7
Latest trial gets further, but still dies with exit 5. Migrated 4 more containers before dying, but still many more to go. Will continue looking for obvious problem sources.
(Sorry for getting back so late. I have a one year old at home. Plans evolve rapidly.)
@sbworth, if possible it would be great if you could provide the new output after having run the latest binary that @stgraber uploaded. This would involve:
What we'd need now is to find out why there is still an exit status 5.
I merged @brauner's fixes since those are good to have anyway. I'm preparing a commit which will convert our codebase to actually log the command on failure; a binary with that applied should make failures a bit more understandable.
Sorry if I was not clear, but I ran the latest stgraber patched lxd. That is what gave me 4 more containers before bailing. I will send along the requested data as soon as I pull it together.
This is going to take some time. I am fine with posting the bits that seem relevant, but I am very reluctant to post the entire dataset to a public forum without anonymization, and I don't have a script ready to do that. I am writing such a script now. It will also let me quickly generate summary data for any follow-up tests.
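(For what it's worth, even a crude sed pass over the captured output covers most of it; the names and file below are placeholders for your own list:)
sed -e 's/realhost1/hostA/g' -e 's/realcontainer1/ct1/g' lxd-debug.log > lxd-debug-anon.log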
@sbworth, oh sure. Don't feel pressured to provide this information directly here. If you'd rather provide it in private (anonymized or non-anonymized) to @stgraber and me you can just send a private mail.
I have an LXD build which should add much of the command logging now; let's see if that tells us what's going on.
@sbworth https://dl.stgraber.org/lxd-3026 sha256: 648d9999383cd87e5412204b3c0962ddec0984a752d2995f27db06ca4c215c5a
Is it expected that LXD will migrate the images folder as well? If I am interpreting this correctly, I think I have old pre-LVM images living side-by-side with the symlinks to the logical volumes that should currently represent them.
-rw-r--r-- 1 root root 844 Mar 3 21:37 11fc1b1d39b9f9cd7e9491871f1421ac4278e1d599ecf5d180f2a6e2483bd172
lrwxrwxrwx 1 root root 73 Mar 3 21:41 11fc1b1d39b9f9cd7e9491871f1421ac4278e1d599ecf5d180f2a6e2483bd172.lv -> /dev/lxd/11fc1b1d39b9f9cd7e9491871f1421ac4278e1d599ecf5d180f2a6e2483bd172
-rw-r--r-- 1 root root 153494344 Mar 3 21:40 11fc1b1d39b9f9cd7e9491871f1421ac4278e1d599ecf5d180f2a6e2483bd172.rootfs
-rw-r--r-- 1 root root 199032954 Jan 10 10:45 18e7ed74d0d653894f65343afbc35b92c6781933c273943d882c36a5c5535533
lrwxrwxrwx 1 root root 73 Jan 10 10:45 18e7ed74d0d653894f65343afbc35b92c6781933c273943d882c36a5c5535533.lv -> /dev/lxd/18e7ed74d0d653894f65343afbc35b92c6781933c273943d882c36a5c5535533
-rw-r--r-- 1 root root 600 Jan 9 13:47 457a80ea4720900b69e5542cea5351f58021331bc96e773e4855a3e2ce1e6595
lrwxrwxrwx 1 root root 73 Jan 9 13:47 457a80ea4720900b69e5542cea5351f58021331bc96e773e4855a3e2ce1e6595.lv -> /dev/lxd/457a80ea4720900b69e5542cea5351f58021331bc96e773e4855a3e2ce1e6595
-rw-r--r-- 1 root root 98614308 Jan 9 13:47 457a80ea4720900b69e5542cea5351f58021331bc96e773e4855a3e2ce1e6595.rootfs
Sorry, my interpretation must be wrong. The time stamps all indicate recent access. Apologies for publicly arguing with myself.
Yeah, LXD will migrate images too, but it does that after it's done with containers and snapshots.
When it does, /var/lib/lxd should only contain the image files themselves and the mountpoint for the LVs will be created under /var/lib/lxd/storage-pools/images/ and those symlinks should then disappear.
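A quick way to check where that stands, assuming those symlinks live under /var/lib/lxd/images as in your listing:
ls /var/lib/lxd/images/*.lv 2>/dev/null | wc -l   # count of leftover .lv symlinks; should drop to zero after the image migration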
Hello,
I believe that I have a variant of the problem seen in issue #3024 which I have been following with interest. After upgrade to 2.10.1 from 2.8.x, lxd cannot start up.
Required information
Issue description
I have two systems, sys1 and sys2. Sys1 is using dir storage, while sys2 is using LVM.
With sys1, I migrated from 2.8.x to 2.9.x and then to 2.10.x. After resolving an issue with a change in profile inheritance of the disk device after the 2.9.x upgrade, sys1 seems to have upgraded to 2.10.x ok.
With sys2, I migrated directly from 2.8.x to 2.10.x. This was inadvertent, as I had just sorted out the 2.9.x issue on sys1 and intended to move sys2 to 2.9.x. When lxd attempted to restart, the lxc command line client stopped responding.
Checking /var/log/lxd/lxd.log, we see:
journalctl -u lxd:
Sample of /var/lib/lxd/containers:
File tree listing of /var/lib/lxd/storage-pools:
That is, the storage-pools area is empty. (Were the container rootfs links supposed to be migrated to storage-pools?)
The images area seems untouched:
Output from pvs and vgs and -- highly edited for readability -- output from lvs:
Data from lxd.db:
It looks somewhat odd to me that host astro3 has an entry in the storage_volumes table when nothing else does. It does differ in being a privileged container.
Any help you can provide to get regular access restored will be greatly appreciated. For the moment, the containers continue to provide their services. Let me know if I can provide any other useful data or perform any non-destructive tests.