canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Runtime panic after upgrading from lxd 3 to lxd 5.0.3 #13885

Closed: endikap100 closed this issue 2 months ago

endikap100 commented 2 months ago

Issue description

After an update from LXD 3 to 5, LXD panics at start time with the following error:

Aug 07 09:20:50 stark systemd[1]: Starting Service for snap application lxd.activate...
Aug 07 09:20:50 stark lxd.activate[529978]: => Starting LXD activation
Aug 07 09:20:50 stark lxd.activate[529978]: ==> Loading snap configuration
Aug 07 09:20:50 stark lxd.activate[529978]: ==> Checking for socket activation support
Aug 07 09:20:50 stark lxd.activate[529978]: ==> Setting LXD socket ownership
Aug 07 09:20:50 stark lxd.activate[529978]: ==> Setting LXD user socket ownership
Aug 07 09:20:50 stark lxd.activate[529978]: ==> Checking if LXD needs to be activated
Aug 07 09:20:50 stark lxd.activate[530046]: time="2024-08-07T09:20:50Z" level=error msg="Unable to run feature checks during QEMU initialization: QEMU premature exit: exit status 1 (qemu-system-x86_64: -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd: Could not open '/usr/share/OVMF/OVMF_CODE.fd': No such file or directory)"
Aug 07 09:20:50 stark lxd.activate[530046]: time="2024-08-07T09:20:50Z" level=warning msg="Instance type not operational" driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
Aug 07 09:20:50 stark lxd.activate[530046]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 07 09:20:50 stark lxd.activate[530046]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x84fd26]
Aug 07 09:20:50 stark lxd.activate[530046]: goroutine 1 [running]:
Aug 07 09:20:50 stark lxd.activate[530046]: database/sql.(*Tx).StmtContext(0xc000592200, {0x1e77978?, 0x309dcc0}, 0x0)
Aug 07 09:20:50 stark lxd.activate[530046]:         /snap/go/10657/src/database/sql/sql.go:2386 +0xc6
Aug 07 09:20:50 stark lxd.activate[530046]: database/sql.(*Tx).Stmt(...)
Aug 07 09:20:50 stark lxd.activate[530046]:         /snap/go/10657/src/database/sql/sql.go:2468
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db/cluster.Stmt(0x41c2e5?, 0x30?)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/cluster/stmt.go:51 +0x54
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db/cluster.GetProjects({0x1e77b38, 0xc000588380}, 0x10?, {0x0, 0x0, 0x45182b?})
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/cluster/projects.mapper.go:133 +0x74
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).InstanceList.func1({0x1e77b38, 0xc000588380}, 0xc000586250)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/instances.go:243 +0x77
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).transaction.func1.1({0x1e77b38?, 0xc000588380?}, 0xc000588380?)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/db.go:371 +0x45
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db/query.Transaction({0x1e77940?, 0x309dcc0?}, 0x0?, 0xc0003f70b8)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/query/transaction.go:29 +0x1a2
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).transaction.func1()
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/db.go:374 +0x6c
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db/query.Retry(0xc0003f7190)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/query/retry.go:28 +0xba
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).retry(0x1d?, 0xc0003f7190)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/db.go:392 +0x45
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).transaction(0xc00051c5c0, {0x1e77940?, 0x309dcc0}, 0xc0003f7460)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/db.go:368 +0x97
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).Transaction(0x6?, {0x1e77940?, 0x309dcc0?}, 0x0?)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/db.go:332 +0xa8
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/db.(*Cluster).InstanceList(0xc0003f7758?, {0x1e77940, 0x309dcc0}, 0xc0003f7750, {0xc0003f7790, 0x1, 0xc00059a270?})
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/db/instances.go:241 +0x333
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/canonical/lxd/lxd/instance.LoadNodeAll(0xc0005bc000, 0xffffffffffffffff)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/instance/instance_utils.go:397 +0x13b
Aug 07 09:20:50 stark lxd.activate[530046]: main.(*cmdActivateifneeded).Run(0x0?, 0x0?, {0x0?, 0x0?, 0x0?})
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/main_activateifneeded.go:118 +0x451
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/spf13/cobra.(*Command).execute(0xc0004ba600, {0x309dcc0, 0x0, 0x0})
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983 +0xabc
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/spf13/cobra.(*Command).ExecuteC(0xc0004ba300)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
Aug 07 09:20:50 stark lxd.activate[530046]: github.com/spf13/cobra.(*Command).Execute(...)
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
Aug 07 09:20:50 stark lxd.activate[530046]: main.main()
Aug 07 09:20:50 stark lxd.activate[530046]:         /build/lxd/parts/lxd/src/lxd/main.go:217 +0x1e05
Aug 07 09:20:50 stark lxd.activate[529978]: ====> Activation check failed, forcing activation
Aug 07 09:20:50 stark systemd[1]: snap.lxd.activate.service: Succeeded.
Aug 07 09:20:50 stark systemd[1]: Finished Service for snap application lxd.activate.
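
For context, the panic above comes from the snap's activation check (the lxd activateifneeded command visible in the trace); it can be re-run by hand and the full trace captured with something like:

sudo systemctl restart snap.lxd.activate.service
sudo journalctl -u snap.lxd.activate.service -b --no-pager | tail -n 80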

Steps to reproduce

  1. Run a machine with LXD installed from the snap's 3.0 track.
  2. Upgrade the Juju agent on the machine from Juju 2.9 to Juju 3, which refreshes the LXD snap to 5.0/stable in the background.
  3. The panic above is raised when snap.lxd.activate.service runs at start time.

tomponline commented 2 months ago

What was your upgrade process?

I believe that upgrading from LXD 3.x to 5.x is not supported and you need to go to 4.0/stable first:

https://discuss.linuxcontainers.org/t/lxd-migrate-allows-migration-from-3-x-to-5-x-and-does-not-warn/14583
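
A sketch of the stepwise path I would expect to work (back up /var/snap/lxd/common/lxd first to be safe):

sudo snap refresh lxd --channel=4.0/stable
# let the 4.0 daemon start and finish its database upgrade before moving on
sudo snap refresh lxd --channel=5.0/stable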

endikap100 commented 2 months ago

Hi, the upgrade was not done intentionally. It happened automatically in the background as part of the Juju agent upgrade of the machines from Juju 2.9 to Juju 3. Is there any way of recovering it, or do I have to remove and reinstall?

tomponline commented 2 months ago

Have you tried refreshing back down to 3.0/stable?

endikap100 commented 2 months ago

It fails.

Refresh output:

sudo snap refresh lxd --channel=3.0/stable
error: cannot perform the following tasks:
- Start snap "lxd" (11348) services (systemctl command [start snap.lxd.activate.service] failed with exit status 1: stderr:
Job for snap.lxd.activate.service failed because the control process exited with error code.
See "systemctl status snap.lxd.activate.service" and "journalctl -xe" for details.)

Journal logs:

Aug 07 10:07:47 stark systemd[1]: Starting Service for snap application lxd.activate...
Aug 07 10:07:47 stark lxd.activate[1441866]: => Starting LXD activation
Aug 07 10:07:47 stark lxd.activate[1441866]: ==> Loading snap configuration
Aug 07 10:07:47 stark lxd.activate[1441866]: ==> Checking for socket activation support
Aug 07 10:07:47 stark lxd.activate[1441866]: ==> Setting LXD socket ownership
Aug 07 10:07:47 stark lxd.activate[1441866]: ==> Checking if LXD needs to be activated
Aug 07 10:07:55 stark lxd.activate[1441925]: Error: Get http://unix.socket/1.0: read unix @->/var/snap/lxd/common/lxd/unix.socket: read: connection reset by peer
Aug 07 10:07:55 stark lxd.activate[1441866]: ====> Activation check failed, forcing activation
Aug 07 10:07:55 stark lxd.activate[1442635]: Job for snap.lxd.daemon.service failed because the control process exited with error code.
Aug 07 10:07:55 stark lxd.activate[1442635]: See "systemctl status snap.lxd.daemon.service" and "journalctl -xe" for details.
Aug 07 10:07:55 stark systemd[1]: snap.lxd.activate.service: Main process exited, code=exited, status=1/FAILURE
Aug 07 10:07:55 stark systemd[1]: snap.lxd.activate.service: Failed with result 'exit-code'.
Aug 07 10:07:55 stark systemd[1]: Failed to start Service for snap application lxd.activate.
Aug 07 10:07:57 stark systemd[1]: Starting Service for snap application lxd.activate...
Aug 07 10:07:57 stark lxd.activate[1442846]: => Starting LXD activation
Aug 07 10:07:57 stark lxd.activate[1442846]: ==> Loading snap configuration
Aug 07 10:07:57 stark lxd.activate[1442846]: ==> Checking for socket activation support
Aug 07 10:07:57 stark lxd.activate[1442846]: ==> Setting LXD socket ownership
Aug 07 10:07:58 stark lxd.activate[1442846]: ==> Setting LXD user socket ownership
Aug 07 10:07:58 stark lxd.activate[1442846]: ==> Checking if LXD needs to be activated
Aug 07 10:07:58 stark lxd.activate[1442921]: time="2024-08-07T10:07:58Z" level=error msg="Unable to run feature checks during QEMU initialization: QEMU premature exit: exit status 1 (qemu-system-x86_64: -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd: Could not open '/usr/share/OVMF/OVMF_CODE.fd': No such file or directory)"
Aug 07 10:07:58 stark lxd.activate[1442921]: time="2024-08-07T10:07:58Z" level=warning msg="Instance type not operational" driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
Aug 07 10:07:58 stark lxd.activate[1442921]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 07 10:07:58 stark lxd.activate[1442921]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x84fd26]
Aug 07 10:07:58 stark lxd.activate[1442921]: goroutine 1 [running]:
Aug 07 10:07:58 stark lxd.activate[1442921]: database/sql.(*Tx).StmtContext(0xc0005ae180, {0x1e77978?, 0x309dcc0}, 0x0)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /snap/go/10657/src/database/sql/sql.go:2386 +0xc6
Aug 07 10:07:58 stark lxd.activate[1442921]: database/sql.(*Tx).Stmt(...)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /snap/go/10657/src/database/sql/sql.go:2468
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db/cluster.Stmt(0x41c2e5?, 0x30?)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/cluster/stmt.go:51 +0x54
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db/cluster.GetProjects({0x1e77b38, 0xc0005c2380}, 0x10?, {0x0, 0x0, 0x45182b?})
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/cluster/projects.mapper.go:133 +0x74
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).InstanceList.func1({0x1e77b38, 0xc0005c2380}, 0xc0005942a0)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/instances.go:243 +0x77
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).transaction.func1.1({0x1e77b38?, 0xc0005c2380?}, 0xc0005c2380?)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/db.go:371 +0x45
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db/query.Transaction({0x1e77940?, 0x309dcc0?}, 0x0?, 0xc0003f70b8)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/query/transaction.go:29 +0x1a2
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).transaction.func1()
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/db.go:374 +0x6c
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db/query.Retry(0xc0003f7190)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/query/retry.go:28 +0xba
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).retry(0x1d?, 0xc0003f7190)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/db.go:392 +0x45
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).transaction(0xc000522000, {0x1e77940?, 0x309dcc0}, 0xc0003f7460)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/db.go:368 +0x97
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).Transaction(0x6?, {0x1e77940?, 0x309dcc0?}, 0x0?)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/db.go:332 +0xa8
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/db.(*Cluster).InstanceList(0xc0003f7758?, {0x1e77940, 0x309dcc0}, 0xc0003f7750, {0xc0003f7790, 0x1, 0xc00058a330?})
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/db/instances.go:241 +0x333
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/canonical/lxd/lxd/instance.LoadNodeAll(0xc0005d6000, 0xffffffffffffffff)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/instance/instance_utils.go:397 +0x13b
Aug 07 10:07:58 stark lxd.activate[1442921]: main.(*cmdActivateifneeded).Run(0x0?, 0x0?, {0x0?, 0x0?, 0x0?})
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/main_activateifneeded.go:118 +0x451
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/spf13/cobra.(*Command).execute(0xc0004b6600, {0x309dcc0, 0x0, 0x0})
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983 +0xabc
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/spf13/cobra.(*Command).ExecuteC(0xc0004b6300)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
Aug 07 10:07:58 stark lxd.activate[1442921]: github.com/spf13/cobra.(*Command).Execute(...)
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
Aug 07 10:07:58 stark lxd.activate[1442921]: main.main()
Aug 07 10:07:58 stark lxd.activate[1442921]:         /build/lxd/parts/lxd/src/lxd/main.go:217 +0x1e05
Aug 07 10:07:58 stark lxd.activate[1442846]: ====> Activation check failed, forcing activation
Aug 07 10:07:58 stark systemd[1]: snap.lxd.activate.service: Succeeded.
Aug 07 10:07:58 stark systemd[1]: Finished Service for snap application lxd.activate.

tomponline commented 2 months ago

Did you try refreshing to 4.0/stable?

endikap100 commented 2 months ago

That worked!!! Many, many thanks!!! But that raises another question: after the refresh to 4.0, the LXD version and snap version are still 5.0.3. Is that the expected result?

endikap100 commented 2 months ago

Maybe Juju upgraded it in the background to 5.0 after it was fixed with the refresh to 4.0. Many thanks for the feedback. I will close the issue, thanks!!!
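
Whether a background refresh happened can be checked from the snap change history, for example:

snap changes lxd       # recent refresh/auto-refresh tasks with timestamps
snap list --all lxd    # active and disabled revisions and their channels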

SimonRichardson commented 2 months ago

Juju will have to prevent a direct upgrade from 3 to 5. If @endikap100 can file a bug about this against Juju, we can prevent it from jumping that far, either by verifying that you're on 4.0 before moving to 5.0, or by moving machines to 4.0 during the migration.

tomponline commented 2 months ago

Maybe Juju upgraded it in the background to 5.0 after it was fixed with the refresh to 4.0.

Yes, possibly. What does snap list show now?

@SimonRichardson is there a way to stop Juju performing LXD upgrades and snap channel switches?

Separately, yesterday there was an internal MM chat about Juju changing the snap channel back to latest/stable after a reboot, after it had been manually set to an LTS series such as 5.21/stable.

Is this expected behavior? latest/stable isn't an LTS series and shouldn't be used for production deployments, so it seems undesirable for Juju to be moving deployments to that channel.

tomponline commented 2 months ago

@SimonRichardson for your consideration https://chat.canonical.com/canonical/pl/izmztc5euir93y4kj6nmo7jg3h

endikap100 commented 2 months ago

@tomponline I did a refresh to 4.0 and then to 5.0, not just to 4.0. Here is the output you requested:

snap list --all
Name    Version             Rev    Tracking       Publisher   Notes
core    16-2.61.4-20240607  17200  latest/stable  canonical✓  core
core20  20240416            2318   latest/stable  canonical✓  base
lxd     4.0.10-e664786      29619  5.0/stable     canonical✓  disabled
lxd     5.0.3-80aeff7       29351  5.0/stable     canonical✓  -
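
The disabled 4.0.10 row is just the previous revision that snapd keeps around; if it is no longer needed, it can be cleaned up (sketch; the revision number is taken from the listing above):

sudo snap remove lxd --revision=29619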

SimonRichardson commented 2 months ago

@tomponline at one point it was set to the LTS; this was changed to follow a fixed track.

https://github.com/juju/juju/blob/3e4a05807e17b85c768497ed893eb6f7ae479d99/environs/config/config.go#L533-L534

The problem is that if it's unset prior to the migration (empty string) and you then perform a migration, it will take on the new value. This ends up forcing an upgrade to whatever the snap channel is set to, which is a pretty rough UX. A possible solution is to prevent migration of the model unless the value is set, or to have Juju insert the current value it's interacting with; that would prevent the value from being forced upon a user.
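
A rough sketch of the operator-side workaround, assuming the model key defined at the linked lines is lxd-snap-channel (not verified here) and using placeholder model and controller names:

# pin the channel explicitly so the migration cannot force a different default onto the model
juju model-config lxd-snap-channel=5.0/stable
juju migrate mymodel target-controller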

tomponline commented 2 months ago

The problem is that if it's unset prior to the migration (empty string) and you then perform a migration, it will take on the new value.

Do you know what could have caused it to keep setting the channel back to latest/stable on a reboot in https://chat.canonical.com/canonical/pl/izmztc5euir93y4kj6nmo7jg3h? The OP suggested it was a default and not something that had been set.

tomponline commented 2 months ago

@SimonRichardson is there a reason Juju is specifying a snap channel at all, and not just going with the default snap channel as specified by LXD via the snap store (which currently defaults to the 5.21 LTS)?

From Noble onwards we also have the lxd-installer package, which triggers installing from the "correct" channel for an OS version when a call to /1.0 is made via LXD's unix socket (or when interacting with the lxc CLI).
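
For example, on a Noble machine with lxd-installer present, the first interaction is enough to pull in the snap (sketch):

dpkg -s lxd-installer   # confirm the installer shim package is present
lxc list                # first use triggers the snap install from the channel chosen for that OS release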

SimonRichardson commented 2 months ago

@tomponline I'll discuss this with the team.

tomponline commented 2 months ago

@tomponline I'll discuss this with the team.

Cheers, appreciate you helping with this. Thanks