Closed: mszywala closed this issue 3 days ago
Any updates on this topic?
It seems that the Shutdown function is hanging here, probably because LXD is not receiving the appropriate hook to finish the operation (neither the stop hook nor the stopns hook). It is still to be determined what the inconsistency with the hooks is.
Took a look at this and found the following in /var/snap/lxd/common/lxd/logs/c1/lxc.log after setting sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon:
lxc c1 20240719091404.652 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /snap/lxd/current/bin/lxd)
lxc c1 20240719091404.652 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /snap/lxd/current/bin/lxd)
lxc c1 20240719091404.652 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /snap/lxd/current/bin/lxd)
lxc c1 20240719091404.652 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /snap/lxd/current/bin/lxd)
lxc c1 20240719091404.653 ERROR conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc c1 20240719091404.653 ERROR start - ../src/src/lxc/start.c:lxc_end:985 - Failed to run lxc.hook.post-stop for container "c1"
Looks like it could be related to the base snap change from 5.0 to 5.21 (core20 to core22).
@simondeziel @mihalicyn BTW I've confirmed this is also an issue going from 5.0/stable to 5.0/edge, so it'll need to be sorted before we put out 5.0.4 with core22.
@tomponline Any chance of a fix here? This effectively blocks users from updating LXD on production machines.
Yep, a good chance. It's on my list as a high priority; we know what the issue is and have a plan to address it.
Working on a fix here https://github.com/canonical/lxd/pull/14182
The issue here is that between LXD 5.0.x and LXD 5.21.0 onwards, the base snap has changed from core20 to core22.
This involved changing the glibc version, and LXD's dependencies have changed versions too.
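For context (a diagnostic sketch, not something from the thread): the glibc symbol versions a binary requires, such as the GLIBC_2.x versions shown in the log excerpt above, can be listed with:

# List the GLIBC version references in the lxd binary's dynamic symbol table.
objdump -T /snap/lxd/current/bin/lxd | grep GLIBC_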
LXD is reloaded when the snap is upgraded and so takes on the new dependencies of the core snap being used, because its LD_LIBRARY_PATH environment variable is updated to point to the new revision.
However, the running containers are not restarted, so their LXC monitor processes keep the old LD_LIBRARY_PATH environment variable.
The problem is that when the running containers are eventually stopped or restarted, they need to notify the LXD daemon that they have stopped so that the host-side clean-up can occur (things like removing the host-side NIC interfaces, unmounting volume(s) and removing unix devices).
The way this normally works is that LXD receives a notification on its local unix socket from the LXC monitor process: when LXD starts the container, it instructs the monitor process to call lxd callhook ... {stopns|stop}, which in turn connects to the LXD unix socket to signal that the container is stopping.
LXD then takes that notification, creates/continues a stop operation, performs the host-side cleanup and when done completes the stop operation.
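For illustration, the stop-side hooks the monitor ends up executing are of this form (paths and container name match the log excerpt above):

/snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stopns
/snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop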
Unfortunately, due to the base snap change from core20 to core22, the lxd callhook command fails to run because it cannot fulfil its dynamic dependencies: the core22 libs are not compatible with the staged libs it finds via the old LD_LIBRARY_PATH env var it was started with.
The long-term fix for this is to update the LXD snap so that it doesn't start LXD (and the LXC monitor process) with an LD_LIBRARY_PATH containing the specific snap revision, but instead uses the /snap/lxd/current prefix so that we always use the current snap's libraries (both core and staged).
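Illustratively, the change is from a revision-pinned library path to one that follows the current symlink (the revision number and arch directory below are made up for the example):

# Old behaviour: the monitor keeps the LD_LIBRARY_PATH it was started with,
# pinned to a specific snap revision (revision number is hypothetical).
LD_LIBRARY_PATH="/snap/lxd/12345/lib:/snap/lxd/12345/lib/x86_64-linux-gnu"

# Long-term fix: resolve through the 'current' symlink so the active revision's
# core and staged libraries are always used.
LD_LIBRARY_PATH="/snap/lxd/current/lib:/snap/lxd/current/lib/x86_64-linux-gnu"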
Unfortunately this cannot be changed for running containers, so a workaround was attempted by creating a small two-line wrapper script called lxd-lxc-stop that simply forces LD_LIBRARY_PATH to the /snap/lxd/current prefix and then runs LXD from /snap/lxd/current/bin/lxd. This works even when using the old environment variables from the LXD 5.0 snap.
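Conceptually the wrapper amounts to something like this (a sketch; the real script is in the PR linked above, and the arch directory is only an example):

#!/bin/sh
# Force library lookup through the 'current' symlink, then run the current lxd binary.
export LD_LIBRARY_PATH="/snap/lxd/current/lib:/snap/lxd/current/lib/x86_64-linux-gnu"
exec /snap/lxd/current/bin/lxd "$@"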
The intention was then to add a start-time patch to LXD to update the stop hook commands of any running containers to use the new wrapper script.
Unfortunately it was discovered by @mihalicyn that liblxc does not allow updating the stop hook paths after the container has started, meaning that we cannot instruct the running containers to use the new wrapper script when they are eventually stopped.
We will need to think of another solution.
This is a problem because we are also planning to change the base snap of LXD 5.0.x to core22 which would then introduce the same issue for a point release upgrade.
I think one possible solution is to overmount /snap/lxd/current/bin/lxd during snap refresh with the shell script you posted in https://github.com/canonical/lxd/pull/14182. That's what we have in the generated LXC container config:
lxc.hook.pre-start = /proc/55526/exe callhook /var/snap/lxd/common/lxd "default" "j" start
lxc.hook.stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "j" stopns
lxc.hook.post-stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "j" stop
Overmount should work pretty well.
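A minimal sketch of the overmount idea, assuming the wrapper ships inside the snap (the wrapper path below is hypothetical):

# Bind-mount the wrapper over the binary that the stop hooks of already-running
# containers will exec, so they get a working entry point after the refresh.
mount --bind /snap/lxd/current/bin/lxd-lxc-stop /snap/lxd/current/bin/lxd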
Edit: or even simpler. We can put the LXD binary in, let's say, bin/_lxd and a shell script in bin/lxd with the following content (not tested):
#!/bin/sh
# Override LD_LIBRARY_PATH only for the callhook subcommand.
# ARCH is expected to hold the multiarch triplet (e.g. x86_64-linux-gnu).
if [ "$1" = "callhook" ]; then
    export LD_LIBRARY_PATH="/snap/lxd/current/lib:/snap/lxd/current/lib/${ARCH}"
fi
exec /snap/lxd/current/bin/_lxd "$@"
Good idea, I'll try that. I might just remove the callhook conditional entirely and force the same set of LD_LIBRARY_PATHs for all invocations.
I prefer that idea to overmounting as I would like to avoid introducing additional mounts inside the snap mount namespace if I can avoid it!
Yeah, but then LD_LIBRARY_PATH has to be specified the same way we do it in https://github.com/canonical/lxd-pkg-snap/blob/74dd6f841241bc4de268b1e082c46aab4c40c40f/snapcraft/commands/daemon.start#L27 to cover all the cases, like with ZFS/no ZFS, with ceph, with the GPU snap and so on.
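Roughly along these lines, mirroring the pattern in daemon.start (the optional directories below are illustrative, not the exact list):

# Base libraries, always via the 'current' symlink (ARCH is the multiarch triplet).
LD_LIBRARY_PATH="/snap/lxd/current/lib:/snap/lxd/current/lib/${ARCH}"
# Append optional components only when they are actually staged (ZFS, ceph, GPU snap, ...).
for extra in /snap/lxd/current/zfs-2.2/lib /snap/lxd/current/ceph/lib /var/lib/snapd/lib/gl; do
    [ -d "${extra}" ] && LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${extra}"
done
export LD_LIBRARY_PATH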
Sure, because if we did the overmounting we would have to do a proper clean-up during further refreshes. Additional complexity.
The fix worked but introduced other issues; we need to revisit.
The fix for this is in LXD 6.2 currently.
Required information
Output of "lxc info"
Issue description
We are in the process of upgrading our servers from LXD 5.0.3 to LXD 5.21.1 LTS. I experienced an issue where the stop and restart container operations hang after the refresh. I only encountered the issue for the first stop/restart after
snap refresh lxd --channel=5.21/stable
It does not seem to make a difference whether -f is used to stop or restart the instance. After a while, the stop attempt fails with the following error:
Additionally, we are using a routed nic for our container instances.
Because the instance is not stopping properly, there are remaining routes, the veth device of the instance is in the state "LOWERLAYERDOWN", and the instance refuses to start back up. When we do manage to shut down an instance, we have to take down its veth interface with ip link set vethxxxx down and also clean up the remaining routes on the host system.
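For reference, the manual clean-up looks roughly like this (interface name and address are placeholders, not the real values):

# Take down the orphaned host-side veth interface (name is a placeholder).
ip link set vethXXXXXXXX down
# Remove the stale route left behind for the routed NIC (address is a placeholder).
ip route del 192.0.2.10 dev vethXXXXXXXX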
Expectation: lxc stop and lxc restart work properly after a refresh.
Steps to reproduce
snap refresh lxd --channel=5.21/stable
lxc stop <containerName>
Information to attach
dmesg: nothing relevant here
lxc info NAME --show-log:
Resources:
  Processes: 0
  CPU usage:
    CPU usage (in seconds): 0
Log:
lxc u 20240415120709.358 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc u 20240415120709.358 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc u 20240415120709.359 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc u 20240415120709.359 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc u 20240415120831.600 ERROR conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc u 20240415120831.601 ERROR start - ../src/src/lxc/start.c:lxc_end:944 - Failed to run "lxc.hook.stop" hook
lxc u 20240415120832.126 ERROR conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc u 20240415120832.126 ERROR start - ../src/src/lxc/start.c:lxc_end:985 - Failed to run lxc.hook.post-stop for container "u"
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20240319)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20240319"
  image.type: squashfs
  image.version: "22.04"
  volatile.base_image: ca08f2f359079cdb7fb2d4efa882df5974304d7ae9bb13cbac4904a1347626be
  volatile.cloud-init.instance-id: fd955459-ecfc-4d00-932e-896aa690f8f5
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 72c6dd51-b44c-42ef-ac72-4064385a644e
  volatile.uuid.generation: 72c6dd51-b44c-42ef-ac72-4064385a644e
devices:
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
location: none metadata: context: ip: '@' method: GET protocol: unix url: /1.0 username: root level: debug message: Handling API request timestamp: "2024-04-15T12:08:28.290919569Z" type: logging
location: none metadata: context: ip: '@' method: GET protocol: unix url: /1.0/events username: root level: debug message: Handling API request timestamp: "2024-04-15T12:08:28.292459798Z" type: logging
location: none metadata: context: id: 7206ca2e-6647-493d-acb0-9f2b739ad80f local: /var/snap/lxd/common/lxd/unix.socket remote: '@' level: debug message: Event listener server handler started timestamp: "2024-04-15T12:08:28.292742589Z" type: logging
location: none metadata: context: ip: '@' method: PUT protocol: unix url: /1.0/instances/u/state username: root level: debug message: Handling API request timestamp: "2024-04-15T12:08:28.293482232Z" type: logging
location: none metadata: class: task created_at: "2024-04-15T12:08:28.29445839Z" description: Stopping instance err: "" id: f74b35a7-28f7-4da7-a3a2-945f79b9d29a location: none may_cancel: false metadata: null resources: instances:
location: none metadata: context: class: task description: Stopping instance operation: f74b35a7-28f7-4da7-a3a2-945f79b9d29a project: default level: debug message: New operation timestamp: "2024-04-15T12:08:28.296017929Z" type: logging
location: none metadata: class: task created_at: "2024-04-15T12:08:28.29445839Z" description: Stopping instance err: "" id: f74b35a7-28f7-4da7-a3a2-945f79b9d29a location: none may_cancel: false metadata: null resources: instances:
location: none metadata: context: class: task description: Stopping instance operation: f74b35a7-28f7-4da7-a3a2-945f79b9d29a project: default level: debug message: Started operation timestamp: "2024-04-15T12:08:28.29604927Z" type: logging
location: none metadata: context: ip: '@' method: GET protocol: unix url: /1.0/operations/f74b35a7-28f7-4da7-a3a2-945f79b9d29a username: root level: debug message: Handling API request timestamp: "2024-04-15T12:08:28.297029355Z" type: logging
location: none metadata: context: instance: u instanceType: container project: default timeout: 10m0s level: debug message: Shutdown started timestamp: "2024-04-15T12:08:28.305543127Z" type: logging
location: none metadata: context: action: shutdown created: 2024-04-15 09:15:51.927293086 +0000 UTC ephemeral: "false" instance: u instanceType: container project: default timeout: 10m0s used: 2024-04-15 12:07:09.356193457 +0000 UTC level: info message: Shutting down instance timestamp: "2024-04-15T12:08:28.305832124Z" type: logging
location: none metadata: context: action: stop instance: u project: default reusable: "true" level: debug message: Instance operation lock created timestamp: "2024-04-15T12:08:28.305813353Z" type: logging
location: none metadata: context: instance: u instanceType: container project: default level: debug message: Shutdown request sent to instance timestamp: "2024-04-15T12:08:28.306351356Z" type: logging
location: none metadata: context: ip: '@' method: GET protocol: unix url: /1.0 username: root level: debug message: Handling API request timestamp: "2024-04-15T12:10:38.268495437Z" type: logging
location: none metadata: context: ip: '@' method: GET protocol: unix url: /1.0/instances/u?recursion=1 username: root level: debug message: Handling API request timestamp: "2024-04-15T12:10:38.269666709Z" type: logging
location: none metadata: context: driver: dir instance: u pool: default project: default level: debug message: GetInstanceUsage started timestamp: "2024-04-15T12:10:38.275039775Z" type: logging
location: none metadata: context: driver: dir instance: u pool: default project: default level: debug message: GetInstanceUsage finished timestamp: "2024-04-15T12:10:38.275631324Z" type: logging
location: none metadata: context: ip: '@' method: GET protocol: unix url: /1.0/instances/u/logs/lxc.log username: root level: debug message: Handling API request timestamp: "2024-04-15T12:10:38.294387288Z" type: logging
location: none metadata: action: instance-log-retrieved requestor: address: '@' protocol: unix username: root source: /1.0/instances/u/backups/lxc.log project: default timestamp: "2024-04-15T12:10:38.295641303Z" type: lifecycle
location: none metadata: class: task created_at: "2024-04-15T12:08:28.29445839Z" description: Stopping instance err: 'Failed shutting down instance, status is "Running": context deadline exceeded' id: f74b35a7-28f7-4da7-a3a2-945f79b9d29a location: none may_cancel: false metadata: null resources: instances:
location: none metadata: context: instance: u instanceType: container project: default timeout: 10m0s level: debug message: Shutdown finished timestamp: "2024-04-15T12:18:28.307150063Z" type: logging
location: none metadata: context: class: task description: Stopping instance err: 'Failed shutting down instance, status is "Running": context deadline exceeded' operation: f74b35a7-28f7-4da7-a3a2-945f79b9d29a project: default level: debug message: Failure for operation timestamp: "2024-04-15T12:18:28.307227229Z" type: logging
location: none metadata: context: listener: 7206ca2e-6647-493d-acb0-9f2b739ad80f local: /var/snap/lxd/common/lxd/unix.socket remote: '@' level: debug message: Event listener server handler stopped timestamp: "2024-04-15T12:18:28.312781582Z" type: logging