canonical / lxd

Cannot launch armhf containers on arm64 host under noble #13512

Closed simondeziel closed 1 month ago

simondeziel commented 1 month ago

Originally reported at https://bugs.launchpad.net/ubuntu/+source/lxd/+bug/2062176. It's still not clear if it's an issue with the kernel or LXD itself but since LXD silently fails to bring the armhf container up, there might be something we can improve on the LXD side.

Steps to reproduce

  1. Get an arm64 machine (tf-reserve rpi4b8g https://cdimage.ubuntu.com/ubuntu-server/noble/daily-preinstalled/current/noble-preinstalled-server-arm64+raspi.img.xz will do)
  2. snap install lxd --channel 5.21/stable/ubuntu-24.04
  3. lxd init --auto
  4. lxc launch ubuntu-daily:n nobletest-native (should work)
  5. lxc launch ubuntu-daily:n/armhf nobletest (should fail silently)

root@ubuntu:~# lxc launch ubuntu-daily:n nobletest-native
Creating nobletest-native
Starting nobletest-native 
root@ubuntu:~# lxc launch ubuntu-daily:n/armhf nobletest
Creating nobletest
Starting nobletest 

root@ubuntu:~# lxc ls
+------------------+---------+---------------------+-----------------------------------------------+-----------+-----------+
|       NAME       |  STATE  |        IPV4         |                     IPV6                      |   TYPE    | SNAPSHOTS |
+------------------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| nobletest        | STOPPED |                     |                                               | CONTAINER | 0         |
+------------------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| nobletest-native | RUNNING | 10.178.102.5 (eth0) | fd42:15d8:f210:8738:216:3eff:fe52:7025 (eth0) | CONTAINER | 0         |
+------------------+---------+---------------------+-----------------------------------------------+-----------+-----------+

root@ubuntu:~# lxc info nobletest --show-log
Name: nobletest
Status: STOPPED
Type: container
Architecture: armv7l
Created: 2024/05/28 17:28 UTC
Last Used: 2024/05/28 17:28 UTC

Log:

lxc nobletest 20240528172840.614 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:165 - newuidmap binary is missing
lxc nobletest 20240528172840.617 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:171 - newgidmap binary is missing
lxc nobletest 20240528172840.659 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:165 - newuidmap binary is missing
lxc nobletest 20240528172840.662 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:171 - newgidmap binary is missing

Trying to start the armhf instance also fails very early on:

root@ubuntu:~# lxc start --console nobletest
To detach from the console, press: <ctrl>+a q
Error: write /dev/pts/ptmx: file already closed

root@ubuntu:~# lxc start --debug nobletest
...
DEBUG  [2024-05-28T17:32:18Z] Sending request to LXD                        etag= method=PUT url="http://unix.socket/1.0/instances/nobletest/state"
DEBUG  [2024-05-28T17:32:18Z] 
    {
        "action": "start",
        "timeout": 0,
        "force": false,
        "stateful": false
    } 
DEBUG  [2024-05-28T17:32:18Z] Got operation from LXD                       
DEBUG  [2024-05-28T17:32:18Z] 
    {
        "id": "76558225-cb46-41d0-ba9e-d2677daee561",
        "class": "task",
        "description": "Starting instance",
        "created_at": "2024-05-28T17:32:18.180377726Z",
        "updated_at": "2024-05-28T17:32:18.180377726Z",
        "status": "Running",
        "status_code": 103,
        "resources": {
            "instances": [
                "/1.0/instances/nobletest"
            ]
        },
        "metadata": null,
        "may_cancel": false,
        "err": "",
        "location": "none"
    } 
DEBUG  [2024-05-28T17:32:18Z] Sending request to LXD                        etag= method=GET url="http://unix.socket/1.0/operations/76558225-cb46-41d0-ba9e-d2677daee561"
DEBUG  [2024-05-28T17:32:18Z] Got response struct from LXD                 
DEBUG  [2024-05-28T17:32:18Z] 
    {
        "id": "76558225-cb46-41d0-ba9e-d2677daee561",
        "class": "task",
        "description": "Starting instance",
        "created_at": "2024-05-28T17:32:18.180377726Z",
        "updated_at": "2024-05-28T17:32:18.180377726Z",
        "status": "Running",
        "status_code": 103,
        "resources": {
            "instances": [
                "/1.0/instances/nobletest"
            ]
        },
        "metadata": null,
        "may_cancel": false,
        "err": "",
        "location": "none"
    } 

and there is nothing obvious in the logs:

May 28 17:31:17 ubuntu systemd[1]: Started snap.lxd.lxc-ebf5640d-2266-4451-8cf1-bca181a07096.scope.
May 28 17:31:18 ubuntu systemd-networkd[837]: veth458a0fe0: Link UP
May 28 17:31:18 ubuntu kernel: lxdbr0: port 2(veth458a0fe0) entered blocking state
May 28 17:31:18 ubuntu kernel: lxdbr0: port 2(veth458a0fe0) entered disabled state
May 28 17:31:18 ubuntu kernel: veth458a0fe0: entered allmulticast mode
May 28 17:31:18 ubuntu kernel: veth458a0fe0: entered promiscuous mode
May 28 17:31:18 ubuntu kernel: audit: type=1400 audit(1716917478.411:530): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-nobletest_</var/snap/lxd/common/lxd>" pid=7950 comm="apparmor_parser"
May 28 17:31:18 ubuntu kernel: physlwAGNe: renamed from veth547316e9
May 28 17:31:18 ubuntu kernel: eth0: renamed from physlwAGNe
May 28 17:31:18 ubuntu systemd-networkd[837]: veth458a0fe0: Gained carrier
May 28 17:31:18 ubuntu kernel: lxdbr0: port 2(veth458a0fe0) entered blocking state
May 28 17:31:18 ubuntu kernel: lxdbr0: port 2(veth458a0fe0) entered forwarding state
May 28 17:31:19 ubuntu kernel: lxdbr0: port 2(veth458a0fe0) entered disabled state
May 28 17:31:19 ubuntu kernel: veth547316e9: renamed from physlwAGNe
May 28 17:31:19 ubuntu systemd-networkd[837]: veth458a0fe0: Lost carrier
May 28 17:31:19 ubuntu systemd-networkd[837]: physlwAGNe: Interface name change detected, renamed to veth547316e9.
May 28 17:31:19 ubuntu kernel: veth458a0fe0: left allmulticast mode
May 28 17:31:19 ubuntu kernel: veth458a0fe0: left promiscuous mode
May 28 17:31:19 ubuntu kernel: lxdbr0: port 2(veth458a0fe0) entered disabled state
May 28 17:31:19 ubuntu systemd-networkd[837]: veth458a0fe0: Link UP
May 28 17:31:19 ubuntu systemd-networkd[837]: veth458a0fe0: Link DOWN
May 28 17:31:20 ubuntu systemd[1]: snap.lxd.lxc-ebf5640d-2266-4451-8cf1-bca181a07096.scope: Deactivated successfully.
May 28 17:31:20 ubuntu kernel: audit: type=1400 audit(1716917480.645:531): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-nobletest_</var/snap/lxd/common/lxd>" pid=8015 comm="apparmor_parser"

Additional information

root@ubuntu:~# snap list lxd core22
Name    Version         Rev    Tracking       Publisher   Notes
core22  20240408        1383   latest/stable  canonical✓  base
lxd     5.21.1-d46c406  28474  5.21/stable/…  canonical✓  -

root@ubuntu:~# uname -a
Linux ubuntu 6.8.0-1004-raspi #4-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 02:29:55 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

simondeziel commented 1 month ago

Same results with LXD 5.0:

root@ubuntu:~# snap list lxd core20
Name    Version        Rev    Tracking       Publisher   Notes
core20  20240416       2321   latest/stable  canonical✓  base
lxd     5.0.3-d921d2e  28384  5.0/stable     canonical✓  -

Installing linux-image-raspi from noble-proposed (6.8.0-1005.5) didn't help with LXD 5.0/stable, 5.21/stable, or latest/edge.

simondeziel commented 1 month ago

@mihalicyn I can confirm the original findings that on Mantic it works just fine.

root@ubuntu:~# uname -a
Linux ubuntu 6.5.0-1017-raspi #20-Ubuntu SMP PREEMPT_DYNAMIC Sat May  4 09:13:15 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@ubuntu:~# snap list lxd
Name  Version         Rev    Tracking     Publisher   Notes
lxd   5.21.1-d46c406  28474  5.21/stable  canonical✓  -

root@ubuntu:~# lxc launch ubuntu-daily:n/armhf nobletest
Creating nobletest
Starting nobletest                            
root@ubuntu:~# lxc ls
+-----------+---------+----------------------+----------------------------------------------+-----------+-----------+
|   NAME    |  STATE  |         IPV4         |                     IPV6                     |   TYPE    | SNAPSHOTS |
+-----------+---------+----------------------+----------------------------------------------+-----------+-----------+
| nobletest | RUNNING | 10.17.188.166 (eth0) | fd42:30b9:bcd8:ad3:216:3eff:febb:9567 (eth0) | CONTAINER | 0         |
+-----------+---------+----------------------+----------------------------------------------+-----------+-----------+

simondeziel commented 1 month ago

I took the Mantic rpi and do-release-upgraded it to Noble, and the problem reproduced. Booting that Noble install with Mantic's kernel made it work again.


root@ubuntu:~# lsb_release -sd
No LSB modules are available.
Ubuntu 24.04 LTS
root@ubuntu:~# uname -a
Linux ubuntu 6.5.0-1017-raspi #20-Ubuntu SMP PREEMPT_DYNAMIC Sat May  4 09:13:15 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

root@ubuntu:~# snap list
Name    Version         Rev    Tracking       Publisher   Notes
core22  20240408        1383   latest/stable  canonical✓  base
lxd     5.21.1-d46c406  28474  5.21/stable    canonical✓  -
snapd   2.63            21761  latest/stable  canonical✓  snapd

root@ubuntu:~# lxc ls
+-----------+---------+----------------------+----------------------------------------------+-----------+-----------+
|   NAME    |  STATE  |         IPV4         |                     IPV6                     |   TYPE    | SNAPSHOTS |
+-----------+---------+----------------------+----------------------------------------------+-----------+-----------+
| nobletest | RUNNING | 10.17.188.166 (eth0) | fd42:30b9:bcd8:ad3:216:3eff:febb:9567 (eth0) | CONTAINER | 0         |
+-----------+---------+----------------------+----------------------------------------------+-----------+-----------+

mihalicyn commented 1 month ago

This is the reason: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux-raspi/+git/noble/tree/debian.raspi/config/annotations?h=master-next#n155 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2038582

Minimal reproducer:

# cat test.c
#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/futex.h>

#define futex(A, B, C, D, E, F) syscall(__NR_futex, A, B, C, D, E, F)

int main(int argc, char **argv)
{
 unsigned int addr = 0;
 long ret;

 ret = futex(&addr, FUTEX_WAKE, 1, NULL, NULL, 0);
 if (ret) {
  printf("Error! %s", strerror(errno));
  exit(1);
 }

 printf("OK!\n");
 return 0;
}
# uname -a
Linux ubuntu 6.8.0-1004-raspi #4-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 02:29:55 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

$ arm-linux-gnueabihf-gcc -static test.c
$ strace -f /usr/arm-linux-gnueabihf/lib/ld-linux-armhf.so.3 ./a.out

futex(0xff83679c, FUTEX_WAKE, 1) = -1 ENOSYS (Function not implemented)
statx(1, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFCHR|0620, stx_size=0, ...}) = 0
write(1, "Error! Function not implemented", 31Error! Function not implemented) = 31
exit_group(1) = ?
+++ exited with 1 +++

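The 32-bit futex syscall used by this code is served by futex_time32, which the kernel only builds when CONFIG_COMPAT_32BIT_TIME is enabled: https://github.com/torvalds/linux/blob/4a4be1ad3a6efea16c56615f31117590fd881358/kernel/futex/syscalls.c#L492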
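
To tell the futex entry points apart on a given kernel, a variant of the reproducer could probe each syscall number separately. This is only a sketch under assumptions: probe_futex() and probe_futex_time64() are hypothetical helper names that are not part of the original test, and __NR_futex_time64 is only defined by the headers when building for a 32-bit ABI such as armhf.

```c
#define _GNU_SOURCE

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/futex.h>

/*
 * Issue a harmless FUTEX_WAKE (no waiters exist) through the given
 * syscall number. Returns 0 if the kernel accepted the call, or the
 * errno value (for example ENOSYS) if it did not.
 */
int probe_futex(long nr)
{
	uint32_t word = 0;
	long ret = syscall(nr, &word, FUTEX_WAKE, 1, NULL, NULL, 0);

	return ret < 0 ? errno : 0;
}

/*
 * Probe the 64-bit-time futex variant. Returns -1 when the headers do
 * not define __NR_futex_time64, which is the case on 64-bit ABIs where
 * __NR_futex already takes a 64-bit timespec.
 */
int probe_futex_time64(void)
{
#ifdef __NR_futex_time64
	return probe_futex(__NR_futex_time64);
#else
	return -1;
#endif
}
```

Built with arm-linux-gnueabihf-gcc and run on the affected 6.8.0-1004-raspi kernel, probe_futex(__NR_futex) would be expected to return ENOSYS, matching the strace output above.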

mihalicyn commented 1 month ago

Let's close it as it's not an LXD regression. This was reported back to the original issue on Launchpad (https://bugs.launchpad.net/ubuntu/+source/lxd/+bug/2062176).

cc @tomponline

tomponline commented 1 month ago

Thanks @mihalicyn, I concur.

juergh commented 1 month ago

Well, when are you planning to move away from 32bit time? Why not use __NR_futex_time64?

simondeziel commented 1 month ago

> Well, when are you planning to move away from 32bit time? Why not use __NR_futex_time64?

@juergh, assuming I understood everything right, this is not something LXD itself controls. It's the armhf instance that depends on the kernel supporting this syscall. Here we have an arm64 kernel trying to start an armhf userspace.

So the question would then be: can the arm64 kernel have COMPAT_32BIT_TIME=y? That would essentially be undoing https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2038582
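
One way to check whether a given kernel was built with that option is to look it up in the kernel's config file. The sketch below is illustrative only: line_enables() and running_kernel_has() are hypothetical helper names, and the /boot/config-<release> location is an assumption that holds on Ubuntu installs but may not hold elsewhere.

```c
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Return 1 if `line` is exactly "<option>=y", with or without a newline. */
int line_enables(const char *line, const char *option)
{
	size_t n = strlen(option);

	if (strncmp(line, option, n) != 0)
		return 0;
	return strcmp(line + n, "=y\n") == 0 || strcmp(line + n, "=y") == 0;
}

/*
 * Look up `option` in the running kernel's config file.
 * Returns 1 if it is set to y, 0 if it is not, and -1 if the file
 * cannot be read. Assumes the config lives at /boot/config-<release>.
 */
int running_kernel_has(const char *option)
{
	struct utsname u;
	char path[512];
	char line[512];
	FILE *f;
	int found = 0;

	if (uname(&u) != 0)
		return -1;
	snprintf(path, sizeof(path), "/boot/config-%s", u.release);

	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (line_enables(line, option)) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

Calling running_kernel_has("CONFIG_COMPAT_32BIT_TIME") on the host would then report whether the 32-bit time syscalls are compiled in.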

mihalicyn commented 1 month ago

> Well, when are you planning to move away from 32bit time?

armhf is a 32-bit architecture. If you launch an armhf container, you are choosing to run binaries designed for a 32-bit processor (ARMv7).

If you want to move to 64-bit, you need to use lxc launch ubuntu-daily:n nobletest (64-bit userspace) instead of lxc launch ubuntu-daily:n/armhf nobletest (32-bit userspace).

> Why not use __NR_futex_time64?

It's not something that LXD controls. When you compile a 32-bit binary (armhf) that uses futex, glibc/pthread/etc. will use __NR_futex and not __NR_futex_time64.

> So the question would then be: can the arm64 kernel have COMPAT_32BIT_TIME=y?

Of course it can. That's why the option has the COMPAT prefix.

And this change is absolutely safe: it won't force other 64-bit applications (built for arm64) to go back to 32-bit time.

simondeziel commented 1 week ago

An updated kernel (linux-image-6.8.0-1007-raspi) with the needed config enabled just landed in noble-proposed and was marked as verified in https://bugs.launchpad.net/ubuntu/+source/lxd/+bug/2062176. It should only be a matter of time before it lands in noble-updates.