canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd

Can't start more than 7 instances of the same local base image I have published #11167

Closed omani closed 1 year ago

omani commented 1 year ago

when I do lxc start instance08 I get no output so I assume it worked. but lxc ls instance08 shows state STOPPED. no logs, nothing.

any hints?

omani commented 1 year ago
$ lxc ls
+--------+---------+------------------------------+------+-----------+-----------+
|  NAME  |  STATE  |             IPV4             | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+------------------------------+------+-----------+-----------+
| gala01 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.89 (eth0)           |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala02 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.75 (eth0)           |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala03 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.138 (eth0)          |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala04 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.102 (eth0)          |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala05 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.221 (eth0)          |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala06 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.87 (eth0)           |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala07 | RUNNING | 172.18.0.1 (br-14fa7068c6d1) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.226.7.151 (eth0)          |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala08 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
$
$ lxc image ls
+-----------+--------------+--------+-------------------------------------+--------------+-----------+-----------+------------------------------+
|   ALIAS   | FINGERPRINT  | PUBLIC |             DESCRIPTION             | ARCHITECTURE |   TYPE    |   SIZE    |         UPLOAD DATE          |
+-----------+--------------+--------+-------------------------------------+--------------+-----------+-----------+------------------------------+
| gala-node | 778e1e41211d | yes    | Ubuntu jammy amd64 (20221127_07:42) | x86_64       | CONTAINER | 1243.23MB | Nov 28, 2022 at 6:13pm (UTC) |
+-----------+--------------+--------+-------------------------------------+--------------+-----------+-----------+------------------------------+
|           | 9b2c8b8a1870 | no     | Alpine 3.16 amd64 (20221127_13:00)  | x86_64       | CONTAINER | 2.50MB    | Nov 28, 2022 at 4:06pm (UTC) |
+-----------+--------------+--------+-------------------------------------+--------------+-----------+-----------+------------------------------+
|           | 309874cb3bac | no     | Ubuntu jammy amd64 (20221127_07:42) | x86_64       | CONTAINER | 114.77MB  | Nov 28, 2022 at 3:18pm (UTC) |
+-----------+--------------+--------+-------------------------------------+--------------+-----------+-----------+------------------------------+
$
$ lxc launch gala-node test
Creating test
Retrieving image: Unpack: 33% (6.46MB/s) <- I pasted this too, so you know it is unpacking.
Starting test
$
$ lxc ls test
+------+---------+------+------+-----------+-----------+
| NAME |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+------+---------+------+------+-----------+-----------+
| test | STOPPED |      |      | CONTAINER | 0         |
+------+---------+------+------+-----------+-----------+
$
$ lxc info --show-log test
Name: test
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/11/29 00:00 +03
Last Used: 2022/11/29 00:03 +03

Log:

lxc gala_test 20221128210301.852 WARN     cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_setup_limits_legacy:3147 - Invalid argument - Ignoring legacy cgroup limits on pure cgroup2 system
lxc gala_test 20221128210301.954 WARN     cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_setup_limits_legacy:3147 - Invalid argument - Ignoring legacy cgroup limits on pure cgroup2 system
lxc 20221128210302.471 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20221128210302.471 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_state"

^ these warnings are normal. they also happen on other running nodes.

I would expect some logs, or some error message why the container could not start.

again. let me start this container:

$ lxc start test
$ lxc ls test
+------+---------+------+------+-----------+-----------+
| NAME |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+------+---------+------+------+-----------+-----------+
| test | STOPPED |      |      | CONTAINER | 0         |
+------+---------+------+------+-----------+-----------+

nothing. since lxc start test does not output anything, I expect that it worked (unix philosophy: don't output anything on success).

but this is confusing.

omani commented 1 year ago
$ lxc version
Client version: 5.5
Server version: 5.5
tomponline commented 1 year ago

Version 5.5 isn't supported, please can you confirm this still occurs with LXD 5.8 and we can reopen. Thanks

tomponline commented 1 year ago

Also please look for any errors in dmesg and syslog that may indicate the problem. You may be hitting a sysctl limit.
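
One way to act on the sysctl hint is to print the kernel limits that are commonly tuned on hosts running many containers. This is a minimal sketch: the thread does not say which limit (if any) is involved, so the key list below is an assumption, and `key_to_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: map a sysctl key to its /proc/sys path
# (valid for keys whose components contain no dots themselves).
key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Assumed list of limits worth checking; not taken from the thread.
for key in \
    fs.inotify.max_user_instances \
    fs.inotify.max_user_watches \
    fs.inotify.max_queued_events \
    kernel.keys.maxkeys \
    kernel.keys.maxbytes \
    vm.max_map_count \
    fs.file-max
do
    path=$(key_to_path "$key")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$key" "$(cat "$path")"
    else
        printf '%s = (not readable)\n' "$key"
    fi
done
```

Comparing these values between a failing host and a working one, together with `dmesg` output captured right after a failed `lxc start`, may help narrow things down.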

omani commented 1 year ago

> Version 5.5 isn't supported, please can you confirm this still occurs with LXD 5.8 and we can reopen. Thanks

5.8 is not available on Alpine Linux yet. I'm on Alpine 3.17. only the 5.0.1 (lxd) or 5.5 (lxd-feature) packages are available.

not even edge or testing has it. have to wait. or maybe I will build it from source.

tomponline commented 1 year ago

Ok. We generally ask that support issues (especially those to do with specific environments) be posted at https://discuss.linuxcontainers.org/ first, and we can always promote them to GitHub if it is a bug.

Please can you post there? Alpine also carries LXD 5.0.1 LTS, so you could try that as well; it would give another data point. That version is also supported.

omani commented 1 year ago

I've built LXD release 5.8 from source on Alpine Linux 3.17.

I encounter the same problem:

$ lxc ls
+--------+---------+------------------------------+------+-----------+-----------+
|  NAME  |  STATE  |             IPV4             | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+------------------------------+------+-----------+-----------+
| gala01 | RUNNING | 172.18.0.1 (br-fee7b61d8331) |      | CONTAINER | 0         |
|        |         | 172.17.0.1 (docker0)         |      |           |           |
|        |         | 10.245.2.243 (eth0)          |      |           |           |
+--------+---------+------------------------------+------+-----------+-----------+
| gala02 | RUNNING | 10.245.2.196 (eth0)          |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala03 | RUNNING | 10.245.2.158 (eth0)          |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala04 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala05 | RUNNING | 10.245.2.37 (eth0)           |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala06 | RUNNING | 10.245.2.240 (eth0)          |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala07 | RUNNING | 10.245.2.45 (eth0)           |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala08 | RUNNING | 10.245.2.244 (eth0)          |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala09 | RUNNING | 10.245.2.67 (eth0)           |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala10 | RUNNING | 10.245.2.27 (eth0)           |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala11 | RUNNING | 10.245.2.10 (eth0)           |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala12 | RUNNING | 10.245.2.249 (eth0)          |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala13 | RUNNING | 10.245.2.247 (eth0)          |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala14 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala15 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala16 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala17 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala18 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala19 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+
| gala20 | STOPPED |                              |      | CONTAINER | 0         |
+--------+---------+------------------------------+------+-----------+-----------+

gala01 has been deployed with docker etc. hence the IPs. but all other nodes are pure ubuntu nodes. nothing done on them. just launched and nothing else.

I've started the nodes with this simple for loop:

for i in `seq -w 2 20`; do lxc launch images:ubuntu/22.04 gala$i; done

many nodes are just in STOPPED state. when I start e.g. gala20, nothing happens:

$ lxc start gala20
$ lxc ls gala20
+--------+---------+------+------+-----------+-----------+
|  NAME  |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+------+------+-----------+-----------+
| gala20 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+

lxc start gala20 gives the impression that the starting of the node was successful. but an immediate lxc ls of that node shows that state is still STOPPED.
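
Since `lxc start` returning silently is not proof that the container stayed up, a wrapper can start an instance and then check the `Status:` field that `lxc info` prints. This is a rough sketch: `instance_status`, `start_checked` and `SETTLE_SECONDS` are hypothetical names, and the 2-second settle delay is an arbitrary assumption.

```shell
#!/bin/sh
# Print the Status field of an instance, upper-cased so that
# "Running" and "RUNNING" compare the same.
instance_status() {
    lxc info "$1" | awk '/^Status:/ { print toupper($2); exit }'
}

# Hypothetical wrapper: start an instance, wait briefly, then verify
# that it is actually running. Dumps the instance log on failure.
start_checked() {
    lxc start "$1" || return 1
    sleep "${SETTLE_SECONDS:-2}"   # assumed delay, adjust as needed
    if [ "$(instance_status "$1")" != "RUNNING" ]; then
        echo "error: $1 is not running after start" >&2
        lxc info --show-log "$1" >&2
        return 1
    fi
}

# Example:
#   start_checked gala20 || echo "gala20 did not stay up"
```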

the log for that node shows this:

# tail -f /var/log/lxd/gala_gala20/lxc.log
lxc gala_gala20 20221129005309.578 WARN     cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_setup_limits_legacy:3147 - Invalid argument - Ignoring legacy cgroup limits on pure cgroup2 system
lxc gala_gala20 20221129005309.186 WARN     cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_setup_limits_legacy:3147 - Invalid argument - Ignoring legacy cgroup limits on pure cgroup2 system
lxc 20221129005309.488 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20221129005309.488 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_state"
# lxd version
5.8
# lxc version
Client version: 5.8
Server version: 5.8

there are no log entries in /var/log/messages regarding this at all. I doubt that I hit any sysctl limits. the only thing I did was launching 20 ubuntu nodes. though this error message Failed to receive file descriptors for command "get_state" could be a hint that I ran out of file descriptors. after 12 running nodes?
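
The file-descriptor suspicion can be checked directly rather than inferred from the log line. A minimal sketch, assuming a Linux host with /proc mounted; `summarize_file_nr` is a hypothetical helper, and nothing in this thread confirms that descriptors are the cause.

```shell
#!/bin/sh
# /proc/sys/fs/file-nr holds three numbers: allocated file handles,
# allocated-but-unused handles, and the system-wide maximum.
summarize_file_nr() {
    awk '{
        pct = ($3 > 0) ? int(100 * $1 / $3) : 0
        printf "allocated=%s max=%s used_percent=%d\n", $1, $3, pct
    }'
}

if [ -r /proc/sys/fs/file-nr ]; then
    summarize_file_nr < /proc/sys/fs/file-nr
fi

# Soft limit on open files for the current shell.
echo "ulimit -n: $(ulimit -n)"
```

The daemon's own usage can be counted as root with `ls /proc/<pid>/fd | wc -l`, where `<pid>` is the LXD process ID.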

omani commented 1 year ago

I thought I'd try to reproduce this on another machine of mine (local machine).

now when I want to launch an ubuntu image I get:

$ lxc launch images:ubuntu/22.04 u1
Creating u1
Error: Failed instance creation: Unable to fetch https://images.linuxcontainers.org/images/ubuntu/jammy/amd64/default/20221127_07:42/lxd.tar.xz: 404 Not Found

^ what's up with this messed up URL anyway?

this is a completely different machine. alpine 3.16. lxd 5.0 (apk add lxd).

it's getting really annoying. I get more and more disappointed with this piece of software. what's going on here? the only thing I did with LXD was troubleshooting it (for 10 hours now). error after error.

I got nothing productive done with LXD today. bad experience. I'm a big fan of LXD but this is a no-go (I'm talking to myself).

omani commented 1 year ago

follow up: I was afk for 1 hour and came back and thought why not try again and now suddenly it works:

lxc launch images:ubuntu/22.04 u1
Creating u1
Starting u1

it retrieved and downloaded the rootfs from https://images.linuxcontainers.org. just like that. I didn't do anything.

omani commented 1 year ago

ok, here is the result on my local machine:

$ for i in `seq -w 1 20`; do sudo lxc launch images:ubuntu/22.04 gala$i; done
Creating gala01
Starting gala01
Creating gala02
Starting gala02
Creating gala03
Starting gala03
Creating gala04
Starting gala04
Creating gala05
Starting gala05
Creating gala06
Starting gala06
Creating gala07
Starting gala07
Creating gala08
Starting gala08
Creating gala09
Starting gala09
Creating gala10
Starting gala10
Creating gala11
Starting gala11
Creating gala12
Starting gala12
Creating gala13
Starting gala13
Creating gala14
Starting gala14
Creating gala15
Starting gala15
Creating gala16
Starting gala16
Creating gala17
Starting gala17
Creating gala18
Starting gala18
Creating gala19
Starting gala19
Creating gala20
Starting gala20
+--------+---------+----------------------+------+-----------+-----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+----------------------+------+-----------+-----------+
| gala01 | RUNNING | 10.177.95.199 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala02 | RUNNING | 10.177.95.119 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala03 | RUNNING | 10.177.95.197 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala04 | RUNNING | 10.177.95.140 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala05 | RUNNING | 10.177.95.135 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala06 | RUNNING | 10.177.95.181 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala07 | RUNNING | 10.177.95.122 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala08 | RUNNING | 10.177.95.159 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala09 | RUNNING | 10.177.95.168 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala10 | RUNNING | 10.177.95.188 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala11 | RUNNING | 10.177.95.193 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala12 | RUNNING | 10.177.95.149 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala13 | RUNNING | 10.177.95.182 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala14 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala15 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala16 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala17 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala18 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala19 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala20 | STOPPED |                      |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+

same effect. it stopped working after 13 containers.

$ lxc version
Client version: 4.0.9
Server version: 4.0.9

doesn't matter which version it is.

please somebody try to reproduce this. loop over 20 machines, launch ubuntu and see what happens.
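
For anyone attempting the reproduction, the result is easier to compare across hosts as a tally than as a pasted table. A small sketch, assuming the `name,state` CSV produced by `lxc list --format csv -c ns`; `count_states` is a hypothetical helper name.

```shell
#!/bin/sh
# Tally instance states from "name,state" CSV lines on stdin.
# Output is one "STATE=count" line per state, sorted by state name.
count_states() {
    awk -F, 'NF >= 2 { n[$2]++ } END { for (s in n) printf "%s=%d\n", s, n[s] }' | sort
}

# Example (requires a working LXD client):
#   lxc list --format csv -c ns | count_states
```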

tomponline commented 1 year ago

We do indeed do this sort of thing, for hundreds of containers, concurrently, every day as part of our automated performance tests. So this will most likely be something environmental in your setup, which is why it is more appropriate to post over at https://discuss.linuxcontainers.org/ to get support.

I'll try and reproduce.

tomponline commented 1 year ago

Our build infrastructure is currently moving which may explain the 404 error on the image.

tomponline commented 1 year ago

As you're not using the snap package but a third-party package instead, please can you provide the commands/steps you used to set up LXD on Alpine, along with the image you are using and the config for the instances (lxc config show (instance) --expanded).

omani commented 1 year ago
$ lxc start --debug gala13
DBUG[11-29|16:50:49] Connecting to a local LXD over a Unix socket
DBUG[11-29|16:50:49] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
DBUG[11-29|16:50:49] Got response struct from LXD
DBUG[11-29|16:50:49]
        {
                "config": {
                        "images.auto_update_interval": "0"
                },
                "api_extensions": [
                        "storage_zfs_remove_snapshots",
                        "container_host_shutdown_timeout",
                        "container_stop_priority",
                        "container_syscall_filtering",
                        "auth_pki",
                        "container_last_used_at",
                        "etag",
                        "patch",
                        "usb_devices",
                        "https_allowed_credentials",
                        "image_compression_algorithm",
                        "directory_manipulation",
                        "container_cpu_time",
                        "storage_zfs_use_refquota",
                        "storage_lvm_mount_options",
                        "network",
                        "profile_usedby",
                        "container_push",
                        "container_exec_recording",
                        "certificate_update",
                        "container_exec_signal_handling",
                        "gpu_devices",
                        "container_image_properties",
                        "migration_progress",
                        "id_map",
                        "network_firewall_filtering",
                        "network_routes",
                        "storage",
                        "file_delete",
                        "file_append",
                        "network_dhcp_expiry",
                        "storage_lvm_vg_rename",
                        "storage_lvm_thinpool_rename",
                        "network_vlan",
                        "image_create_aliases",
                        "container_stateless_copy",
                        "container_only_migration",
                        "storage_zfs_clone_copy",
                        "unix_device_rename",
                        "storage_lvm_use_thinpool",
                        "storage_rsync_bwlimit",
                        "network_vxlan_interface",
                        "storage_btrfs_mount_options",
                        "entity_description",
                        "image_force_refresh",
                        "storage_lvm_lv_resizing",
                        "id_map_base",
                        "file_symlinks",
                        "container_push_target",
                        "network_vlan_physical",
                        "storage_images_delete",
                        "container_edit_metadata",
                        "container_snapshot_stateful_migration",
                        "storage_driver_ceph",
                        "storage_ceph_user_name",
                        "resource_limits",
                        "storage_volatile_initial_source",
                        "storage_ceph_force_osd_reuse",
                        "storage_block_filesystem_btrfs",
                        "resources",
                        "kernel_limits",
                        "storage_api_volume_rename",
                        "macaroon_authentication",
                        "network_sriov",
                        "console",
                        "restrict_devlxd",
                        "migration_pre_copy",
                        "infiniband",
                        "maas_network",
                        "devlxd_events",
                        "proxy",
                        "network_dhcp_gateway",
                        "file_get_symlink",
                        "network_leases",
                        "unix_device_hotplug",
                        "storage_api_local_volume_handling",
                        "operation_description",
                        "clustering",
                        "event_lifecycle",
                        "storage_api_remote_volume_handling",
                        "nvidia_runtime",
                        "container_mount_propagation",
                        "container_backup",
                        "devlxd_images",
                        "container_local_cross_pool_handling",
                        "proxy_unix",
                        "proxy_udp",
                        "clustering_join",
                        "proxy_tcp_udp_multi_port_handling",
                        "network_state",
                        "proxy_unix_dac_properties",
                        "container_protection_delete",
                        "unix_priv_drop",
                        "pprof_http",
                        "proxy_haproxy_protocol",
                        "network_hwaddr",
                        "proxy_nat",
                        "network_nat_order",
                        "container_full",
                        "candid_authentication",
                        "backup_compression",
                        "candid_config",
                        "nvidia_runtime_config",
                        "storage_api_volume_snapshots",
                        "storage_unmapped",
                        "projects",
                        "candid_config_key",
                        "network_vxlan_ttl",
                        "container_incremental_copy",
                        "usb_optional_vendorid",
                        "snapshot_scheduling",
                        "snapshot_schedule_aliases",
                        "container_copy_project",
                        "clustering_server_address",
                        "clustering_image_replication",
                        "container_protection_shift",
                        "snapshot_expiry",
                        "container_backup_override_pool",
                        "snapshot_expiry_creation",
                        "network_leases_location",
                        "resources_cpu_socket",
                        "resources_gpu",
                        "resources_numa",
                        "kernel_features",
                        "id_map_current",
                        "event_location",
                        "storage_api_remote_volume_snapshots",
                        "network_nat_address",
                        "container_nic_routes",
                        "rbac",
                        "cluster_internal_copy",
                        "seccomp_notify",
                        "lxc_features",
                        "container_nic_ipvlan",
                        "network_vlan_sriov",
                        "storage_cephfs",
                        "container_nic_ipfilter",
                        "resources_v2",
                        "container_exec_user_group_cwd",
                        "container_syscall_intercept",
                        "container_disk_shift",
                        "storage_shifted",
                        "resources_infiniband",
                        "daemon_storage",
                        "instances",
                        "image_types",
                        "resources_disk_sata",
                        "clustering_roles",
                        "images_expiry",
                        "resources_network_firmware",
                        "backup_compression_algorithm",
                        "ceph_data_pool_name",
                        "container_syscall_intercept_mount",
                        "compression_squashfs",
                        "container_raw_mount",
                        "container_nic_routed",
                        "container_syscall_intercept_mount_fuse",
                        "container_disk_ceph",
                        "virtual-machines",
                        "image_profiles",
                        "clustering_architecture",
                        "resources_disk_id",
                        "storage_lvm_stripes",
                        "vm_boot_priority",
                        "unix_hotplug_devices",
                        "api_filtering",
                        "instance_nic_network",
                        "clustering_sizing",
                        "firewall_driver",
                        "projects_limits",
                        "container_syscall_intercept_hugetlbfs",
                        "limits_hugepages",
                        "container_nic_routed_gateway",
                        "projects_restrictions",
                        "custom_volume_snapshot_expiry",
                        "volume_snapshot_scheduling",
                        "trust_ca_certificates",
                        "snapshot_disk_usage",
                        "clustering_edit_roles",
                        "container_nic_routed_host_address",
                        "container_nic_ipvlan_gateway",
                        "resources_usb_pci",
                        "resources_cpu_threads_numa",
                        "resources_cpu_core_die",
                        "api_os",
                        "resources_system",
                        "usedby_consistency",
                        "resources_gpu_mdev",
                        "console_vga_type",
                        "projects_limits_disk",
                        "storage_rsync_compression",
                        "gpu_mdev",
                        "resources_pci_iommu",
                        "resources_network_usb",
                        "resources_disk_address",
                        "network_state_vlan",
                        "gpu_sriov",
                        "migration_stateful",
                        "disk_state_quota",
                        "storage_ceph_features",
                        "gpu_mig",
                        "clustering_join_token",
                        "clustering_description",
                        "server_trusted_proxy",
                        "clustering_update_cert",
                        "storage_api_project",
                        "server_instance_driver_operational",
                        "server_supported_storage_drivers",
                        "event_lifecycle_requestor_address",
                        "resources_gpu_usb",
                        "network_counters_errors_dropped",
                        "image_source_project",
                        "database_leader",
                        "instance_all_projects",
                        "ceph_rbd_du",
                        "qemu_metrics",
                        "gpu_mig_uuid",
                        "event_project",
                        "instance_allow_inconsistent_copy",
                        "image_restrictions"
                ],
                "api_status": "stable",
                "api_version": "1.0",
                "auth": "trusted",
                "public": false,
                "auth_methods": [
                        "tls"
                ],
                "environment": {
                        "addresses": [],
                        "architectures": [
                                "x86_64",
                                "i686"
                        ],
                        "certificate": "-----BEGIN CERTIFICATE-----\nbla\n-----END CERTIFICATE-----\n",
                        "certificate_fingerprint": "aa60457f61be62bea58b40bff7e075a7afb6c049b77a343bfacfe414acb3fb7a",
                        "driver": "lxc | qemu",
                        "driver_version": "4.0.12 | 7.0.0",
                        "firewall": "nftables",
                        "kernel": "Linux",
                        "kernel_architecture": "x86_64",
                        "kernel_features": {
                                "netnsid_getifaddrs": "true",
                                "seccomp_listener": "true",
                                "seccomp_listener_continue": "true",
                                "shiftfs": "false",
                                "uevent_injection": "true",
                                "unpriv_fscaps": "true"
                        },
                        "kernel_version": "5.15.78-0-lts",
                        "lxc_features": {
                                "cgroup2": "true",
                                "core_scheduling": "true",
                                "devpts_fd": "true",
                                "idmapped_mounts_v2": "true",
                                "mount_injection_file": "true",
                                "network_gateway_device_route": "true",
                                "network_ipvlan": "true",
                                "network_l2proxy": "true",
                                "network_phys_macvlan_mtu": "true",
                                "network_veth_router": "true",
                                "pidfd": "true",
                                "seccomp_allow_deny_syntax": "true",
                                "seccomp_notify": "true",
                                "seccomp_proxy_send_notify_fd": "true"
                        },
                        "os_name": "Alpine Linux",
                        "os_version": "3.16.3",
                        "project": "default",
                        "server": "lxd",
                        "server_clustered": false,
                        "server_name": "home",
                        "server_pid": 7111,
                        "server_version": "4.0.9",
                        "storage": "dir",
                        "storage_version": "1",
                        "storage_supported_drivers": [
                                {
                                        "Name": "dir",
                                        "Version": "1",
                                        "Remote": false
                                }
                        ]
                }
        }
DBUG[11-29|16:50:49] Sending request to LXD                   method=GET url=http://unix.socket/1.0/instances/gala13 etag=
DBUG[11-29|16:50:49] Got response struct from LXD
DBUG[11-29|16:50:49]
        {
                "architecture": "x86_64",
                "config": {
                        "image.architecture": "amd64",
                        "image.description": "Ubuntu jammy amd64 (20221127_07:42)",
                        "image.os": "Ubuntu",
                        "image.release": "jammy",
                        "image.serial": "20221127_07:42",
                        "image.type": "squashfs",
                        "image.variant": "default",
                        "volatile.base_image": "309874cb3bac23616ebca180db7b6d1f151175869e716d079cb28e1a103a143c",
                        "volatile.eth0.hwaddr": "00:16:3e:6e:32:f9",
                        "volatile.idmap.base": "0",
                        "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                        "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                        "volatile.last_state.idmap": "[]",
                        "volatile.last_state.power": "STOPPED",
                        "volatile.uuid": "908a2585-00e5-44ed-83dd-08c6aca08a9d"
                },
                "devices": {},
                "ephemeral": false,
                "profiles": [
                        "default"
                ],
                "stateful": false,
                "description": "",
                "created_at": "2022-11-29T03:33:29.353883422Z",
                "expanded_config": {
                        "image.architecture": "amd64",
                        "image.description": "Ubuntu jammy amd64 (20221127_07:42)",
                        "image.os": "Ubuntu",
                        "image.release": "jammy",
                        "image.serial": "20221127_07:42",
                        "image.type": "squashfs",
                        "image.variant": "default",
                        "volatile.base_image": "309874cb3bac23616ebca180db7b6d1f151175869e716d079cb28e1a103a143c",
                        "volatile.eth0.hwaddr": "00:16:3e:6e:32:f9",
                        "volatile.idmap.base": "0",
                        "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                        "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                        "volatile.last_state.idmap": "[]",
                        "volatile.last_state.power": "STOPPED",
                        "volatile.uuid": "908a2585-00e5-44ed-83dd-08c6aca08a9d"
                },
                "expanded_devices": {
                        "eth0": {
                                "name": "eth0",
                                "network": "lxdbr0",
                                "type": "nic"
                        },
                        "root": {
                                "path": "/",
                                "pool": "default",
                                "type": "disk"
                        }
                },
                "name": "gala13",
                "status": "Stopped",
                "status_code": 102,
                "last_used_at": "2022-11-29T13:47:22.125783839Z",
                "location": "none",
                "type": "container",
                "project": "default"
        }
DBUG[11-29|16:50:49] Connected to the websocket: ws://unix.socket/1.0/events
DBUG[11-29|16:50:49] Sending request to LXD                   method=PUT url=http://unix.socket/1.0/instances/gala13/state etag=
DBUG[11-29|16:50:49]
        {
                "action": "start",
                "timeout": 0,
                "force": false,
                "stateful": false
        }
DBUG[11-29|16:50:49] Got operation from LXD
DBUG[11-29|16:50:49]
        {
                "id": "514d49c6-6d56-437d-a7d2-cecfa2c22b47",
                "class": "task",
                "description": "Starting instance",
                "created_at": "2022-11-29T16:50:49.386043636+03:00",
                "updated_at": "2022-11-29T16:50:49.386043636+03:00",
                "status": "Running",
                "status_code": 103,
                "resources": {
                        "instances": [
                                "/1.0/instances/gala13"
                        ]
                },
                "metadata": null,
                "may_cancel": false,
                "err": "",
                "location": "none"
        }
DBUG[11-29|16:50:49] Sending request to LXD                   method=GET url=http://unix.socket/1.0/operations/514d49c6-6d56-437d-a7d2-cecfa2c22b47 etag=
DBUG[11-29|16:50:49] Got response struct from LXD
DBUG[11-29|16:50:49]
        {
                "id": "514d49c6-6d56-437d-a7d2-cecfa2c22b47",
                "class": "task",
                "description": "Starting instance",
                "created_at": "2022-11-29T16:50:49.386043636+03:00",
                "updated_at": "2022-11-29T16:50:49.386043636+03:00",
                "status": "Running",
                "status_code": 103,
                "resources": {
                        "instances": [
                                "/1.0/instances/gala13"
                        ]
                },
                "metadata": null,
                "may_cancel": false,
                "err": "",
                "location": "none"
        }
$ lxc ls gala13 --debug
...
  {
                        "architecture": "x86_64",
                        "config": {
                                "image.architecture": "amd64",
                                "image.description": "Ubuntu jammy amd64 (20221127_07:42)",
                                "image.os": "Ubuntu",
                                "image.release": "jammy",
                                "image.serial": "20221127_07:42",
                                "image.type": "squashfs",
                                "image.variant": "default",
                                "volatile.base_image": "309874cb3bac23616ebca180db7b6d1f151175869e716d079cb28e1a103a143c",
                                "volatile.eth0.hwaddr": "00:16:3e:6e:32:f9",
                                "volatile.idmap.base": "0",
                                "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                                "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                                "volatile.last_state.idmap": "[]",
                                "volatile.last_state.power": "STOPPED",
                                "volatile.uuid": "908a2585-00e5-44ed-83dd-08c6aca08a9d"
                        },
                        "devices": {},
                        "ephemeral": false,
                        "profiles": [
                                "default"
                        ],
                        "stateful": false,
                        "description": "",
                        "created_at": "2022-11-29T03:33:29.353883422Z",
                        "expanded_config": {
                                "image.architecture": "amd64",
                                "image.description": "Ubuntu jammy amd64 (20221127_07:42)",
                                "image.os": "Ubuntu",
                                "image.release": "jammy",
                                "image.serial": "20221127_07:42",
                                "image.type": "squashfs",
                                "image.variant": "default",
                                "volatile.base_image": "309874cb3bac23616ebca180db7b6d1f151175869e716d079cb28e1a103a143c",
                                "volatile.eth0.hwaddr": "00:16:3e:6e:32:f9",
                                "volatile.idmap.base": "0",
                                "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                                "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
                                "volatile.last_state.idmap": "[]",
                                "volatile.last_state.power": "STOPPED",
                                "volatile.uuid": "908a2585-00e5-44ed-83dd-08c6aca08a9d"
                        },
                        "expanded_devices": {
                                "eth0": {
                                        "name": "eth0",
                                        "network": "lxdbr0",
                                        "type": "nic"
                                },
                                "root": {
                                        "path": "/",
                                        "pool": "default",
                                        "type": "disk"
                                }
                        },
                        "name": "gala13",
                        "status": "Stopped",
                        "status_code": 102,
                        "last_used_at": "2022-11-29T13:50:49.47375582Z",
                        "location": "none",
                        "type": "container",
                        "project": "default"
                },

...

again, notice how lxc start gets the response RUNNING back from the API, but a following lxc ls $name shows state STOPPED in the response.
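
A note on reading the debug output: the `"status": "Running"` in the responses above belongs to the operation object (`"class": "task"`, `"description": "Starting instance"`), not to the instance, so it only says that the start task is in progress. Below is a minimal sketch of a helper that checks the instance state after the start and dumps the instance log if it is not running. The function name `start_and_verify` is hypothetical, and a container that dies a moment after starting may still be reported as RUNNING by this single check.

```shell
# Start an instance, then report its actual state.
# Prints the instance log to stderr if the instance is not RUNNING afterwards.
start_and_verify() {
    name="$1"
    lxc start "$name" || return 1
    # Column "s" is the instance state; anchor the name so only this instance matches.
    state="$(lxc list "^${name}\$" --format csv -c s)"
    if [ "$state" != "RUNNING" ]; then
        echo "$name is $state after start" >&2
        lxc info "$name" --show-log >&2
        return 1
    fi
}
```

Usage would be something like `start_and_verify gala13`, which returns non-zero when the instance did not stay up.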

tomponline commented 1 year ago

Please can you advise on the steps you used to install and configure LXD on Alpine? So we can attempt to reproduce.

omani commented 1 year ago

@tomponline we can do an SSH session together in tmux or screen (whatever you prefer) on the live system so you can have a look around for yourself.

omani commented 1 year ago

Please can you advise on the steps you used to install and configure LXD on Alpine? So we can attempt to reproduce.

nothing fancy. installed alpine 3.17 and invoked the command: apk add lxd and you get version 5.0.1. I also tried 5.8 by building from source, which I did according to the official "Installation" docs.

regardless, in both versions I see the same issue. that is on the server. plus, on my local machine, completely different version, I see the same issue as well.

omani commented 1 year ago

hence my offer to do a live interactive SSH session together on the system that has this issue. we can treat it as our lab environment. I'm gonna provision the whole server again from scratch anyway, once we've finished.

tomponline commented 1 year ago

I'll try and reproduce later when I am able to work on this.

tomponline commented 1 year ago

We just added the Alpine 3.17 image to our builders so hopefully that will be built soon and I can just use that to spin up a test VM.

omani commented 1 year ago

We just added the Alpine 3.17 image to our builders so hopefully that will be built soon and I can just use that to spin up a test VM.

can you give us details to the build? for example, when I build LXD on alpine I have to manually fix some things by hand. eg:

furthermore, there is an issue with the exports after make deps, etc., without going into details. I guess you get the same errors when building LXD from source on alpine, because the Makefile in the LXD repository is not suited for alpine. eg. you have to add -lintl and -luv to the CGO_LDFLAGS.

it would be nice if you could give /etc/apk/world from your minimal build system for Alpine Linux 3.17. and your altered Makefile if possible.
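
The linker-flag workaround mentioned above could be sketched as a small helper that is run after setting the exports printed by make deps and before the actual build. This is only an illustration: the function name `add_alpine_ldflags` is hypothetical, and the two flags are the ones named in this comment, not a verified complete list.

```shell
# Append the extra linker flags to CGO_LDFLAGS, keeping whatever is already set.
add_alpine_ldflags() {
    # ${VAR:+...} expands to the old value plus a space only if VAR is non-empty.
    export CGO_LDFLAGS="${CGO_LDFLAGS:+$CGO_LDFLAGS }-lintl -luv"
}
```

Appending rather than overwriting keeps any library paths already exported in the same shell.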

omani commented 1 year ago

oh, would you look at that:

+--------+---------+----------------------+------+-----------+-----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+----------------------+------+-----------+-----------+
| gala01 | RUNNING | 10.177.95.120 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala02 | RUNNING | 10.177.95.176 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala03 | RUNNING | 10.177.95.129 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala04 | RUNNING | 10.177.95.183 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala05 | RUNNING | 10.177.95.180 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala06 | RUNNING | 10.177.95.152 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala07 | RUNNING | 10.177.95.162 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala08 | RUNNING | 10.177.95.170 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala09 | RUNNING | 10.177.95.163 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala10 | RUNNING | 10.177.95.128 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala11 | RUNNING | 10.177.95.171 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala12 | RUNNING | 10.177.95.127 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala13 | RUNNING | 10.177.95.113 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala14 | RUNNING | 10.177.95.187 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala15 | RUNNING | 10.177.95.143 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala16 | RUNNING | 10.177.95.160 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala17 | RUNNING | 10.177.95.111 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala18 | RUNNING | 10.177.95.110 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala19 | RUNNING | 10.177.95.141 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+
| gala20 | RUNNING | 10.177.95.144 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+

this is done with:

for i in `seq -w 1 20`; do lxc launch images:alpine/3.16 gala$i; done

that means, alpine images work! but not ubuntu.

now look at this:

$ for i in `seq -w 1 20`; do lxc launch images:ubuntu/22.04 gala$i; done
+--------+---------+------+------+-----------+-----------+
|  NAME  |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+------+------+-----------+-----------+
| gala01 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala02 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala03 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala04 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala05 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala06 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala07 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala08 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala09 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala10 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala11 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala12 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala13 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala14 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala15 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala16 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala17 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala18 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala19 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+
| gala20 | STOPPED |      |      | CONTAINER | 0         |
+--------+---------+------+------+-----------+-----------+

all in STOPPED state!

same for ubuntu/20.04.

so that means I could boil this down to the images. the question is: what is my LXD host missing that it is not able to start an ubuntu image. what does the ubuntu image need from the host? maybe some packages missing on the LXD host?

this issue is reproducible for me.

omani commented 1 year ago

/var/log/lxd/gala01/console.log seems to have some valuable information:

Failed to look up module alias 'autofs4': Function not implemented
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems.
Exiting PID 1...

something told me it has to do with systemd. who doesn't like this awesome invention named systemd? so, what can I do to fix this?
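
Since lxc start itself prints nothing here, the console log is the place to look. A minimal sketch of a check for the two messages quoted above; the function name `systemd_cgroup_failure` is hypothetical, and the log path in the comment is the one from this comment and may differ on other installs.

```shell
# Print the systemd cgroup errors found in a container's console log.
# $1: path to the log, e.g. /var/log/lxd/gala01/console.log
# Returns non-zero if neither message is present.
systemd_cgroup_failure() {
    grep -E 'Failed to mount (cgroup at /sys/fs/cgroup/systemd|API filesystems)' "$1"
}
```

If this prints anything, the container's systemd gave up because the host offers no name=systemd cgroup hierarchy.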

tomponline commented 1 year ago

It needs the cgroups for systemd. There's a setting in the alpine lxc init script for that.

tomponline commented 1 year ago

See https://git.alpinelinux.org/aports/tree/main/lxc/lxc.confd?h=3.17-stable

omani commented 1 year ago

let's see how I can fix this. I see an issue regarding this: https://github.com/lxc/lxc/issues/4072. since I don't have grub on alpine 3.17 I will go with the solution mentioned in the last comment in that issue.

mkdir -p /sys/fs/cgroup/systemd && mount -t cgroup cgroup -o none,name=systemd /sys/fs/cgroup/systemd

$ lxc start gala01
$ lxc ls gala01
+--------+---------+----------------------+------+-----------+-----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+--------+---------+----------------------+------+-----------+-----------+
| gala01 | RUNNING | 10.177.95.200 (eth0) |      | CONTAINER | 0         |
+--------+---------+----------------------+------+-----------+-----------+

there you go!
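
A mount made by hand like this does not survive a reboot, and running the command a second time tries to mount the hierarchy again. A minimal sketch that wraps the same command in a check first; the function names are hypothetical, and the check assumes the mount shows up in /proc/mounts with filesystem type cgroup.

```shell
# Return 0 if the name=systemd hierarchy is listed in the given mounts file
# (defaults to /proc/mounts): field 2 is the mount point, field 3 the fs type.
systemd_cgroup_mounted() {
    awk '$2 == "/sys/fs/cgroup/systemd" && $3 == "cgroup" { found = 1 }
         END { exit !found }' "${1:-/proc/mounts}"
}

# Mount the hierarchy only if it is not there yet (needs root).
ensure_systemd_cgroup() {
    systemd_cgroup_mounted && return 0
    mkdir -p /sys/fs/cgroup/systemd &&
        mount -t cgroup cgroup -o none,name=systemd /sys/fs/cgroup/systemd
}
```

Running `ensure_systemd_cgroup` as root before `lxc start` would then be safe to repeat.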

tomponline commented 1 year ago

Although it doesn't explain why you were able to start 7 of them before, otherwise I would have suggested this sooner, as it's a common issue for those running ubuntu on alpine.

omani commented 1 year ago

Although it doesn't explain why you were able to start 7 of them before, otherwise I would have suggested this sooner, as it's a common issue for those running ubuntu on alpine.

unfortunate.

but since I've built from source, I dont have /etc/init.d/lx*.

is there another option for me? except the above step with creating it manually.

tomponline commented 1 year ago

Yep, see https://github.com/lxc/lxd/issues/11167#issuecomment-1331198814: if you start the lxc service (without any lxc containers) with that setting, it will mount the cgroups for you.

https://git.alpinelinux.org/aports/tree/main/lxc/lxc.initd?h=3.17-stable#n13

tomponline commented 1 year ago

Yes, because it's the lxc service from alpine, not the lxd service you built manually (it's confusing but not something we have control over).

omani commented 1 year ago

yes but it says # Configuration for /etc/init.d/lxc[.*]. I dont have any init.d script. I dont have lxc/lxd services. I've built from source.

but I found these files:

/etc/lxc# ll
total 8.0K
-rw-r--r-- 1 root root 23 Mar 28  2022 default.conf.apk-new
-rw-r--r-- 1 root root 91 Aug  5 15:56 default.conf

/etc/lxc# cat default.conf
lxc.net.0.type = empty
lxc.idmap = u 0 100000 1000000000
lxc.idmap = g 0 100000 1000000000

is it possible in this file?

tomponline commented 1 year ago

I suggest you stick with the alpine packages now you know what the issue is.

It's covered in the alpine wiki https://wiki.alpinelinux.org/wiki/LXD

omani commented 1 year ago

Yep, see #11167 (comment): if you start the lxc service (without any lxc containers) with that setting, it will mount the cgroups for you.

https://git.alpinelinux.org/aports/tree/main/lxc/lxc.initd?h=3.17-stable#n13

I missed this comment. thanks for the link. I can create my own openrc script.

thanks for your help.
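
For readers in the same position (LXD built from source, no Alpine init scripts installed), a rough sketch of what such an OpenRC script might look like. It is not the Alpine lxc script; the file name, the description text and the `after cgroups` ordering are assumptions, and the mount command is the one used earlier in this thread.

```shell
#!/sbin/openrc-run
# Hypothetical service, e.g. saved as /etc/init.d/systemd-cgroup (name is an assumption).

description="Mount the name=systemd cgroup hierarchy for systemd-based containers"

CGROUP_DIR="/sys/fs/cgroup/systemd"

depend() {
    # Assumption: order after the stock cgroups service when it exists.
    after cgroups
}

start() {
    ebegin "Mounting ${CGROUP_DIR}"
    # Skip the mount if something is already mounted there.
    if ! mountpoint -q "${CGROUP_DIR}"; then
        mkdir -p "${CGROUP_DIR}" &&
            mount -t cgroup cgroup -o none,name=systemd "${CGROUP_DIR}"
    fi
    eend $?
}

stop() {
    ebegin "Unmounting ${CGROUP_DIR}"
    if mountpoint -q "${CGROUP_DIR}"; then
        umount "${CGROUP_DIR}"
    fi
    eend $?
}
```

The script would need to be made executable and added to a runlevel with rc-update so that the mount is in place before LXD starts any Ubuntu containers.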