ganto / copr-lxc3

RPM spec files for building lxc-3 on Fedora COPR
MIT License

LXD 3.20 Upgrade Containers Stopped Working #21

Closed: mrmateo closed this issue 4 years ago

mrmateo commented 4 years ago

Running Fedora 31 with LXD installed via the steps on https://copr.fedorainfracloud.org/coprs/ganto/lxc3/ . Everything was working really well, although for systemd containers I had to set lxc config set <container-name> raw.lxc 'lxc.init.cmd = /sbin/init systemd.unified_cgroup_hierarchy' for them to work properly (I found that command in https://discuss.linuxcontainers.org/t/cgroups-v2-adoption/6074/9). The biggest tell that they were not working was that they would not get an IPv4 address and could only be stopped via --force. That wasn't a big deal since it only needed to be set once per container. I upgraded to LXD 3.20 today and now I cannot get any containers to work, even after setting that.
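
For reference, the workaround amounts to the following, with <container-name> standing in for each affected container (a sketch based on the command above; raw.lxc changes only take effect on the next start):

lxc config set <container-name> raw.lxc 'lxc.init.cmd = /sbin/init systemd.unified_cgroup_hierarchy'
lxc restart <container-name> --force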

Wanted to post here before asking in the general LXD forum, since it may be specific to this distribution for Fedora/CentOS with cgroupv2 enabled. I do not want to disable cgroupv2, but I can provide any logging or debug output that could be useful; just let me know what to run to get the output, as I am only a general LXD user.

Thanks!

ganto commented 4 years ago

Hmn, ok. I haven't tried that yet. Can you maybe show me an easy way to reproduce your error?

mrmateo commented 4 years ago

Sure! Here is an example -- let me know if you would like any additional info about my environment although here are some basics:

matt@mateopc:~$ uname -a
Linux mateopc 5.4.17-200.fc31.x86_64 #1 SMP Sat Feb 1 19:00:13 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
matt@mateopc:~$ cat /etc/fedora-release
Fedora release 31 (Thirty One)
matt@mateopc:~$ lxd --version
3.20
matt@mateopc:~$ lxc info
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICHzCCAaWgAwIBAgIRALaymvg/sD45IaaDgw/wMwwwCgYIKoZIzj0EAwMwNTEc
    MBoGA1UEChMTbGludXhweFwsAZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBtYXRl
    b3BjMB4XDTIwMDIwNDE1NTczMloXDTMwMDIwMTE1NTczMlowNTEcMBoGA1UEChMT
    bGludXhjb250YWluZXJzOp9yZzEVMBMGA1UEAwwMcm9vdEBtYXRlb3BjMHYwEAYH
    KoZIzj0CAQYFK4EEACIDYgAECcq+kD5wVv/3GXHmi/KnBn0WdCJSOJIV/fWHSRq0
    VMKEYs69+7JE2Wkt4c/7DhVe5kCItenDaouKUk+CYz2JebIwmVxUftdSp3W9Bxkp
    oD49M2lp5xpjv5wRgHNvqNGgo3kwdzAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAww
    CgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADBCBgNVHREEOzA5ggdtYXRlb3BjhwTA
    qAFkhxAmAIgAMAASE5/6EM2IskXzhxAmAIgAMAA1+tz9Xhh8QvoNhwTAqHoBMAoG
    CCqGSM49BAMDA2gAMGUCMQDv48OYKft7JJfMotTH0J5Px3cMV7X3fF7sDN8LAihI9
    +131dBD4p7oIA7qawzzBBmG8CMGt0L0LhdNTEYmu+voBA1eXIV7qsUSpfp0JFbCa
    H1iPDVYOEWAJlEnIG/AU8zWd2w==
    -----END CERTIFICATE-----
  certificate_fingerprint: ea90e400cd68763f4707c454bd15b4163ed14e70c6c99dd77d56a4a6205c5bfe
  driver: lxc
  driver_version: 3.2.1
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.17-200.fc31.x86_64
  lxc_features:
    cgroup2: "false"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    seccomp_notify: "true"
  project: default
  server: lxd
  server_clustered: false
  server_name: mateopc
  server_pid: 4457
  server_version: "3.20"
  storage: btrfs
  storage_version: "5.4"

And here is the issue in action:

matt@mateopc:~$ lxc list
+---------+---------+------+------+-----------+-----------+
|  NAME   |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+---------+---------+------+------+-----------+-----------+
| aws-cli | STOPPED |      |      | CONTAINER | 0         |
+---------+---------+------+------+-----------+-----------+
matt@mateopc:~$ lxc launch ubuntu:18.04
Creating the instance
Instance name is: oriented-skunk
Starting oriented-skunk
matt@mateopc:~$ lxc list       
+----------------+---------+------+----------------------------------------------+-----------+-----------+
|      NAME      |  STATE  | IPV4 |                     IPV6                     |   TYPE    | SNAPSHOTS |
+----------------+---------+------+----------------------------------------------+-----------+-----------+
| aws-cli        | STOPPED |      |                                              | CONTAINER | 0         |
+----------------+---------+------+----------------------------------------------+-----------+-----------+
| oriented-skunk | RUNNING |      | fd42:959f:1c1:9599:216:3eff:fe0b:a4a6 (eth0) | CONTAINER | 0         |
+----------------+---------+------+----------------------------------------------+-----------+-----------+
matt@mateopc:~$ lxc exec oriented-skunk -- /bin/bash
root@oriented-skunk:~# ping google.com
ping: google.com: Temporary failure in name resolution
root@oriented-skunk:~# dhclient
cmp: EOF on /tmp/tmp.z5XlifALLq which is empty
System has not been booted with systemd as init system (PID 1). Can't operate.
root@oriented-skunk:~# apt update
Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Reading package lists... Done         
Building dependency tree       
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/bionic-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.

Let me know if that helps or if there is anything else I can show.

Thanks!

PS: Is there a way to downgrade lxd so that I can check whether it still works with 3.18? When I try, it says 3.20 is the lowest version in your repo:

matt@mateopc:~$ sudo dnf downgrade lxd
Last metadata expiration check: 2:00:48 ago on Wed 12 Feb 2020 11:51:38 AM MST.
Package lxd of lowest version already installed, cannot downgrade it.
Dependencies resolved.
Nothing to do.
Complete!
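
For what it's worth, the versions a repository actually offers can be listed with dnf (a generic check, output omitted here):

sudo dnf --showduplicates list lxd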

XeCycle commented 4 years ago

I ran into a similar problem but without cgroupv2 (on CentOS 7), so I guess it's not specific to v2. That cmdline arg to init trick does not work here, however; I have to set systemd containers to privileged.
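
Making a container privileged is a single config key, e.g. (a sketch; <container-name> is a placeholder):

lxc config set <container-name> security.privileged true
lxc restart <container-name> --force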

Here I found that the generated lxc.conf says

lxc.mount.auto = proc:rw sys:rw

however 3.18 from this copr repo, and 3.20 on archlinux, have

lxc.mount.auto = proc:rw sys:rw cgroup:mixed

3.18 on CentOS 7 is cgroup1-only; 3.20 on archlinux gets both v1 and v2. Not sure why it decided to leave out the cgroup entry.
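
If someone wants to experiment, it should be possible to append the missing entry via raw.lxc (an untested sketch; raw.lxc lines are appended to the configuration LXD generates, so this may conflict with other settings):

lxc config set <container-name> raw.lxc 'lxc.mount.auto = proc:rw sys:rw cgroup:mixed'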

ganto commented 4 years ago

I just installed LXD from the COPR repository on a fresh Fedora 31.

And I also found some issues:

[vagrant@fedora31 ~]$ lxc launch images:fedora/31 f31
Creating f31
Starting f31

[vagrant@fedora31 ~]$ lxc shell f31
Error: Container is not running

[vagrant@fedora31 ~]$ lxc info f31
Name: f31
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/02/14 01:12 UTC
Status: Stopped
Type: container
Profiles: default

The container fails to start, without any error message :open_mouth:. Then there is a hint in the console log:

[vagrant@fedora31 ~]$ sudo cat /var/log/lxd/f31/console.log
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems.              
Exiting PID 1...
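
A quick way to confirm that the host is indeed running a unified (cgroup2-only) hierarchy, as Fedora 31 does by default (a generic check, not specific to LXD):

stat -fc %T /sys/fs/cgroup    # prints "cgroup2fs" on a cgroup2-only host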

In contrast to my Fedora 30 systems where I'm currently using LXD, I find different messages in the LXCFS log.

That doesn't look like an issue with the packaging, but rather with how LXCFS/LXD handle cgroup2-only systems.

Work-around: As a work-around I added systemd.unified_cgroup_hierarchy=0 to the kernel command line and, after a reboot, the container could start and got an IP:

[vagrant@fedora31 ~]$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.4.18-200.fc31.x86_64 root=UUID=95af7b45-6542-4816-9aed-2b70feb90faf ro no_timer_check console=tty1 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0 systemd.unified_cgroup_hierarchy=0
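
On Fedora the kernel argument can also be added persistently with grubby instead of editing the grub configuration by hand (a sketch; verify the result with cat /proc/cmdline after the reboot):

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
sudo reboot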

[vagrant@fedora31 ~]$ lxc start f31
[vagrant@fedora31 ~]$ lxc list
+------+---------+---------------------+------+-----------+-----------+
| NAME |  STATE  |        IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+------+---------+---------------------+------+-----------+-----------+
| f31  | RUNNING | 10.192.200.2 (eth0) |      | CONTAINER | 0         |
+------+---------+---------------------+------+-----------+-----------+

After adding the work-around I also didn't have any issue starting an Ubuntu container:

[vagrant@fedora31 ~]$ lxc launch ubuntu:18.04 bionic
Creating bionic
Starting bionic                             

[vagrant@fedora31 ~]$ lxc list
+--------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
|  NAME  |  STATE  |         IPV4          |                     IPV6                      |   TYPE    | SNAPSHOTS |
+--------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| bionic | RUNNING | 10.192.200.180 (eth0) | fd42:bd12:4c04:9d99:216:3eff:fe75:70e1 (eth0) | CONTAINER | 0         |
+--------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| f31    | RUNNING | 10.192.200.2 (eth0)   | fd42:bd12:4c04:9d99:216:3eff:fe85:975c (eth0) | CONTAINER | 0         |
+--------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
ganto commented 4 years ago

Is there a way to downgrade lxd so that I can try to see if it works on 3.18 still? When I try it says 3.20 is the lowest version in your repo

Ya, that's a bit unfortunate. COPR will only keep the latest successfully built package version. Unfortunately, since I don't use LXD on Fedora 31 yet, I haven't cached the "official" packages anywhere. But I still have the RPMs that I built locally when I was testing the spec file. You can download them from here: https://linuxmonk.ch/packages/lxc3/fedora/31/x86_64/ These RPMs are identical to the COPR ones, just that they weren't built on the Fedora infrastructure.

ganto commented 4 years ago

After switching back to cgroup2-only and the default container configuration, I can reproduce your issue. The Ubuntu container would start, but not get an IPv4 address:

[vagrant@fedora31 ~]$ lxc list
+--------+---------+------+-----------------------------------------------+-----------+-----------+
|  NAME  |  STATE  | IPV4 |                     IPV6                      |   TYPE    | SNAPSHOTS |
+--------+---------+------+-----------------------------------------------+-----------+-----------+
| bionic | RUNNING |      | fd42:bd12:4c04:9d99:216:3eff:fe75:70e1 (eth0) | CONTAINER | 0         |
+--------+---------+------+-----------------------------------------------+-----------+-----------+
| f31    | STOPPED |      |                                               | CONTAINER | 0         |
+--------+---------+------+-----------------------------------------------+-----------+-----------+
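
As an aside, when an instance misbehaves like this, its low-level log can be dumped with a standard LXD command:

lxc info f31 --show-log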

I can also confirm that setting lxc.init.cmd doesn't help. The Fedora 31 container still won't start (same error) with the following settings:

[vagrant@fedora31 ~]$ lxc config show f31
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Fedora 31 amd64 (20200213_20:33)
  image.os: Fedora
  image.release: "31"
  image.serial: "20200213_20:33"
  image.type: squashfs
  raw.lxc: lxc.init.cmd = /sbin/init systemd.unified_cgroup_hierarchy
  volatile.base_image: 17cc572a411a9650ae8ddc6b4e37c01c3215198e66ae5b40a56a80e9d3bb36c3
  volatile.eth0.hwaddr: 00:16:3e:85:97:5c
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: STOPPED
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""
brauner commented 4 years ago

I'll try to find some time to look into this. It would help me if you could post the cgroup layout of the container by looking at:

cat /proc/<container-init>/cgroup
cat /proc/<container-monitor>/cgroup
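
To locate those PIDs, something like the following should work (a sketch; lxc info only reports a Pid while the container is running, and the monitor process name may differ between versions):

lxc info <container-name> | grep '^Pid:'   # PID of the container's init process
pgrep -af 'lxc monitor'                    # PID(s) of the LXD monitor process(es)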

ganto commented 4 years ago

I guess you're referring to the Ubuntu container, because the Fedora container cannot even be started:

brauner commented 4 years ago

Oh, that looks like an old liblxc version. What's the liblxc version you're using?

ganto commented 4 years ago

Name        : lxc-libs
Version     : 3.2.1
Release     : 0.3.fc31
Architecture: x86_64
Install Date: Fri 14 Feb 2020 01:10:19 AM UTC
Group       : Unspecified
Size        : 1421776
License     : LGPLv2+ and GPLv2
Signature   : RSA/SHA1, Sat 28 Sep 2019 06:01:33 PM UTC, Key ID 97b8ff00f70e1f77
Source RPM  : lxc-3.2.1-0.3.fc31.src.rpm
Build Date  : Sat 28 Sep 2019 06:00:01 PM UTC
Build Host  : copr-builder-734651165.novalocal
URL         : https://linuxcontainers.org/lxc
Summary     : Runtime library files for lxc
Description :
Linux Resource Containers provide process and resource isolation without the
overhead of full virtualization.

The lxc-libs package contains libraries for running lxc applications
brauner commented 4 years ago

Right, you're missing a bunch of patches that are required and will be available in the next release.

ganto commented 4 years ago

Ok, great. Any plans for when this is due?

brauner commented 4 years ago

//Cc @stgraber I think the plan was to release 4.0 around end of March-ish?

stgraber commented 4 years ago

LXD 4.0 should be mid to end of March, LXC/LXCFS should be earlier than that.

mrmateo commented 4 years ago

Closing this out as it is OBE (overtaken by events) now, since these versions are quite a bit older. I'm still having similar issues with 4.0.1 (snap package) on F32, but I will look into that and open tickets in other areas. Thanks for all the input!

mrmateo commented 4 years ago

For anyone in the future who finds this from searching: my issues with LXD 4 (via snapd) on Fedora 32 look to be caused by firewalld. With firewalld turned off, the containers work fine, so I just need to go in and see what's not being let through.
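
A commonly suggested alternative to disabling firewalld entirely is to put the LXD bridge into the trusted zone (a sketch assuming the default bridge name lxdbr0):

sudo firewall-cmd --permanent --zone=trusted --change-interface=lxdbr0
sudo firewall-cmd --reload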