canonical/lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD becomes unresponsive waiting for information from the QMP socket #11892

Closed. paride closed this issue 1 year ago.

paride commented 1 year ago

Required information

Issue description

On a machine making heavy use of LXD, it sometimes happens that many LXD commands (e.g. lxc list or lxc info <some-vm-instance>) become unresponsive. With guidance from @tomponline I enabled pprof, and apparently LXD gets stuck waiting for information from the QMP socket:

goroutine 2250137 [sync.Mutex.Lock, 18 minutes]:
sync.runtime_SemacquireMutex(0xc000f4d3e0?, 0x60?, 0x60?)
    /snap/go/current/src/runtime/sema.go:77 +0x26
sync.(*Mutex).lockSlow(0x337e4f8)
    /snap/go/current/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
    /snap/go/current/src/sync/mutex.go:90
github.com/lxc/lxd/lxd/instance/drivers/qmp.Connect({0xc0011e2aa0, 0x48}, {0x1d998f9, 0x13}, 0xc000f4d3e0)
    /build/lxd/parts/lxd/src/lxd/instance/drivers/qmp/monitor.go:216 +0x9e
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).statusCode(0xc00176f680)
    /build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7345 +0x14b
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).Render(0xc00176f680, {0x0, 0x0, 0xc000a29001?})
    /build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7007 +0x3f2
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).RenderFull(0xc00176f680, {0xc000167a40?, 0x13919?, 0xc0014586b0?})
    /build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7047 +0x4a
main.doInstancesGet.func5()
    /build/lxd/parts/lxd/src/lxd/instances_get.go:460 +0x25f
created by main.doInstancesGet
    /build/lxd/parts/lxd/src/lxd/instances_get.go:437 +0x1085

Full pprof data: https://paste.ubuntu.com/p/3FwhpRtH7H/
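(For reference, a rough sketch of how such a goroutine dump can be pulled once pprof is enabled through LXD's core.debug_address setting; this assumes the debug server exposes the standard net/http/pprof endpoints, and the address matches the one in the lxd info output further down.)

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Sketch only: fetch a full goroutine dump (debug=2 gives the same stack
// format as the trace above) from the debug server enabled via
// core.debug_address, assumed here to be 127.0.0.1:8444.
func main() {
	resp, err := http.Get("http://127.0.0.1:8444/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```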

Killing the lxd process (kill -9 <pid of lxd>) and waiting for snapd to restart it "fixes" the issue, without manually killing any qemu process.

Steps to reproduce

Not easy to reproduce. It happens randomly every few days on a bare-metal machine running Jammy that uses LXD (snap) to start and destroy several containers and VMs per day.

Information to attach

lxd info

``` config: core.debug_address: 127.0.0.1:8444 api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - 
virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - 
storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls environment: addresses: [] architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- MIICBjCCAYygAwIBAgIQLwn5+hm0ujz9RdvVpWphSzAKBggqhkjOPQQDAzA1MRww GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRUwEwYDVQQDDAxyb290QGtlY2xl b24wHhcNMjIwNTAyMDgwNDE0WhcNMzIwNDI5MDgwNDE0WjA1MRwwGgYDVQQKExNs aW51eGNvbnRhaW5lcnMub3JnMRUwEwYDVQQDDAxyb290QGtlY2xlb24wdjAQBgcq hkjOPQIBBgUrgQQAIgNiAASMW37FAAD4O0BTX4M8kFVEkaPqTETSnc4TU/UKzA/y uQavs6X3dxwG/js+saymVdBNO/nzpodGMA0U5q3kBXJwMlEMnTwxp2DHw26N1rIh 3a9I4GuMnhJ5ZILzAoJGU+ujYTBfMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAK BggrBgEFBQcDATAMBgNVHRMBAf8EAjAAMCoGA1UdEQQjMCGCB2tlY2xlb26HBH8A AAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIxAKienzNCcyDU h8PV7qMoOgM8akGQNBst8LC2aLIxAGBdQLxAxcrg2I3HM4Uu2Z6zPQIwS24tnpUB FCBCmOmJN+Izf+mGEizMK0sE7qZ8nK/lSaBtESr4jSOe0hpCk6qjcT4A -----END CERTIFICATE----- certificate_fingerprint: 9ee5af9f4a1e60c6cfae0614ee435a6da9c6bb155f313f51ca96a5b240d9bde7 driver: lxc | qemu driver_version: 5.0.2 | 8.0.0 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" shiftfs: "false" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 5.15.0-75-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: kecleon server_pid: 2604 server_version: "5.14" storage: zfs storage_version: 2.1.5-1ubuntu6~22.04.1 storage_supported_drivers: - name: cephfs version: 17.2.5 remote: true - name: cephobject version: 17.2.5 remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0 remote: false - name: zfs version: 2.1.5-1ubuntu6~22.04.1 remote: false - name: btrfs version: 5.16.2 remote: false - name: ceph version: 17.2.5 remote: true ```
tomponline commented 1 year ago

Full pprof trace: https://paste.ubuntu.com/p/7WPjrfW4KX/

tomponline commented 1 year ago

There is a goroutine blocked in the digitalocean/go-qemu qmp package:

goroutine 734540 [IO wait]:
internal/poll.runtime_pollWait(0x7fed2479e3f0, 0x72)
    /snap/go/current/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000c24c00?, 0xc000ca00b3?, 0x0)
    /snap/go/current/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /snap/go/current/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000c24c00, {0xc000ca00b3, 0xf4d, 0xf4d})
    /snap/go/current/src/internal/poll/fd_unix.go:167 +0x299
net.(*netFD).Read(0xc000c24c00, {0xc000ca00b3?, 0x0?, 0x0?})
    /snap/go/current/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc00058e6f8, {0xc000ca00b3?, 0xc001addda8?, 0x411b31?})
    /snap/go/current/src/net/net.go:183 +0x45
bufio.(*Scanner).Scan(0xc001addf08)
    /snap/go/current/src/bufio/scan.go:214 +0x876
github.com/digitalocean/go-qemu/qmp.(*SocketMonitor).listen(0xc000cb2280, {0x20e1880?, 0xc00058e6f8?}, 0xc000c7eb40, 0xc000c7eba0)
    /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/digitalocean/go-qemu@v0.0.0-20221209210016-f035778c97f7/qmp/socket.go:175 +0x10b
created by github.com/digitalocean/go-qemu/qmp.(*SocketMonitor).Connect
    /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/digitalocean/go-qemu@v0.0.0-20221209210016-f035778c97f7/qmp/socket.go:151 +0x36a
tomponline commented 1 year ago

OK so this could be caused by a bug in the digitalocean/go-qemu package.

So LXD calls qmp.NewSocketMonitor(), which succeeds: https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/qmp/monitor.go#L227

Then LXD tries a Connect() call in a different goroutine and waits for a response: https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/qmp/monitor.go#L232-L236

If Connect() doesn't return within 5s, LXD calls Disconnect(): https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/qmp/monitor.go#L244-L247
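Roughly, the pattern looks like this (a simplified sketch of the behaviour described above, not the actual LXD code; connectWithTimeout is a made-up name):

```go
package qmpsketch

import (
	"errors"
	"time"

	"github.com/digitalocean/go-qemu/qmp"
)

// connectWithTimeout sketches the flow described above: Connect() runs in
// its own goroutine, and if it has not returned within 5 seconds the caller
// gives up and calls Disconnect() on the half-connected monitor.
func connectWithTimeout(mon *qmp.SocketMonitor) error {
	chError := make(chan error, 1)

	go func() {
		// May block indefinitely if the QMP socket never completes the handshake.
		chError <- mon.Connect()
	}()

	select {
	case err := <-chError:
		return err
	case <-time.After(5 * time.Second):
		// Give up; this Disconnect() call is where the long-blocked
		// goroutine shown below is stuck.
		_ = mon.Disconnect()
		return errors.New("timed out connecting to the QMP monitor")
	}
}
```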

And I can see a goroutine blocked in Disconnect() for a long time:

goroutine 2524696 [chan receive (nil chan), 5400 minutes]:
github.com/digitalocean/go-qemu/qmp.(*SocketMonitor).Disconnect(...)
    /build/lxd/parts/lxd/src/.go/pkg/mod/github.com/digitalocean/go-qemu@v0.0.0-20221209210016-f035778c97f7/qmp/socket.go:107
github.com/lxc/lxd/lxd/instance/drivers/qmp.Connect({0xc0005b4730, 0x48}, {0x1d998f9, 0x13}, 0xc000f3f320)
    /build/lxd/parts/lxd/src/lxd/instance/drivers/qmp/monitor.go:245 +0x64e
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).statusCode(0xc000d4a600)
    /build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7345 +0x14b
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).Render(0xc000d4a600, {0x0, 0x0, 0x1f?})
    /build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7007 +0x3f2
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).RenderFull(0xc000d4a600, {0xc001123590?, 0x137eb?, 0x20?})
    /build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7047 +0x4a
main.doInstancesGet.func5()
    /build/lxd/parts/lxd/src/lxd/instances_get.go:460 +0x25f
created by main.doInstancesGet
    /build/lxd/parts/lxd/src/lxd/instances_get.go:437 +0x1085

Looking at the go-qemu package I can see a potential bug in the way Connect() and Disconnect() work:

We can see that Disconnect() is blocked on this line:

https://github.com/digitalocean/go-qemu/blob/master/qmp/socket.go#L107

And we can see that Connect() only populates mon.stream if it gets a valid response from the QMP socket.

https://github.com/digitalocean/go-qemu/blob/master/qmp/socket.go#L154

So calling Disconnect() before Connect() has initialised mon.stream will block forever (as reading from a nil channel blocks).
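A minimal standalone demonstration of that Go behaviour (not go-qemu code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var stream chan []byte // nil: declared but never initialised, like mon.stream before Connect() completes

	done := make(chan struct{})
	go func() {
		<-stream // a receive on a nil channel blocks forever
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("unreachable")
	case <-time.After(time.Second):
		fmt.Println("receiver is still blocked after 1s, as expected")
	}
}
```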

So I think we need to make a contribution upstream to ensure that mon.stream is initialised in NewSocketMonitor and closed if Connect() fails.
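Something along these lines might do it, sketched here with a stripped-down stand-in type rather than the real SocketMonitor (handshake and event reading elided):

```go
package qmpsketch

import "net"

// monitor is a stripped-down stand-in for go-qemu's SocketMonitor, used only
// to sketch the proposed fix; it is not the real type.
type monitor struct {
	c      net.Conn
	stream chan []byte
}

// newMonitor allocates stream up front, so a Disconnect() that races a slow
// or failed Connect() never receives from a nil channel.
func newMonitor(c net.Conn) *monitor {
	return &monitor{c: c, stream: make(chan []byte)}
}

func (m *monitor) Connect() error {
	if err := m.handshake(); err != nil {
		// Close the channel on failure so any Disconnect() draining it returns.
		close(m.stream)
		return err
	}
	go m.listen()
	return nil
}

// Disconnect closes the socket and then drains the event stream until it is
// closed; with the channel always initialised, it can no longer block on a
// nil channel.
func (m *monitor) Disconnect() error {
	err := m.c.Close()
	for range m.stream {
	}
	return err
}

// handshake would perform the QMP capabilities negotiation; elided here.
func (m *monitor) handshake() error {
	return nil
}

// listen would read QMP events from m.c; elided here. It must close m.stream
// when it stops so that Disconnect() can return.
func (m *monitor) listen() {
	defer close(m.stream)
}
```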

tomponline commented 1 year ago

I also think it would be good to add a ConnectCtx() function to that package so we can pass a connect cancellation context to it.
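For illustration, roughly the shape such a ConnectCtx() could take, sketched from outside the package (a real implementation inside go-qemu could also tear down the socket on cancellation rather than leaving the Connect() goroutine blocked):

```go
package qmpsketch

import (
	"context"

	"github.com/digitalocean/go-qemu/qmp"
)

// connectCtx sketches the suggested API: return as soon as Connect()
// finishes or the context is cancelled.
func connectCtx(ctx context.Context, mon *qmp.SocketMonitor) error {
	errCh := make(chan error, 1)
	go func() { errCh <- mon.Connect() }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```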

tomponline commented 1 year ago

Waiting on upstream to merge https://github.com/digitalocean/go-qemu/pull/201