canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

lxc hangs #9104

Closed: Freeaqingme closed this issue 3 years ago

Freeaqingme commented 3 years ago

I'm not entirely sure whether this issue pertains to LXC or LXD. I just wanted to enter a VM using lxc exec <vmName> bash. That command hangs consistently and never prints anything. lxc ls also hangs with no output whatsoever (unless run in debug mode):

# lxc --debug ls
DBUG[08-10|10:38:33] Connecting to a local LXD over a Unix socket 
DBUG[08-10|10:38:33] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
DBUG[08-10|10:38:33] Got response struct from LXD 
DBUG[08-10|10:38:33] 
    {
        "config": {
            "core.https_address": "[fd9d:ebbc:537f:ee74::10]:3448",
            "core.trust_password": true
        },
        "api_extensions": [
            "storage_zfs_remove_snapshots",
            "container_host_shutdown_timeout",
            "container_stop_priority",
            "container_syscall_filtering",
            "auth_pki",
            "container_last_used_at",
            "etag",
            "patch",
            "usb_devices",
            "https_allowed_credentials",
            "image_compression_algorithm",
            "directory_manipulation",
            "container_cpu_time",
            "storage_zfs_use_refquota",
            "storage_lvm_mount_options",
            "network",
            "profile_usedby",
            "container_push",
            "container_exec_recording",
            "certificate_update",
            "container_exec_signal_handling",
            "gpu_devices",
            "container_image_properties",
            "migration_progress",
            "id_map",
            "network_firewall_filtering",
            "network_routes",
            "storage",
            "file_delete",
            "file_append",
            "network_dhcp_expiry",
            "storage_lvm_vg_rename",
            "storage_lvm_thinpool_rename",
            "network_vlan",
            "image_create_aliases",
            "container_stateless_copy",
            "container_only_migration",
            "storage_zfs_clone_copy",
            "unix_device_rename",
            "storage_lvm_use_thinpool",
            "storage_rsync_bwlimit",
            "network_vxlan_interface",
            "storage_btrfs_mount_options",
            "entity_description",
            "image_force_refresh",
            "storage_lvm_lv_resizing",
            "id_map_base",
            "file_symlinks",
            "container_push_target",
            "network_vlan_physical",
            "storage_images_delete",
            "container_edit_metadata",
            "container_snapshot_stateful_migration",
            "storage_driver_ceph",
            "storage_ceph_user_name",
            "resource_limits",
            "storage_volatile_initial_source",
            "storage_ceph_force_osd_reuse",
            "storage_block_filesystem_btrfs",
            "resources",
            "kernel_limits",
            "storage_api_volume_rename",
            "macaroon_authentication",
            "network_sriov",
            "console",
            "restrict_devlxd",
            "migration_pre_copy",
            "infiniband",
            "maas_network",
            "devlxd_events",
            "proxy",
            "network_dhcp_gateway",
            "file_get_symlink",
            "network_leases",
            "unix_device_hotplug",
            "storage_api_local_volume_handling",
            "operation_description",
            "clustering",
            "event_lifecycle",
            "storage_api_remote_volume_handling",
            "nvidia_runtime",
            "container_mount_propagation",
            "container_backup",
            "devlxd_images",
            "container_local_cross_pool_handling",
            "proxy_unix",
            "proxy_udp",
            "clustering_join",
            "proxy_tcp_udp_multi_port_handling",
            "network_state",
            "proxy_unix_dac_properties",
            "container_protection_delete",
            "unix_priv_drop",
            "pprof_http",
            "proxy_haproxy_protocol",
            "network_hwaddr",
            "proxy_nat",
            "network_nat_order",
            "container_full",
            "candid_authentication",
            "backup_compression",
            "candid_config",
            "nvidia_runtime_config",
            "storage_api_volume_snapshots",
            "storage_unmapped",
            "projects",
            "candid_config_key",
            "network_vxlan_ttl",
            "container_incremental_copy",
            "usb_optional_vendorid",
            "snapshot_scheduling",
            "snapshot_schedule_aliases",
            "container_copy_project",
            "clustering_server_address",
            "clustering_image_replication",
            "container_protection_shift",
            "snapshot_expiry",
            "container_backup_override_pool",
            "snapshot_expiry_creation",
            "network_leases_location",
            "resources_cpu_socket",
            "resources_gpu",
            "resources_numa",
            "kernel_features",
            "id_map_current",
            "event_location",
            "storage_api_remote_volume_snapshots",
            "network_nat_address",
            "container_nic_routes",
            "rbac",
            "cluster_internal_copy",
            "seccomp_notify",
            "lxc_features",
            "container_nic_ipvlan",
            "network_vlan_sriov",
            "storage_cephfs",
            "container_nic_ipfilter",
            "resources_v2",
            "container_exec_user_group_cwd",
            "container_syscall_intercept",
            "container_disk_shift",
            "storage_shifted",
            "resources_infiniband",
            "daemon_storage",
            "instances",
            "image_types",
            "resources_disk_sata",
            "clustering_roles",
            "images_expiry",
            "resources_network_firmware",
            "backup_compression_algorithm",
            "ceph_data_pool_name",
            "container_syscall_intercept_mount",
            "compression_squashfs",
            "container_raw_mount",
            "container_nic_routed",
            "container_syscall_intercept_mount_fuse",
            "container_disk_ceph",
            "virtual-machines",
            "image_profiles",
            "clustering_architecture",
            "resources_disk_id",
            "storage_lvm_stripes",
            "vm_boot_priority",
            "unix_hotplug_devices",
            "api_filtering",
            "instance_nic_network",
            "clustering_sizing",
            "firewall_driver",
            "projects_limits",
            "container_syscall_intercept_hugetlbfs",
            "limits_hugepages",
            "container_nic_routed_gateway",
            "projects_restrictions",
            "custom_volume_snapshot_expiry",
            "volume_snapshot_scheduling",
            "trust_ca_certificates",
            "snapshot_disk_usage",
            "clustering_edit_roles",
            "container_nic_routed_host_address",
            "container_nic_ipvlan_gateway",
            "resources_usb_pci",
            "resources_cpu_threads_numa",
            "resources_cpu_core_die",
            "api_os",
            "container_nic_routed_host_table",
            "container_nic_ipvlan_host_table",
            "container_nic_ipvlan_mode",
            "resources_system",
            "images_push_relay",
            "network_dns_search",
            "container_nic_routed_limits",
            "instance_nic_bridged_vlan",
            "network_state_bond_bridge",
            "usedby_consistency",
            "custom_block_volumes",
            "clustering_failure_domains",
            "resources_gpu_mdev",
            "console_vga_type",
            "projects_limits_disk",
            "network_type_macvlan",
            "network_type_sriov",
            "container_syscall_intercept_bpf_devices",
            "network_type_ovn",
            "projects_networks",
            "projects_networks_restricted_uplinks",
            "custom_volume_backup",
            "backup_override_name",
            "storage_rsync_compression",
            "network_type_physical",
            "network_ovn_external_subnets",
            "network_ovn_nat",
            "network_ovn_external_routes_remove",
            "tpm_device_type",
            "storage_zfs_clone_copy_rebase",
            "gpu_mdev",
            "resources_pci_iommu",
            "resources_network_usb",
            "resources_disk_address",
            "network_physical_ovn_ingress_mode",
            "network_ovn_dhcp",
            "network_physical_routes_anycast",
            "projects_limits_instances",
            "network_state_vlan",
            "instance_nic_bridged_port_isolation",
            "instance_bulk_state_change",
            "network_gvrp",
            "instance_pool_move",
            "gpu_sriov",
            "pci_device_type",
            "storage_volume_state",
            "network_acl",
            "migration_stateful",
            "disk_state_quota",
            "storage_ceph_features",
            "projects_compression",
            "projects_images_remote_cache_expiry",
            "certificate_project",
            "network_ovn_acl",
            "projects_images_auto_update",
            "projects_restricted_cluster_target",
            "images_default_architecture",
            "network_ovn_acl_defaults",
            "gpu_mig",
            "project_usage",
            "network_bridge_acl",
            "warnings",
            "projects_restricted_backups_and_snapshots",
            "clustering_join_token",
            "clustering_description",
            "server_trusted_proxy",
            "clustering_update_cert",
            "storage_api_project",
            "server_instance_driver_operational",
            "server_supported_storage_drivers",
            "event_lifecycle_requestor_address",
            "resources_gpu_usb",
            "clustering_evacuation"
        ],
        "api_status": "stable",
        "api_version": "1.0",
        "auth": "trusted",
        "public": false,
        "auth_methods": [
            "tls"
        ],
        "environment": {
            "addresses": [
                "[fd9d:ebbc:537f:ee74::10]:3448"
            ],
            "architectures": [
                "x86_64",
                "i686"
            ],
            "certificate": "-----BEGIN CERTIFICATE-----\nMIIB/jCCAYSgAwIBAgIRAMG8pIQ4vFBVtj1xKrqU3rEwCgYIKoZIzj0EAwMwMjEc\nMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzESMBAGA1UEAwwJcm9vdEBocHYx\nMB4XDTIxMDcxNTIwMDEzMloXDTMxMDcxMzIwMDEzMlowMjEcMBoGA1UEChMTbGlu\ndXhjb250YWluZXJzLm9yZzESMBAGA1UEAwwJcm9vdEBocHYxMHYwEAYHKoZIzj0C\nAQYFK4EEACIDYgAEbS1ZcE/iCAVy5YAHtT7VduNewXg3IKS9lCJMrZwkLgK+urDE\nrZJqlWRSga2nlFgwAoUktAZi7PpQDT+mFoAwPrzW9WHR1fOw7Q6Q8e3WKEcjw6GQ\n/Ki7jjdPmUB8e+Fvo14wXDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYIKwYB\nBQUHAwEwDAYDVR0TAQH/BAIwADAnBgNVHREEIDAeggRocHYxhwR/AAABhxAAAAAA\nAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQCsDOkpffalmACTCz6rgFrL\nzhMyI3c7AuoIbbxFJJh6sUDFpgHAvAQLgXCGT0jXmScCMFD7q95B053HE0zmW5Qs\nuyZVNOzNu7es39s8K+tqRBQ5qh/SjIYIhBz6VqW8m82vBA==\n-----END CERTIFICATE-----\n",
            "certificate_fingerprint": "d15d40ba61a43cb7d3f17e694858c97c7c28161ada22383d6a3f15f4b317ba7d",
            "driver": "lxc | qemu",
            "driver_version": "4.0.10 | 5.2.0",
            "firewall": "nftables",
            "kernel": "Linux",
            "kernel_architecture": "x86_64",
            "kernel_features": {
                "netnsid_getifaddrs": "true",
                "seccomp_listener": "true",
                "seccomp_listener_continue": "true",
                "shiftfs": "false",
                "uevent_injection": "true",
                "unpriv_fscaps": "true"
            },
            "kernel_version": "5.4.0-77-generic",
            "lxc_features": {
                "cgroup2": "true",
                "devpts_fd": "true",
                "idmapped_mounts_v2": "true",
                "mount_injection_file": "true",
                "network_gateway_device_route": "true",
                "network_ipvlan": "true",
                "network_l2proxy": "true",
                "network_phys_macvlan_mtu": "true",
                "network_veth_router": "true",
                "pidfd": "true",
                "seccomp_allow_deny_syntax": "true",
                "seccomp_notify": "true",
                "seccomp_proxy_send_notify_fd": "true"
            },
            "os_name": "Ubuntu",
            "os_version": "20.04",
            "project": "default",
            "server": "lxd",
            "server_clustered": false,
            "server_name": "hpv1",
            "server_pid": 54818,
            "server_version": "4.17",
            "storage": "zfs",
            "storage_version": "0.8.3-1ubuntu12.9",
            "storage_supported_drivers": [
                {
                    "Name": "zfs",
                    "Version": "0.8.3-1ubuntu12.9",
                    "Remote": false
                },
                {
                    "Name": "ceph",
                    "Version": "15.2.13",
                    "Remote": true
                },
                {
                    "Name": "btrfs",
                    "Version": "5.4.1",
                    "Remote": false
                },
                {
                    "Name": "cephfs",
                    "Version": "15.2.13",
                    "Remote": true
                },
                {
                    "Name": "dir",
                    "Version": "1",
                    "Remote": false
                },
                {
                    "Name": "lvm",
                    "Version": "2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.41.0",
                    "Remote": false
                }
            ]
        }
    } 
DBUG[08-10|10:38:33] Sending request to LXD                   method=GET url="http://unix.socket/1.0/instances?recursion=2" etag=

^C

130 root@hpv1 /home/vagrant
# date
Tue 10 Aug 2021 10:44:40 AM UTC

I think a command like this could/should apply a timeout to the unix.socket call, though that obviously wouldn't fix the underlying problem. How should I go about fixing it and making things responsive again? It's a test environment, so I could easily just reboot something, but as this lab is set up as a model for a future production environment, I'd like to fix it the same way I would in production.
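
A minimal workaround sketch, assuming coreutils timeout(1) is available: until the client has a built-in timeout, each lxc invocation can be bounded externally so a script cannot block forever on a wedged socket.

# Abort the query if the client has not returned within 10 seconds.
timeout 10 lxc ls || echo "lxc ls did not answer within 10s"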

Required information

0 root@hpv1 /home/vagrant
# uname -a
Linux hpv1 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

0 root@hpv1 /home/vagrant
# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

0 root@hpv1 /home/vagrant
# lxc info
config:
  core.https_address: '[fd9d:ebbc:537f:ee74::10]:3448'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - '[fd9d:ebbc:537f:ee74::10]:3448'
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB/jCCAYSgAwIBAgIRAMG8pIQ4vFBVtj1xKrqU3rEwCgYIKoZIzj0EAwMwMjEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzESMBAGA1UEAwwJcm9vdEBocHYx
    MB4XDTIxMDcxNTIwMDEzMloXDTMxMDcxMzIwMDEzMlowMjEcMBoGA1UEChMTbGlu
    dXhjb250YWluZXJzLm9yZzESMBAGA1UEAwwJcm9vdEBocHYxMHYwEAYHKoZIzj0C
    AQYFK4EEACIDYgAEbS1ZcE/iCAVy5YAHtT7VduNewXg3IKS9lCJMrZwkLgK+urDE
    rZJqlWRSga2nlFgwAoUktAZi7PpQDT+mFoAwPrzW9WHR1fOw7Q6Q8e3WKEcjw6GQ
    /Ki7jjdPmUB8e+Fvo14wXDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYIKwYB
    BQUHAwEwDAYDVR0TAQH/BAIwADAnBgNVHREEIDAeggRocHYxhwR/AAABhxAAAAAA
    AAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQCsDOkpffalmACTCz6rgFrL
    zhMyI3c7AuoIbbxFJJh6sUDFpgHAvAQLgXCGT0jXmScCMFD7q95B053HE0zmW5Qs
    uyZVNOzNu7es39s8K+tqRBQ5qh/SjIYIhBz6VqW8m82vBA==
    -----END CERTIFICATE-----
  certificate_fingerprint: d15d40ba61a43cb7d3f17e694858c97c7c28161ada22383d6a3f15f4b317ba7d
  driver: lxc | qemu
  driver_version: 4.0.10 | 5.2.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-77-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: hpv1
  server_pid: 54818
  server_version: "4.17"
  storage: zfs
  storage_version: 0.8.3-1ubuntu12.9
  storage_supported_drivers:
  - name: zfs
    version: 0.8.3-1ubuntu12.9
    remote: false
  - name: ceph
    version: 15.2.13
    remote: true
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.13
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.41.0
    remote: false


Information to attach

lxd.log:

t=2021-08-10T08:05:40+0000 lvl=info msg="Pruning resolved warnings" 
t=2021-08-10T08:05:40+0000 lvl=info msg="Done updating instance types" 
t=2021-08-10T08:05:40+0000 lvl=info msg="Done updating images" 
t=2021-08-10T08:05:40+0000 lvl=info msg="Done pruning resolved warnings" 
t=2021-08-10T08:18:54+0000 lvl=warn msg="Detected poll(POLLNVAL) event: exiting." 
t=2021-08-10T09:05:40+0000 lvl=info msg="Updating images" 
t=2021-08-10T09:05:40+0000 lvl=info msg="Pruning expired instance backups" 
t=2021-08-10T09:05:40+0000 lvl=info msg="Done updating images" 
t=2021-08-10T09:05:40+0000 lvl=info msg="Done pruning expired instance backups" 
t=2021-08-10T10:05:40+0000 lvl=info msg="Updating images" 
t=2021-08-10T10:05:40+0000 lvl=info msg="Pruning expired instance backups" 
t=2021-08-10T10:05:40+0000 lvl=info msg="Done updating images" 
t=2021-08-10T10:05:40+0000 lvl=info msg="Done pruning expired instance backups" 
Freeaqingme commented 3 years ago

I figured I might be able to attach a debugger to see what's going on. Unfortunately, gdb doesn't know what to do with Go, and Delve complains that it "could not open debug info" (presumably because the binary lacks DWARF debug data).
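
For future reference, a hedged sketch of two debugger-free ways to get goroutine stacks out of a stuck Go daemon; the core.debug_address key and the port below are assumptions to verify against the LXD documentation, and the PID is the server_pid from this report:

# SIGQUIT makes the Go runtime dump all goroutine stacks to stderr
# (journald for the snap) -- note the default runtime then exits the process.
kill -QUIT 54818

# Alternatively, assuming core.debug_address enables the HTTP listener behind
# the pprof_http API extension (it must be configured before the hang):
lxc config set core.debug_address 127.0.0.1:8444
curl "http://127.0.0.1:8444/debug/pprof/goroutine?debug=2"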

Having said that, for one reason or another* everything just seems to have resolved itself.

I realize this issue reads a little like a support request, but the fact that the socket can become unresponsive, and render anything that connects to it unresponsive as well, is a bit of a bug in itself. It may be quite difficult to find out what was causing it, though.


* There was one thing that I changed. I have this script running every second to transfer some files to my VM:

#!/bin/bash

vmName=puppet1

# Bail out if the VM is not running (lxc info fails for stopped/missing VMs).
if ! lxc info "$vmName" >/dev/null 2>&1; then
  exit 0
fi

codePath=/etc/puppetlabs/code/environments/dev

# Epoch timestamp (whole seconds) of the most recently modified file.
lastModifiedTimestamp=$(find "$codePath" -type f -printf '%T@ %p\n' | \
                              sort -n | tail -1 | cut -d'.' -f1)
[ -n "$lastModifiedTimestamp" ] || exit 0
lastFewSecondsTimestamp=$(( $(date +%s) - 5 ))

if (( lastModifiedTimestamp > lastFewSecondsTimestamp )); then
  echo "Files modified in last 5 seconds, syncing..."
  tar zcf - -C "$codePath" . | lxc exec "$vmName" -- tar zxf - -C "$codePath" >/dev/null
  echo "Syncing complete!"
fi

The issue seemed to resolve itself right when I put an 'exit 0' statement on line 2 (disabling the script entirely). Perhaps that gives a hint as to where the issue originates?
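
A hedged hardening sketch for a job like this, not from the original thread: because the script fires every second, a single hung lxc call would let timer invocations pile up behind each other. flock(1) can serialize the runs and timeout(1) can bound the hang-prone calls:

#!/bin/bash
# Hold an exclusive lock for the lifetime of this run; if a previous
# invocation is still stuck inside lxc, skip this tick instead of stacking up.
exec 9>/run/syncPuppetCode.lock
flock -n 9 || exit 0

# Bound the calls that can block on the LXD socket.
timeout 10 lxc info puppet1 >/dev/null 2>&1 || exit 0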

stgraber commented 3 years ago

Can you show the output of dmesg and ps fauxww?

Per the output above, LXD itself is responding, but then gets locked up when trying to access the instances, so this smells like a kernel or hardware issue.
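
To probe that distinction directly, a hedged sketch: the LXD API can be queried over the unix socket with curl, bypassing the lxc client entirely; the snap socket path below is an assumption for this setup.

# GET /1.0 answered in the debug output above; the instances query is the one
# that hung, which points at instance state collection rather than the socket.
curl -s --unix-socket /var/snap/lxd/common/lxd/unix.socket http://unix.socket/1.0
curl -s --unix-socket /var/snap/lxd/common/lxd/unix.socket \
     "http://unix.socket/1.0/instances?recursion=2"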

Freeaqingme commented 3 years ago

@stgraber dmesg is, for that time period, empty. Just to give an idea:

# dmesg -H -T | tail
[Tue Aug 10 08:12:45 2021] IPv4: martian source 10.0.2.15 from 91.189.92.40, on dev eth0
[Tue Aug 10 08:12:45 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 08:12:58 2021] IPv4: martian source 10.0.2.15 from 91.189.92.40, on dev eth0
[Tue Aug 10 08:12:58 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 08:13:11 2021] IPv4: martian source 10.0.2.15 from 91.189.92.40, on dev eth0
[Tue Aug 10 08:13:11 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 08:56:25 2021] IPv4: martian source 10.0.2.15 from 192.168.1.1, on dev eth0
[Tue Aug 10 08:56:25 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 11:35:02 2021] device brVM entered promiscuous mode
[Tue Aug 10 11:35:04 2021] device brVM left promiscuous mode

# dmesg -H -T | grep -i 'lxc\|lxd'

#

I didn't run ps fauxww while the problem was going on. I did, however, do this:

# systemctl status
● hpv1
    State: degraded
     Jobs: 1 queued
   Failed: 1 units
    Since: Tue 2021-08-10 08:03:14 UTC; 3h 23min ago
   CGroup: /
           ├─ 1717 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
           ├─ 2094 /snap/lxd/20987/bin/virtiofsd --socket-path=/var/snap/lxd/common/lxd/logs/puppet1-dev-formicidae>
           ├─ 2201 /snap/lxd/20987/bin/qemu-system-x86_64 -S -name puppet1-dev-formicidae-holdings -uuid a4b84c16-6>
           ├─ 2206 /snap/lxd/20987/bin/virtiofsd --socket-path=/var/snap/lxd/common/lxd/logs/puppet1-dev-formicidae>
           ├─54259 /bin/sh /snap/lxd/21260/commands/daemon.start
           ├─54818 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
           ├─init.scope 
           │ └─1 /sbin/init
           └─system.slice 
             ├─irqbalance.service 
             │ └─1061 /usr/sbin/irqbalance --foreground
             ├─haveged.service 
             │ └─955 /usr/sbin/haveged --Foreground --verbose=1 -w 1024
             ├─systemd-networkd.service 
             │ └─581 /lib/systemd/systemd-networkd
             ├─systemd-udevd.service 
             │ ├─   381 /lib/systemd/systemd-udevd
             │ ├─255232 /lib/systemd/systemd-udevd
             │ ├─255233 /lib/systemd/systemd-udevd
             │ ├─255240 /lib/systemd/systemd-udevd
             │ ├─255241 /lib/systemd/systemd-udevd
             │ ├─255242 /lib/systemd/systemd-udevd
             │ ├─255243 /lib/systemd/systemd-udevd
             │ ├─255244 /lib/systemd/systemd-udevd
             │ ├─255245 /lib/systemd/systemd-udevd
             │ ├─255246 /lib/systemd/systemd-udevd
             │ ├─255247 /lib/systemd/systemd-udevd
             │ ├─255248 /lib/systemd/systemd-udevd
             │ ├─255249 /lib/systemd/systemd-udevd
             │ ├─255250 /lib/systemd/systemd-udevd
             │ ├─255251 /lib/systemd/systemd-udevd
             │ ├─255252 /lib/systemd/systemd-udevd
             │ ├─255253 /lib/systemd/systemd-udevd
             │ ├─255254 /lib/systemd/systemd-udevd
             │ ├─255255 /lib/systemd/systemd-udevd
             │ ├─255262 /lib/systemd/systemd-udevd
             │ └─255300 /lib/systemd/systemd-udevd
             ├─cron.service 
             │ └─1396 /usr/sbin/cron -f
             ├─unbound.service 
             │ └─1557 /usr/sbin/unbound -d
             ├─polkit.service 
             │ └─64551 /usr/lib/policykit-1/polkitd --no-debug
             ├─auditd.service 
             │ └─916 /sbin/auditd
             ├─systemd-journald.service 
             │ └─357 /lib/systemd/systemd-journald
             ├─ssh.service 
             │ ├─  1431 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
             │ ├─206609 sshd: vagrant [priv]
             │ ├─206798 sshd: vagrant@pts/0
             │ ├─206799 -bash
             │ ├─227635 doas -s
             │ ├─227636 /bin/bash
             │ ├─234356 sshd: vagrant [priv]
             │ ├─234423 sshd: vagrant@pts/3
             │ ├─234424 -bash
             │ ├─234472 doas -s
             │ ├─234473 /bin/bash
             │ ├─281819 systemctl status
             │ └─281820 pager
             ├─snapd.service 
             │ └─5936 /usr/lib/snapd/snapd
             ├─sync-puppet-code.service 
             │ ├─281779 /bin/bash /usr/local/bin/syncPuppetCode.sh
             │ ├─281814 /bin/bash /usr/local/bin/syncPuppetCode.sh
             │ ├─281815 find /etc/puppetlabs/code/environments/dev -type f -printf %T@ %p\n
             │ ├─281816 sort -n
             │ ├─281817 tail -1
             │ └─281818 cut -d. -f1
             ├─rsyslog.service 
             │ └─1063 /usr/sbin/rsyslogd -n -iNONE
             ├─system-openvpn.slice 
             │ └─openvpn@ipv6.service 
             │   └─1374 /usr/sbin/openvpn --daemon ovpn-ipv6 --status /run/openvpn/ipv6.status 10 --cd /etc/openvpn>
             ├─system-arpwatch.slice 
             │ ├─arpwatch@lo.service 
             │ │ └─1433 /usr/sbin/arpwatch -u arpwatch -i lo -f lo.dat -N -p -F
             │ ├─arpwatch@eth0.service 
             │ │ └─1434 /usr/sbin/arpwatch -u arpwatch -i eth0 -f eth0.dat -N -p -F
             │ └─arpwatch@brVM.service 
             │   └─63507 /usr/sbin/arpwatch -u arpwatch -i brVM -f brVM.dat -N -p -F
             ├─snap.lxd.lxc.9d69383b-ff43-4f2a-9690-a703b3c4d27d.scope 
             │ └─227644 lxc exec puppet1-dev-formicidae-holdings bash
             ├─zfs-zed.service 
             │ └─1071 /usr/sbin/zed -F
             ├─ntp.service 
             │ └─1421 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 113:123
             ├─dbus.service 
             │ └─1056 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation -->
             ├─system-getty.slice 
             │ └─getty@tty1.service 
             │   └─1443 /sbin/agetty -o -p -- \u --noclear tty1 linux
             ├─puppet.service 
             │ └─1375 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent --no-daemonize
             ├─ifup@eth0.service 
             │ └─993 /sbin/dhclient -1 -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases ->
             ├─virtualbox-guest-utils.service 
             │ └─1192 /usr/sbin/VBoxService
             └─systemd-logind.service 
               └─1068 /lib/systemd/systemd-logind

Re the label 'incomplete': I think this issue is hardly actionable now, other than perhaps adding some error handling or a timeout on the socket call. But perhaps we can figure out what caused it :)

stgraber commented 3 years ago

Unfortunately the systemctl output doesn't provide me with the information I'd have liked to see (process state). I agree that given the issue went away, there's not a whole lot we can do about it at this point :)

Based on your log output, LXD itself was responding to the API calls, so it wasn't a case of the socket not responding or the client not giving you a connection timeout. Instead the API would get stuck when internally accessing instance state or when attempting to execute a command.

This suggests that one of your instances was very, very stuck, as internally we have a 10s per-instance timeout which would normally get us going again. The only case where we can't move on is if we get ourselves stuck in the kernel. This is usually visible in a ps fauxww output as a bunch of processes in uninterruptible I/O state (D). If LXD somehow gets stuck reading from such a process, this kind of lockup can happen.
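
A quick way to spot that state, sketched with standard procps tools (the wchan column shows the kernel function each task is blocked in):

# List tasks in uninterruptible sleep (state D), keeping the header row.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'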

None of that is normal, though; it's usually an indication of something being quite off with the kernel, or of the system being very badly bottlenecked by something. Most often it's I/O starvation or running out of memory, though there can also be purely software reasons for it (like dealing with a fork bomb or misbehaving software that acts much like one).

I'm going to close the issue for now. But if this shows up again, please comment with ps fauxww and dmesg at the time this hits. For good measure, also capture df -h, free -m and uptime to get a bit of an idea of the state of the system.
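
A minimal capture sketch for that eventuality, using only the standard tools named above (the output directory name is illustrative):

#!/bin/bash
# Snapshot the requested diagnostics into a timestamped directory.
out=/tmp/lxd-hang-$(date +%Y%m%d-%H%M%S)
mkdir -p "$out"
ps fauxww > "$out/ps.txt"
dmesg     > "$out/dmesg.txt"
df -h     > "$out/df.txt"
free -m   > "$out/free.txt"
uptime    > "$out/uptime.txt"
echo "Diagnostics written to $out"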