canonical/lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD slowness #9806

Closed MaximMonin closed 2 years ago

MaximMonin commented 2 years ago

Required information

config:
  core.https_address: 10.238.0.24:8443
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.238.0.24:8443
  architectures:
  - x86_64
  - i686
  driver: lxc
  driver_version: 4.0.11
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-94-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: titan7
  server_pid: 358868
  server_version: "4.21"
  storage: dir
  storage_version: "1"
  storage_supported_drivers:
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.41.0
    remote: false
  - name: dir
    version: "1"
    remote: false

Issue description

We are using the lxc exec command to do some container management. The average reply from an lxc exec call is about 120-150 ms on a system with 1 container and no load. We have servers with 300+ LXD containers, where the average reply increases to 150-250 ms. But at some moments lxd hangs and a reply can take 5-10 seconds or even minutes. We have been using LXD 4.21 for the last month and the slowness has increased with the new version. Now we are getting new lxd errors:

* failed to begin transaction: context deadline exceeded
* websocket: bad handshake
* websocket: close 1006 (abnormal closure): unexpected EOF

During the last month we had 2 occurrences when lxd hung up completely and all containers were unavailable (all lxc commands produced no output).

Steps to reproduce

  1. It is hard to reproduce the situation. A CPU burner + I/O burner doesn't lead to the issue. The situation seems more critical on servers with 300+ containers.
  2. When the lxd console is hanging, the server root console is working and the clients' container processes are working as well.

Information to attach

  1. Today the lxc list command did not provide any output, but lxc query -X GET --wait /1.0/containers?recursion=1 was working
  2. Also lxc list --fast was working
stgraber commented 2 years ago

Some of those errors are likely coming up because we're now running with websocket heartbeats enabled, so if the client or server doesn't respond for a while, the connection will be severed, potentially causing one of those bad handshake or unexpected EOF errors. @tomponline may have some ideas for improvements in those areas.

When LXD is misbehaving, there are a few things you could do or grab which may be useful:

When lxc list hangs but ?recursion=1 doesn't, this normally indicates that one or multiple instances are not responding when LXD fetches their status. You could try to iterate through all of your instances, running lxc info against them and see which ones are hanging.
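
A minimal sketch of that iteration (assuming the standard lxc flags and the coreutils timeout command; the 10 s cut-off is an arbitrary choice):

```shell
# Probe every instance with a timeout so the hanging ones stand out.
for name in $(lxc list --format csv -c n); do
    if ! timeout 10 lxc info "$name" >/dev/null 2>&1; then
        echo "not responding: $name"
    fi
done
```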

Another dump that can be useful is the dump of all ongoing goroutines in LXD. You can get that with:
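
The exact command did not survive in this capture; one generally available route (assuming the core.debug_address config key from the pprof_http extension listed above) is to expose the Go pprof endpoints and pull a full goroutine dump:

```shell
# Assumed sketch: enable LXD's debug listener, then fetch all goroutine
# stacks from the standard Go pprof endpoint (debug=2 gives full stacks).
lxc config set core.debug_address 127.0.0.1:8444
curl -s "http://127.0.0.1:8444/debug/pprof/goroutine?debug=2"
```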

krim404 commented 2 years ago

@stgraber I have similar problems, but in a clustered environment.

# lxc cluster list
Error: context deadline exceeded

Also, commands fail more often, like a restart when it's not done within 30 s. Especially in large clusters it sometimes just takes a little bit longer... any chance to fix this?

tomponline commented 2 years ago

@stgraber I have similar problems, but in a clustered environment.

# lxc cluster list
Error: context deadline exceeded

Also, commands fail more often, like a restart when it's not done within 30 s. Especially in large clusters it sometimes just takes a little bit longer... any chance to fix this?

There are no websockets (and thus no heartbeats) involved in the lxc cluster list command, so if you are getting context deadline exceeded there, it's most likely due to locking on the database. This can be caused by I/O saturation on the leader member and/or network packet loss or latency between the members. It can also be caused by the cluster losing quorum and becoming non-operational because too many members have gone offline. I would suggest consulting the log file /var/snap/lxd/common/lxd/logs/lxd.log on each member and seeing if there are any more errors that could help identify the issue.
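
A quick way to do that sweep on each member (log path as above; note that LXD spells the error level eror in its logs, as seen later in this thread):

```shell
# Sketch: pull the recent warning/error lines out of a member's LXD log.
# LXD_LOG is a hypothetical override so the path can be adjusted per install.
LOG="${LXD_LOG:-/var/snap/lxd/common/lxd/logs/lxd.log}"
grep -E 'lvl=(warn|eror)' "$LOG" | tail -n 20
```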

tomponline commented 2 years ago

We have been using LXD 4.21 for the last month and the slowness has increased with the new version. Now we are getting new lxd errors:

* failed to begin transaction: context deadline exceeded

* websocket: bad handshake

* websocket: close 1006 (abnormal closure): unexpected EOF

The first of those errors is to do with the database and not with websockets (or heartbeats); see my post https://github.com/lxc/lxd/issues/9806#issuecomment-1033521647 on how to investigate that further.

The second of those errors suggests that the websocket never successfully connected (as opposed to timing out due to inactivity). This can happen (at least with lxc copy; I'm not sure if it's relevant to lxc exec though) when the target member of a copy fails to connect back to the source member within 10s and the source member gives up. Similarly to the first error, this can suggest that the target member is not performing well or that the network between it and the leader is experiencing packet loss or high latency (an LXD cluster expects to run in a LAN network environment rather than over a WAN).

See https://github.com/lxc/lxd/issues/9861 for potentially similar issue.

The last error could be to do with a close caused by a timeout, although for lxc exec there is no websocket timeout currently implemented (beyond the standard TCP keepalive timeout that has been there for years).

My suspicion is that this is due to database issues caused either by I/O, network or excessive traffic on the leader member. Please can you provide the specific commands along with the errors you get (rather than just the errors, as now), and any associated logs from /var/snap/lxd/common/lxd/logs/lxd.log on the affected members.

tomponline commented 2 years ago

Some of those errors are likely coming because we're now running with websocket heartbeats

@stgraber I checked just now and, for clarity: right now only the events websocket connections have websocket-level heartbeats implemented (although exec sessions will have an associated events listener), and only the dqlite proxy and the migration source websocket have TCP user_timeouts (send) applied to them.

events listener: https://github.com/lxc/lxd/blob/49f90c047ced40b933e167016197b0d6fc9fb660/lxd/events/common.go#L43-L56
migrate source: https://github.com/lxc/lxd/blob/42bd6dcf91aa3fb0ae4a2ece1ff9c0c1fadd6a98/lxd/migrate.go#L185
dqlite proxy: https://github.com/lxc/lxd/blob/42bd6dcf91aa3fb0ae4a2ece1ff9c0c1fadd6a98/lxd/cluster/gateway.go#L1074

MaximMonin commented 2 years ago

My suspicion is that this is due database issues caused either by i/o, network or excessive traffic on the leader member. Please can you provide the specific commands along with the errors you get (rather than just the errors as now), and any associated logs from /var/snap/lxd/common/lxd/logs/lxd.log on the affected members.

  1. We do not use a cluster; all nodes are standalone.
  2. All lxc exec commands are issued from the host node, so there is no networking issue. Our host Go client just issues to the local lxd container: lxc exec management-agent (input data) -> read output data. The connection is made through the LXD unix socket.
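
That round-trip, sketched with the lxc CLI instead of the Go client (management-agent is the instance name from above; /usr/bin/agent and the payload are hypothetical placeholders):

```shell
# Assumed sketch: feed the input data to a process inside the container
# over the exec session's stdin and read its stdout back.
input_data='{"cmd":"status"}'
printf '%s' "$input_data" | lxc exec management-agent -- /usr/bin/agent
```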

All I see in the lxd logs is:

t=2022-02-08T11:35:18+0200 lvl=warn msg="Unexpected read on stdout websocket, killing command" PID=442238 err="websocket: close 1006 (abnormal closure): unexpected EOF" instance=ct4874 interactive=false number=1 project=default
t=2022-02-08T11:37:16+0200 lvl=warn msg="Unexpected read on stdout websocket, killing command" PID=448205 err="websocket: close 1006 (abnormal closure): unexpected EOF" instance=ct4874 interactive=false number=1 project=default
t=2022-02-08T11:44:41+0200 lvl=warn msg="Unexpected read on stdout websocket, killing command" PID=476069 err="websocket: close 1006 (abnormal closure): unexpected EOF" instance=ct4874 interactive=false number=1 project=default
tomponline commented 2 years ago

Unexpected read on stdout websocket

OK this is more useful.

The reason LXD is closing the connection is that it's receiving data on the stdout channel going to the client (which it should not, as that channel should only be used to send stdout data from LXD to the client).

See:

https://github.com/lxc/lxd/blob/5cc57a293988f30ccb74c6ae959c2447efcb1716/lxd/instance_exec.go#L415-L432

This "data" can include the stdout channel being disconnected, which then causes LXD to kill the command and end the exec session to avoid leaving goroutines and processes around.

Can you give more information/examples on your usage that is triggering this scenario please?

tomponline commented 2 years ago

Also what version of the lxc command are you using? lxc --version

MaximMonin commented 2 years ago

Can you give more information/examples on your usage that is triggering this scenario please?

I checked our host Go client log and it shows that there aren't any websocket errors in this time frame. All commands successfully returned results. But I found this kind of error in a different time frame... trying to investigate more...

Also what version of the lxc command are you using? lxc --version

# lxc --version
4.22
tomponline commented 2 years ago

go client log

Ah, so are you using a client different from the lxc CLI client tool?

MaximMonin commented 2 years ago

go client log

Ah, so are you using a client different from the lxc CLI client tool?

Yes, we are using https://github.com/lxc/lxd/tree/master/client version 4.21

OK, investigations:

{"errors":"websocket: bad handshake","timestamp":"2022-02-06T01:38:41.870Z"}...

t=2022-02-06T03:38:36+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:38:39+0200 lvl=eror msg="Error getting disk usage" err="failed to begin transaction: context deadline exceeded" instance=ct3914 instanceType=container project=default
t=2022-02-06T03:38:39+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:38:39+0200 lvl=eror msg="Error getting disk usage" err="failed to begin transaction: context deadline exceeded" instance=ct9293 instanceType=container project=default
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=eror msg="Error loading storage pool" err="failed to begin transaction: context deadline exceeded" instance=ct9285 instanceType=container project=default
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=eror msg="Error loading storage pool" err="failed to begin transaction: context deadline exceeded" instance=ct9354 instanceType=container project=default
t=2022-02-06T03:39:14+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:16+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:33+0200 lvl=eror msg="Error loading storage pool" err="failed to begin transaction: context deadline exceeded" instance=ct9303 instanceType=container project=default

It seems it is the nightly backup use case.

tomponline commented 2 years ago

Yeah, so these are issues accessing the database rather than anything to do with websockets; it looks like the machine may be suffering I/O issues at that time.

stgraber commented 2 years ago

Closing for inactivity, and because all previous posts point towards other I/O-related issues which were causing the database to time out.