canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

forkproxy stuck after container shutdown #7687

Closed: lenny87 closed this issue 4 years ago

lenny87 commented 4 years ago

Required information

Issue description

There is some race condition that causes the forkproxy TCP proxy process not to exit after the container is stopped. We're using it in the container config like this:

        bind: container
        connect: tcp::10606
        listen: tcp::10606
        type: proxy

After container shutdown, forkproxy keeps running indefinitely, so we're unable to delete the container without manually killing the forkproxy process (the ZFS backing dataset stays busy).
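For reference, a device like this is typically attached with lxc config device add; a minimal sketch matching the config above, where the instance name "mycontainer" and device name "tcp10606" are placeholders:

    # Attach a TCP proxy device equivalent to the config shown above
    # ("mycontainer" and "tcp10606" are placeholder names)
    lxc config device add mycontainer tcp10606 proxy \
        listen=tcp::10606 connect=tcp::10606 bind=container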

If someone can tell me where to get more debug information, I can provide it.

stgraber commented 4 years ago

Can you upgrade to 4.0.2? I'm not sure we have a fix for that in there, but there are hundreds of fixes between 4.0.1 and 4.0.2, so let's rule that out first.
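If you're on the snap package, that upgrade would look something like this (channel name assumed):

    # Move the LXD snap to the 4.0 stable channel
    snap refresh lxd --channel=4.0/stable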

lenny87 commented 4 years ago

Hi, I upgraded to 4.0.2 and the problem persists:


config:
  core.https_address: :8443
  core.trust_password: true
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- resources_system
- usedby_consistency
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - :8443
  architectures:
  - x86_64
  - i686
  certificate: <hidden>
  certificate_fingerprint:  <hidden>
  driver: lxc
  driver_version: 4.0.2
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.19-vsh0zfs083
  lxc_features:
    cgroup2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "false"
    seccomp_notify: "true"
  os_name: Ubuntu
  os_version: "18.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: znode-integration-102
  server_pid: 4015
  server_version: 4.0.2
  storage: zfs
  storage_version: 0.8.3-1ubuntu3~18.04.york1.0
stgraber commented 4 years ago

Tried to reproduce it here without much success.

The logic in LXD is to track down the PID of forkproxy from /var/snap/lxd/common/lxd/devices/NAME/proxy.DEVICE and kill that. Can you check that you do have that file and that the PID in it is correct?
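Something along these lines should show whether the recorded PID matches a running process; a minimal sketch, assuming the snap path above, with NAME and DEVICE as placeholders for your container and proxy device names:

    # Show the PID that LXD recorded for the proxy device
    cat /var/snap/lxd/common/lxd/devices/NAME/proxy.DEVICE
    # Compare it against the forkproxy processes actually running
    ps aux | grep [f]orkproxy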

lenny87 commented 4 years ago

It happens only occasionally on our integration test platform (roughly one in 100 runs). I will try to check whether the PID in the file is right when it happens again.

Thanks

stgraber commented 4 years ago

@lenny87

Any luck?

stgraber commented 4 years ago

(Closing for now due to no update, feel free to comment in here still and we'll re-open as needed)

sfPlayer1 commented 3 years ago

This happened for us with LXD 4.0.4; unfortunately, I didn't notice the issue before fixing it manually. The PID is fine after the restart, but I have no idea whether it was correct before.

strk commented 3 years ago

It just happened to me with LXD 4.8, in a nested container setup. The outermost container had a stuck forkproxy even though no sub-container existed anymore. I had to kill it manually.

stgraber commented 3 years ago

Did you check the content of the PID file prior to killing it?

13werwolf13 commented 1 year ago

I faced the same problem.

The PID in the /var/lib/lxd/devices/NAME/proxy.* files is correct, but after stopping the container the forkproxy processes are not killed, which prevents the container from starting again:

power:~ # cat /var/lib/lxd/devices/ha/proxy.*
name: /usr/bin/lxd
args: [forkproxy, --, "1593", "4", 'tcp:0.0.0.0:6052', "26059", "3", 'tcp:127.0.0.1:6052',
  "", "", "0644", "", "", ""]
apparmor: lxd_forkproxy-esphomeport_ha_</var/lib/lxd>
pid: 26089
uid: 0
gid: 0
set_groups: false
sysprocattr: null
name: /usr/bin/lxd
args: [forkproxy, --, "1593", "4", 'tcp:0.0.0.0:8123', "26059", "3", 'tcp:127.0.0.1:8123',
  "", "", "0644", "", "", ""]
apparmor: lxd_forkproxy-haport_ha_</var/lib/lxd>
pid: 26282
uid: 0
gid: 0
set_groups: false
sysprocattr: null
power:~ # ps aux | grep forkproxy
root     26012  0.0  0.0   3948  2176 pts/0    S+   10:39   0:00 grep --color=auto forkproxy
4000000+ 26089  0.0  0.1 5565600 36864 ?       Ssl  09:32   0:00 /usr/bin/lxd forkproxy -- 1593 4 tcp:0.0.0.0:6052 26059 3 tcp:127.0.0.1:6052   0644
4000000+ 26282  0.1  0.1 5713960 46820 ?       Ssl  09:32   0:05 /usr/bin/lxd forkproxy -- 1593 4 tcp:0.0.0.0:8123 26059 3 tcp:127.0.0.1:8123   0644

Every time, I have to kill the forkproxy processes manually.
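A minimal sketch of that manual cleanup, assuming the non-snap path shown above and that the pid: lines in the device files are still accurate:

    # Kill every forkproxy PID recorded for the container "ha"
    # (reads the pid: lines from the device files dumped above)
    for pid in $(awk '/^pid:/ {print $2}' /var/lib/lxd/devices/ha/proxy.*); do
        kill "$pid"
    done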

power:~ # zypper info lxd
Loading repository data...
Reading installed packages...

Information for package lxd:
----------------------------
Repository     : Base system (OSS)
Name           : lxd
Version        : 5.13-2.1
Arch           : x86_64
Vendor         : openSUSE
Installed Size : 246.8 MiB
Installed      : Yes
Status         : out-of-date (version 5.13-1.1 installed)
Source package : lxd-5.13-2.1.src
Upstream URL   : https://linuxcontainers.org/lxd
Summary        : Container hypervisor based on LXC
Description    : 
    LXD is a system container manager. It offers a user experience
    similar to virtual machines but uses Linux containers (LXC) instead.