Closed: lenny87 closed this issue 4 years ago
Can you upgrade to 4.0.2? I'm not sure we have a fix for that in there, but there are hundreds of fixes between 4.0.1 and 4.0.2, so let's rule that out first.
Hi, I upgraded to 4.0.2 and the problem persists.
config:
core.https_address: :8443
core.trust_password: true
images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- resources_system
- usedby_consistency
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
addresses:
- :8443
architectures:
- x86_64
- i686
certificate: <hidden>
certificate_fingerprint: <hidden>
driver: lxc
driver_version: 4.0.2
firewall: xtables
kernel: Linux
kernel_architecture: x86_64
kernel_features:
netnsid_getifaddrs: "true"
seccomp_listener: "true"
seccomp_listener_continue: "false"
shiftfs: "false"
uevent_injection: "true"
unpriv_fscaps: "true"
kernel_version: 5.4.19-vsh0zfs083
lxc_features:
cgroup2: "true"
mount_injection_file: "true"
network_gateway_device_route: "true"
network_ipvlan: "true"
network_l2proxy: "true"
network_phys_macvlan_mtu: "true"
network_veth_router: "true"
pidfd: "false"
seccomp_notify: "true"
os_name: Ubuntu
os_version: "18.04"
project: default
server: lxd
server_clustered: false
server_name: znode-integration-102
server_pid: 4015
server_version: 4.0.2
storage: zfs
storage_version: 0.8.3-1ubuntu3~18.04.york1.0
Tried to reproduce it here without much success.
The logic in LXD is to track down the PID of forkproxy from /var/snap/lxd/common/lxd/devices/NAME/proxy.DEVICE and kill that. Can you check that you have that file and that the PID in it is correct?
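A quick way to cross-check this by hand (a sketch; NAME and DEVICE are placeholders, and the path is /var/lib/lxd/devices/... on non-snap installs):

```sh
# Print the PID recorded in the proxy device's state file:
awk '/^pid:/ {print $2}' /var/snap/lxd/common/lxd/devices/NAME/proxy.DEVICE

# Check whether that PID still points at a live forkproxy process:
ps -fp "$(awk '/^pid:/ {print $2}' /var/snap/lxd/common/lxd/devices/NAME/proxy.DEVICE)"
```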
It happens only occasionally on our integration test platform (roughly one in 100 runs), so I will try to check whether the PID in the file is right when it happens again.
Thanks
@lenny87
Any luck?
(Closing for now due to no update, feel free to comment in here still and we'll re-open as needed)
This happened for us with LXD 4.0.4. Unfortunately, I didn't notice the issue before fixing it manually. The PID is fine after the restart, but I have no idea whether it was before.
It just happened for me with LXD 4.8, in a nested-containers setup. The outermost container had this forkproxy stuck even though no subcontainer existed anymore. I had to kill it manually.
Did you check the content of the pid file prior to killing it?
I faced the same problem. The PID in the files /var/lib/lxd/devices/NAME/proxy.* is correct, but after stopping the container the forkproxy processes are not killed, which prevents the container from starting again.
power:~ # cat /var/lib/lxd/devices/ha/proxy.*
name: /usr/bin/lxd
args: [forkproxy, --, "1593", "4", 'tcp:0.0.0.0:6052', "26059", "3", 'tcp:127.0.0.1:6052',
"", "", "0644", "", "", ""]
apparmor: lxd_forkproxy-esphomeport_ha_</var/lib/lxd>
pid: 26089
uid: 0
gid: 0
set_groups: false
sysprocattr: null
name: /usr/bin/lxd
args: [forkproxy, --, "1593", "4", 'tcp:0.0.0.0:8123', "26059", "3", 'tcp:127.0.0.1:8123',
"", "", "0644", "", "", ""]
apparmor: lxd_forkproxy-haport_ha_</var/lib/lxd>
pid: 26282
uid: 0
gid: 0
set_groups: false
sysprocattr: null
power:~ # ps aux | grep forkproxy
root 26012 0.0 0.0 3948 2176 pts/0 S+ 10:39 0:00 grep --color=auto forkproxy
4000000+ 26089 0.0 0.1 5565600 36864 ? Ssl 09:32 0:00 /usr/bin/lxd forkproxy -- 1593 4 tcp:0.0.0.0:6052 26059 3 tcp:127.0.0.1:6052 0644
4000000+ 26282 0.1 0.1 5713960 46820 ? Ssl 09:32 0:05 /usr/bin/lxd forkproxy -- 1593 4 tcp:0.0.0.0:8123 26059 3 tcp:127.0.0.1:8123 0644
Every time this happens, I have to kill the forkproxy processes manually.
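A sketch of that manual cleanup, assuming the PIDs recorded in the device files are accurate (the container name ha is taken from the output above):

```sh
# Kill each leftover forkproxy using the PID recorded in its device file.
for f in /var/lib/lxd/devices/ha/proxy.*; do
    kill "$(awk '/^pid:/ {print $2}' "$f")"
done
```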
power:~ # zypper info lxd
Loading repository data...
Reading installed packages...
Information for package lxd:
----------------------------
Repository : Base system (OSS)
Name : lxd
Version : 5.13-2.1
Arch : x86_64
Vendor : openSUSE
Installed Size : 246.8 MiB
Installed : Yes
Status : out-of-date (version 5.13-1.1 installed)
Source package : lxd-5.13-2.1.src
Upstream URL : https://linuxcontainers.org/lxd
Summary : Container hypervisor based on LXC
Description :
LXD is a system container manager. It offers a user experience
similar to virtual machines but uses Linux containers (LXC) instead.
Required information
Issue description
There is some race condition which causes the forkproxy TCP proxy process not to exit after the container is stopped. We're using it in the container config like this:
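(The original config snippet did not survive; the following is a plausible reconstruction from the pid files shown above, with the container name ha and the device name haport inferred from the apparmor labels.)

```sh
lxc config device add ha haport proxy listen=tcp:0.0.0.0:8123 connect=tcp:127.0.0.1:8123
```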
After container shutdown, forkproxy keeps running indefinitely, so we're unable to delete the container without manually killing the forkproxy process (because the ZFS backend dataset is busy).
If someone can tell me where to get more debug information, I can provide it.
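One possible starting point (a suggestion, not confirmed in this thread): stream the daemon's log events while stopping the container, which should show whether LXD attempts to stop the proxy device at all.

```sh
# In a second shell, watch daemon log messages while stopping the container:
lxc monitor --type=logging --pretty
```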