canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Error: Failed to retrieve PID of executing child process #12084

Open KLIM8D opened 12 months ago

KLIM8D commented 12 months ago

I have a cluster with 4 nodes. There are 44 containers spread across the nodes, and at the moment I'm getting this error message when trying to run lxc exec on 19 of them. All but 3 of the containers use Ceph RBD for storage.

I can start the container and exec works for a while, but then all of a sudden I get the error message. Restarting the container works and I can get a shell (or whatever) again, but after some number of hours it happens again.

$ lxc exec amqp-1 bash
Error: Failed to retrieve PID of executing child process

lxc info

config:
  cluster.https_address: 10.0.0.4:8443
  core.https_address: 10.0.0.4:8443
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: REDACTED
auth_user_method: unix
environment:
  addresses:
  - 10.0.0.4:8443
  architectures:
  - x86_64
  - i686
  certificate: |
      REDACTED
  certificate_fingerprint: REDACTED
  driver: lxc | qemu
  driver_version: 5.0.2 | 8.0.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-76-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: true
  server_event_mode: full-mesh
  server_name: compute-lxc-host-1
  server_pid: 743649
  server_version: "5.15"
  storage: ceph | cephfs | lvm
  storage_version: 17.2.5 | 17.2.5 | 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08)
    / 4.45.0
  storage_supported_drivers:
  - name: ceph
    version: 17.2.5
    remote: true
  - name: cephfs
    version: 17.2.5
    remote: true
  - name: cephobject
    version: 17.2.5
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: false
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.1
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false

lxc config show amqp-1 --expanded

architecture: x86_64
config:
  image.build_ver: 10f16957b38b3df1cb525889296835c4868c3d4661a7fcd1040d78da1c364379
  image.name: base_image
  image.os: ubuntu
  image.version: "20.04"
  limits.cpu: 12-19
  limits.memory: 12GB
  security.devlxd: "false"
  security.nesting: "true"
  security.protection.delete: "true"
  security.syscalls.deny_compat: "true"
  volatile.base_image: e03956f34d66c8dd43c7bc85d07345335e89d6b1b28b3fef2c70ced96017749e
  volatile.cloud-init.instance-id: a076608b-aca3-472a-9e81-5e7c5c0f7ed0
  volatile.eth0.host_name: vethed6ade5b
  volatile.eth0.hwaddr: 00:16:3e:8c:59:71
  volatile.eth1.host_name: vethc72de7a7
  volatile.eth1.hwaddr: 00:16:3e:12:ad:e4
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: e8b176c3-7b65-4696-a098-fb27086498d3
  volatile.uuid.generation: e8b176c3-7b65-4696-a098-fb27086498d3
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  eth1:
    name: eth1
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: cephpool
    type: disk
ephemeral: false
profiles:
- ceph
- vlan
stateful: false
description: ""

lxc monitor

location: compute-lxc-host-1
metadata:
  context:
    id: c02c13cd-3839-44d7-b6ff-5918ab19d4b3
    local: /var/snap/lxd/common/lxd/unix.socket
    remote: '@'
  level: debug
  message: Event listener server handler started
timestamp: "2023-07-27T18:17:15.501516869+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.502796588+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/events
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.505814424+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: POST
    protocol: unix
    url: /1.0/instances/amqp-1/exec
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.506461362+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    id: 21a8c997-1f44-4367-b5de-e9d9ef9a0ad3
    local: /var/snap/lxd/common/lxd/unix.socket
    remote: '@'
  level: debug
  message: Event listener server handler started
timestamp: "2023-07-27T18:17:16.507238122+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: New operation
timestamp: "2023-07-27T18:17:16.523119245+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Started operation
timestamp: "2023-07-27T18:17:16.523232686+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context: {}
  level: debug
  message: Waiting for exec websockets to connect
timestamp: "2023-07-27T18:17:16.523382177+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: ""
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Pending
  status_code: 105
  updated_at: "2023-07-27T18:17:16.517874299+02:00"
project: default
timestamp: "2023-07-27T18:17:16.523198189+02:00"
type: operation

location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: ""
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Running
  status_code: 103
  updated_at: "2023-07-27T18:17:16.517874299+02:00"
project: default
timestamp: "2023-07-27T18:17:16.523295708+02:00"
type: operation

location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/operations/47030eed-2f0d-4cd8-bb3c-b4eea03f6a83/websocket?secret=fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.527555079+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connecting to operation
timestamp: "2023-07-27T18:17:16.527615106+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connected to operation
timestamp: "2023-07-27T18:17:16.527857206+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connecting to operation
timestamp: "2023-07-27T18:17:16.528521127+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/operations/47030eed-2f0d-4cd8-bb3c-b4eea03f6a83/websocket?secret=46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.528447454+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connected to operation
timestamp: "2023-07-27T18:17:16.528669575+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/operations/47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.529379566+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    instance: amqp-1
    instanceType: container
    project: default
  level: error
  message: Failed to retrieve PID of executing child process
timestamp: "2023-07-27T18:17:16.53379002+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Updated metadata for operation
timestamp: "2023-07-27T18:17:16.533915644+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: ""
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
    return: -1
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Running
  status_code: 103
  updated_at: "2023-07-27T18:17:16.533893962+02:00"
project: default
timestamp: "2023-07-27T18:17:16.533984728+02:00"
type: operation

location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    err: Failed to retrieve PID of executing child process
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Failure for operation
timestamp: "2023-07-27T18:17:16.534054213+02:00"
type: logging

location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: Failed to retrieve PID of executing child process
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
    return: -1
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Failure
  status_code: 400
  updated_at: "2023-07-27T18:17:16.533893962+02:00"
project: default
timestamp: "2023-07-27T18:17:16.534132375+02:00"
type: operation

location: compute-lxc-host-1
metadata:
  context:
    listener: 21a8c997-1f44-4367-b5de-e9d9ef9a0ad3
    local: /var/snap/lxd/common/lxd/unix.socket
    remote: '@'
  level: debug
  message: Event listener server handler stopped
timestamp: "2023-07-27T18:17:16.537145693+02:00"
type: logging
tomponline commented 12 months ago

This comes up from time to time and is usually caused by a process on your system periodically cleaning the tmp directory.

tomponline commented 12 months ago

Try https://forum.snapcraft.io/t/cleanup-of-tmp-snap-private-tmp-snap-lxd-removes-ability-to-connect-to-running-containers/35645
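
A quick way to check whether that periodic cleanup is what's hitting these hosts (a sketch, assuming the /tmp/snap-private-tmp/snap.lxd path described in the linked post; adjust if your snapd lays things out differently):

# systemctl list-timers systemd-tmpfiles-clean.timer    # when does the cleaner run next?
# systemd-tmpfiles --cat-config | grep /tmp             # which tmpfiles.d rules touch /tmp?
# ls -la /tmp/snap-private-tmp/snap.lxd/                # is LXD's private tmp still intact on an affected host?

If the private tmp contents disappear at roughly the interval the timer fires, the cleanup is the likely cause.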

KLIM8D commented 12 months ago

@tomponline Thanks! That seems about right, and it could also explain why I'm only seeing this now, after switching to the snap LXD version.

But I'm a bit confused: what should the contents of /usr/lib/tmpfiles.d/snapd.conf be? At the moment it contains:

D! /tmp/snap-private-tmp 0700 root root -

Should I change the D! to an x, or add the line x /tmp/snap-private-tmp/snap.lxd above it? And what should be restarted afterwards?

KLIM8D commented 12 months ago

I've added it above and tested with SYSTEMD_LOG_LEVEL=debug /usr/bin/systemd-tmpfiles --clean

Ignoring entry D! "/tmp/snap-private-tmp" because --boot is not specified.
Running clean action for entry x /tmp/snap-private-tmp/snap.lxd
Running clean action for entry D /tmp

The systemd-tmpfiles man page states:

If the exclamation mark ("!") is used, this line is only safe to execute during boot, and can break a running system. Lines without the exclamation mark are presumed to be safe to execute at any time, e.g. on package upgrades. systemd-tmpfiles(8) will take lines with an exclamation mark only into consideration, if the --boot option is given.

But does the third line mean that the snap.lxd directory was still cleaned, even though D! was specified for its parent dir? I'm just trying to understand and make sure it won't happen again.

tomponline commented 12 months ago

Looking at man tmpfiles.d it says:

D     /directory/to/create-and-remove          mode user group cleanup-age -
x     /path-or-glob/to/ignore/recursively      -    -    -     -           -
...
If the exclamation mark ("!") is used, this line is only safe to execute during boot, and can break a running system. Lines without the exclamation mark are presumed to be safe to execute at any time, e.g. on package upgrades. systemd-tmpfiles will take lines with an exclamation mark only into consideration, if the --boot option is given.

So you would want to add a new line to /usr/lib/tmpfiles.d/snapd.conf or add a new file to /usr/lib/tmpfiles.d/. Whether to use D! or x, I'm not sure, but looking now, LXD creates /tmp/snap-private-tmp/snap.lxd/tmp/, so it's possible that using D! /tmp/snap-private-tmp/snap.lxd/ would still cause /tmp/snap-private-tmp/snap.lxd/tmp to be cleaned normally; using x is probably the safer choice.
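
A minimal sketch of the "new file" route, assuming the snap.lxd path from the forum post above; the filename is an arbitrary example, and dropping it into /etc/tmpfiles.d/ (which is scanned alongside /usr/lib/tmpfiles.d/) avoids editing the snapd-owned file:

# cat /etc/tmpfiles.d/snap-lxd-ignore.conf
x /tmp/snap-private-tmp/snap.lxd

# SYSTEMD_LOG_LEVEL=debug /usr/bin/systemd-tmpfiles --clean 2>&1 | grep snap.lxd

The second command runs the cleaner manually with debug logging, so you can confirm the snap.lxd path is now handled by the x entry rather than cleaned.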

tomponline commented 11 months ago

Chatting with the snapd team, there's been a suggestion that perhaps LXD should use the XDG_RUNTIME_DIR path for its runtime state files (i.e. /run/user/0/snap. in the case of daemons) to avoid systemd-tmpfiles cleaning up these files periodically.

KLIM8D commented 11 months ago

Unfortunately it didn't solve the error described at the start; after a few days the "Error: Failed to retrieve PID..." is back.

Contents of /usr/lib/tmpfiles.d/snapd.conf

x /tmp/snap-private-tmp/snap.lxd
D! /tmp/snap-private-tmp 0700 root root -

Trying console instead

$ lxc console mongo-2
To detach from the console, press: <ctrl>+a q
Error: Error opening config file: "loading config file for the container failed"
Error: write /dev/pts/ptmx: file already closed
tomponline commented 11 months ago

You reloaded the systemd-tmpfiles service after making the change?

KLIM8D commented 11 months ago

Yes, pretty sure I did. I reloaded systemd with systemctl daemon-reload just to be sure, and ran systemctl restart systemd-tmpfiles-clean.timer.

Just did it again on all LXD hosts. I'll restart all the faulty containers and see if it happens again within a day or two.

tomponline commented 11 months ago

You could try stopping and disabling the service for a while too and see if that helps.
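
For reference, a sketch of pausing the periodic cleaner while testing (it is driven by the systemd-tmpfiles-clean.timer unit; re-enable it afterwards, since it also cleans other locations):

# systemctl disable --now systemd-tmpfiles-clean.timer
... wait a few days and see whether the error comes back ...
# systemctl enable --now systemd-tmpfiles-clean.timer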

KLIM8D commented 11 months ago

The instances I restarted two days ago are now giving this error again. It's not all of the restarted containers that give the error; for some of them I'm still able to use "lxc exec". So it's a bit inconsistent.

I'm beginning to doubt it has anything to do with the systemd-tmpfiles-clean job. Is anything special happening with, or "inside", the snap package? That was an older version, admittedly, but I never had the issue with non-snap LXD.

KLIM8D commented 11 months ago

What's strange is that I have one LXD cluster host where not a single instance has this issue, and its containers have been running for many days. I haven't spotted the difference between that cluster host and the others yet, other than that it has the "database standby" role.

KLIM8D commented 11 months ago

Found this thread (https://discuss.linuxcontainers.org/t/forkexec-failed-to-load-config-file/16220/9) and checked my hosts to see whether the same thing applies here, i.e. whether this lxc.conf file is missing. Interestingly, on all 3 of the hosts where containers fail, the config file is missing from the /var/snap/lxd/common/lxd/logs/CONTAINER directory. On the host where I don't see this problem, the lxc.conf file is there for all the containers.

On one of the failing LXD hosts, this is the content of forkexec.log in the container directory

Failed to load config file /var/snap/lxd/common/lxd/logs/db-lb-3/lxc.conf for /var/snap/lxd/common/lxd/containers/db-lb-3

There are only forkexec.log and lxc.log in any of the container directories within .../lxd/logs

# ls -l /var/snap/lxd/common/lxd/logs/db-lb-3/
/var/snap/lxd/common/lxd/logs/db-lb-3:
forkexec.log  lxc.log

Could this be the reason I'm getting this error?

After restarting the container, all these files are present again

# ls -l /var/snap/lxd/common/lxd/logs/db-lb-3/
total 100
-rw------- 1 root root  6483 Aug 14 10:36 console.log
-rw-r--r-- 1 root root   122 Aug 14 10:10 forkexec.log
-rw-r--r-- 1 root root     0 Aug 14 10:36 forkstart.log
-rw-r----- 1 root root  2508 Aug 14 10:36 lxc.conf
-rw-r----- 1 root root 80587 Aug 14 10:36 lxc.log
-rw-r----- 1 root root  2939 Aug 14 10:36 lxc.log.old
jonoinnz commented 11 months ago

We've also been experiencing the same problem in our environments. Basically, we created clusters, and even out of the box, if we leave a cluster alone for a while we see the same error when trying to exec into the containers. We can reboot the containers to get a shell again, but that is not convenient in a production environment for obvious reasons. It would be great if we could be given some instructions on how to troubleshoot this.

KLIM8D commented 11 months ago

Manually restoring lxc.conf in /var/snap/lxd/common/lxd/logs/<CONTAINER> makes a faulty container that gives this error accept exec commands again. I just took the NIC values from lxc config show and grabbed the PID for the lxc.hook.pre-start line.

So the question is: why does this file suddenly disappear...?

petrosssit commented 10 months ago

Thank you @KLIM8D

We have the same issue and are thinking of a workaround for now: keep a copy of lxc.conf when a container starts, have a way of alerting when lxc.conf disappears, and copy it back.
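
A minimal sketch of such a watcher, run from cron or a systemd timer; the snap paths match those used earlier in this thread, and the container name and backup location are placeholders:

#!/bin/sh
# Back up lxc.conf while it exists; restore it (and log an alert) if it disappears.
NAME=CONTAINER
LOGDIR=/var/snap/lxd/common/lxd/logs/$NAME
BACKUP=/var/backups/lxd-lxc-conf/$NAME.lxc.conf

mkdir -p "$(dirname "$BACKUP")"
if [ -f "$LOGDIR/lxc.conf" ]; then
    cp "$LOGDIR/lxc.conf" "$BACKUP"
elif [ -f "$BACKUP" ]; then
    logger -t lxc-conf-watch "lxc.conf missing for $NAME, restoring from backup"
    mkdir -p "$LOGDIR"
    cp "$BACKUP" "$LOGDIR/lxc.conf"
fi

Restoring the file by hand was reported earlier in this thread to make lxc exec work again, so copying back a known-good copy should behave the same way.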

Having this file in a logs directory is certainly odd; a /run subdirectory would be more appropriate, I would say.

On top of that, we seem to have occasional issues with log rotation in this directory. We had lxc.log files recreated with 0 bytes while the lxc monitor process kept the old file open, filling up /var, and only a restart of the container released the space of the already-deleted file.

I am not sure whether this misbehaving log rotation may also delete the lxc.conf in the same directory. I have not figured out where the log rotation configuration for these log files is kept.

KLIM8D commented 10 months ago

Apparently it is the entire contents, or at least all the container directories, in /var/snap/lxd/common/lxd/logs/ that get cleaned up by something. That would be fine if it weren't for the fact that lxc.conf is needed...

I tried setting a rule in auditd to see what deleted the lxc.conf file, and to my surprise there were no container directories in there at all. Running exec for any of the containers creates a directory for that container in /var/snap/lxd/common/lxd/logs/, but it only contains forkexec.log and lxc.log.

@petrosssit So maybe your workaround script should also check whether the directory itself is still there.

# ls /var/snap/lxd/common/lxd/logs/
dnsmasq.lxdbr0.log  lxd.log 

After running exec for the test-redis container

# ls /var/snap/lxd/common/lxd/logs/*
/var/snap/lxd/common/lxd/logs/dnsmasq.lxdbr0.log  /var/snap/lxd/common/lxd/logs/lxd.log

/var/snap/lxd/common/lxd/logs/test-redis:
forkexec.log  lxc.log
nfournil commented 9 months ago

Same issue with LXC/LXD 5.0.2 from Debian Bookworm ... lxc.conf (written to /var/log/lxd/[container]/lxc.conf) disappears.

Any more information on this?

nfournil commented 8 months ago

To be more precise about my setup: I use Ceph as the storage backend.

To dig further, I've added a script that periodically checks for the presence of lxc.conf (one test every 10 s) and I keep all API logs: the file disappears when the cleanup of expired backups task starts.

I've also seen this commit, which is not present in the refactored code (it protected lxc.conf during delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0

Hope we can find out more about this bug...

A nice workaround to regenerate lxc.conf: lxc config show [container] > /var/log/lxd/[container]/lxc.conf
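
A sketch of applying that regeneration across all containers on a host, assuming the snap path layout (the Debian package above uses /var/log/lxd/ instead); on a cluster you would want to limit the loop to containers located on the local member, and whether the exact file contents matter is unclear, since the next comment reports that even an empty file is enough:

for c in $(lxc list --format csv -c n); do
    mkdir -p /var/snap/lxd/common/lxd/logs/"$c"
    lxc config show "$c" > /var/snap/lxd/common/lxd/logs/"$c"/lxc.conf
done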

capriciousduck commented 8 months ago

@nfournil A simple empty lxc.conf file works too.

touch /var/snap/lxd/common/lxd/logs/[container]/lxc.conf

¯\_(ツ)_/¯

Edit: "lxc exec" with an empty lxc.conf works, but later data is added to that lxc.conf file (right after I did an lxc exec). So it's not really empty all the while.

tomponline commented 8 months ago

I've also seen this commit, which is not present in the refactored code (it protected lxc.conf during delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0

Hi, please could you clarify what you mean here?

nfournil commented 8 months ago

If you go to this file (which seems to be the 5.x location for this function), you find the protection for the lxc.log file, but not for lxc.conf anymore ...

https://github.com/canonical/lxd/blob/f14fc05ed333006ffe344e7268bbc7fda3994596/lxd/instance_logs.go#L295

nfournil commented 8 months ago

I added more aggressive logging (one check per second for whether the lxc.conf file exists) and I can confirm that the file disappears when the following API operation runs:

location: srv05-r2b-fl1
metadata:
  context:
    class: task
    description: Expiring log files
    operation: 42d4066c-3b40-4649-867b-637556984e7a
    project: ""
  level: debug
  message: Started operation
timestamp: "2023-11-08T11:00:47.092443683+01:00"
type: logging

Less than half a second later, the file no longer exists...
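
A rough way to reproduce that correlation (a sketch; CONTAINER is a placeholder and the path is the snap layout, so adjust for /var/log/lxd/ on Debian): record events with lxc monitor while a loop timestamps the moment the file vanishes:

lxc monitor > /tmp/lxd-monitor.log &
while [ -f /var/snap/lxd/common/lxd/logs/CONTAINER/lxc.conf ]; do sleep 1; done
date; echo "lxc.conf is gone"

Comparing that timestamp with the "Expiring log files" operation in the monitor output shows whether the log-expiry task is the one removing the file.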

tomponline commented 4 months ago

@mihalicyn I think if we cherry-pick this https://github.com/lxc/incus/pull/361 it should help.

tomponline commented 3 weeks ago

Reopening due to https://github.com/canonical/lxd/issues/12084#issuecomment-1794972551

tomponline commented 3 weeks ago

Looks like instanceLogDelete is preventing lxc.conf from being deleted

https://github.com/canonical/lxd/blob/8f2d07c73c0e1ed89f8af7f55d72361a9d46426a/lxd/instance_logs.go#L344-L346

tomponline commented 3 weeks ago

One solution to this might be to stop using tmp and log dirs for .conf files, similar to https://github.com/lxc/incus/pull/426