Closed: sleepymole closed this issue 1 year ago
Are all the cluster members running exactly the same version?
Also can you confirm the issue still exists with LXD 5.9, as 5.3 isn't a supported version.
> Are all the cluster members running exactly the same version?
Yes, all the cluster members are running the same version.
> Also can you confirm the issue still exists with LXD 5.9, as 5.3 isn't a supported version.
OK, I will try to upgrade to LXD 5.9 to confirm whether the issue still exists. But I still have some questions:
Most likely different versions running. But not something I've seen otherwise.
LXD 5.0.x is the LTS series. But you cannot switch to the LTS from LXD 5.x.
> LXD 5.0.x is the LTS series. But you cannot switch to the LTS from LXD 5.x.
Oh, I forgot I skipped the LTS version before 🤣.
I ran into another case. The result of `cluster list` shows all nodes as online.
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-11 | https://10.2.4.11:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-12 | https://10.2.4.10:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-13 | https://10.2.4.9:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-14 | https://10.2.4.8:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-15 | https://10.2.4.7:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-16 | https://10.2.4.5:8443 | database-leader | x86_64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-17 | https://10.2.4.6:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-18 | https://10.2.4.4:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-19 | https://10.2.4.3:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-20 | https://10.2.4.2:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-21 | https://10.2.4.12:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-22 | https://10.2.4.13:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-23 | https://10.2.4.14:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
However, node ks-16 has lost its listener on port 8443, although existing connections to it are still alive.
I also checked the related code on the master and lxd-5.3 branches: the listener is only closed when `Down` is called or when the address is updated, but there were no logs indicating that `Down` was called.
https://github.com/lxc/lxd/blob/c7950958caebfcb76e612913d107543bba10a739/lxd/endpoints/endpoints.go#L367
https://github.com/lxc/lxd/blob/c7950958caebfcb76e612913d107543bba10a739/lxd/daemon.go#L1791
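To make the point concrete, the pattern I'm describing looks roughly like the sketch below. It is a simplified illustration based on my reading of that code, not the actual LXD implementation; the `Endpoints` struct and the `closeListener` helper here are my own simplification.

```go
// Minimal sketch of the closing pattern described above (not the real LXD code):
// the network listener is only ever closed from Down() or an address update,
// and both paths log before closing.
package endpoints

import (
	"log"
	"net"
	"sync"
)

// Endpoints holds the daemon's listeners, keyed by kind
// (e.g. "network" for the HTTPS listener on :8443).
type Endpoints struct {
	mu        sync.Mutex
	listeners map[string]net.Listener
}

// closeListener is the single place a listener gets closed, and it logs first,
// so a listener silently disappearing would be unexpected.
func (e *Endpoints) closeListener(kind string) {
	if l, ok := e.listeners[kind]; ok {
		log.Printf("Closing %s listener on %s", kind, l.Addr())
		_ = l.Close()
		delete(e.listeners, kind)
	}
}

// Down tears down all listeners (daemon shutdown path).
func (e *Endpoints) Down() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for kind := range e.listeners {
		e.closeListener(kind)
	}
}

// NetworkUpdateAddress swaps the network listener when the HTTPS address
// changes: close the old one, then bind the new address.
func (e *Endpoints) NetworkUpdateAddress(addr string) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.closeListener("network")
	l, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	e.listeners["network"] = l
	return nil
}
```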
Although version 5.3 is no longer supported, I'm still interested in finding out the root cause of this issue. I suspect that this issue may exist in the latest branch.
time="2023-01-03T12:40:12+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35614->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:12+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.376064527s interval=10s
time="2023-01-03T12:40:24+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35692->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:24+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=12.030736317s interval=10s
time="2023-01-03T12:40:35+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35732->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:35+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.656190389s interval=10s
time="2023-01-03T12:40:47+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35778->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:47+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.713733848s interval=10s
time="2023-01-03T12:40:58+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35824->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:58+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=10.538145053s interval=10s
Most of the log entries were repetitions of the above.
10.2.4.11 is the IP address of node ks-11, which went offline several days ago. I restarted it, but it seems it was unable to connect to ks-16 (the database leader).
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.13:8443: no known leader"
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.14:8443: no known leader"
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.2:8443: reported leader unavailable err=dial: Failed connecting to HTTP endpoint \"10.2.4.5:8443\": dial tcp 10.2.4.5:8443: connect: connection refused"
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.3:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.4:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.5:8443: dial: Failed connecting to HTTP endpoint \"10.2.4.5:8443\": dial tcp 10.2.4.5:8443: connect: connection refused"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.6:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.7:8443: reported leader unavailable err=dial: Failed connecting to HTTP endpoint \"10.2.4.5:8443\": dial tcp 10.2.4.5:8443: connect: connection refused"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.8:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.9:8443: no known leader"
10.2.4.5 is the IP address of ks-16. Since it no longer listens on 8443, ks-11 can't connect to it.
Please show `lxc config show` for the member that doesn't listen on any remote port.
This is the result of `lxc config show`:
[root@KS-16 ~]# lxc config show
config:
  cluster.https_address: 10.2.4.5:8443
  core.https_address: 10.2.4.5:8443
[root@KS-16 ~]#
Please show `sudo ss -tlpn | grep lxd` from that host.
> `ss -tlpn | grep lxd`

There is no output.
OK thanks, what about `sudo ss -tlpn | grep 8443`?
That also returns nothing. I can even use `nc` to listen on 8443.
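For example, a tiny probe like the one below (just a sketch, using the same address) can bind the port successfully, which gives the same signal as the `nc` test: nothing on the host is holding 8443.

```go
// Quick probe: if this bind succeeds, nothing else is listening on the
// address/port. Mirrors the nc test; the address is the one from this cluster.
package main

import (
	"fmt"
	"net"
)

func main() {
	l, err := net.Listen("tcp", "10.2.4.5:8443")
	if err != nil {
		fmt.Println("port is already in use:", err)
		return
	}
	defer l.Close()
	fmt.Println("bind succeeded, so nothing else is listening on", l.Addr())
}
```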
I would suggest killing the LXD process (PID 78158) so it's restarted by snap/systemd, or rebooting that host.
Hopefully that will get it to listen again and restore the cluster, and then we'd want to see whether the same issue occurs on LXD 5.9 once the upgrade has been completed.
> I would suggest killing the LXD process (PID 78158) so it's restarted by snap/systemd, or rebooting that host.
Yes, killing it really works, but this issue has occurred many times.
OK great, it'll be interesting to see if this occurs in LXD 5.9. Thanks
Ok, I will upgrade the cluster soon.
Thanks, I'll close this for now; if it re-occurs, please repost and I'll reopen. Thanks.
Required information
`lxc info`:
```go config: cluster.https_address: 10.2.4.8:8443 core.https_address: 10.2.4.8:8443 api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - 
container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - 
container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls environment: addresses: - 10.2.4.8:8443 architectures: - x86_64 - i686 certificate: |
  certificate_fingerprint:
  driver: lxc | qemu
  driver_version: 5.0.0 | 7.0.0
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "false"
    netnsid_getifaddrs: "false"
    seccomp_listener: "false"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "false"
    unpriv_fscaps: "true"
  kernel_version: 3.10.0-957.el7.x86_64
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: CentOS Linux
  os_version: "7"
  project: tispace
  server: lxd
  server_clustered: true
  server_event_mode: full-mesh
  server_name: ks-14
  server_pid: 17468
  server_version: "5.3"
  storage: dir | lvm
  storage_version: 1 | 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.37.1
  storage_supported_drivers:
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.16
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.37.1
    remote: false
  - name: ceph
    version: 15.2.16
    remote: true
```
Issue description
From `lxc cluster ls`, a member (ks-14) had been offline for about 115 hours. I checked the network connection between ks-14 and all other nodes, and the network seems fine. I observed that LXD on ks-14 doesn't listen on 8443, so I guess this is why the cluster thinks it is offline. But the LXD process is still running.
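My mental model of why it shows as offline is roughly the sketch below; it is purely illustrative (not LXD's actual heartbeat logic), but it captures the idea that a member only appears online if its HTTPS address is reachable, regardless of whether the daemon process is running.

```go
// Illustrative sketch only: a member whose HTTPS address can't be dialed
// within the timeout looks "offline" to the rest of the cluster, even if
// its daemon process is still alive.
package main

import (
	"fmt"
	"net"
	"time"
)

func memberOnline(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return false // refused or timed out: member appears offline
	}
	conn.Close()
	return true
}

func main() {
	// ks-14's cluster address from the report above.
	fmt.Println("ks-14 online:", memberOnline("10.2.4.8:8443"))
}
```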
After I killed the LXD process (17468), its state became `ONLINE` again. I wonder why the LXD process was running but not listening on 8443.
Steps to reproduce
Information to attach
- `dmesg`
- `lxc info NAME --show-log`
- `lxc config show NAME --expanded`
- `lxc monitor` (while reproducing the issue)