Closed: sleepymole closed this issue 1 year ago
Are all the cluster members running exactly the same version?
Also can you confirm the issue still exists with LXD 5.9, as 5.3 isn't a supported version.
> Are all the cluster members running exactly the same version?
Yes, all the cluster members are running the same version.
> Also can you confirm the issue still exists with LXD 5.9, as 5.3 isn't a supported version.
OK, I will try to upgrade to LXD 5.9 to confirm whether the issue still exists. But I still have some questions:
Most likely different versions running. But not something I've seen otherwise.
LXD 5.0.x is the LTS series. But you cannot switch to the LTS from LXD 5.x.
> LXD 5.0.x is the LTS series. But you cannot switch to the LTS from LXD 5.x.
Oh, I forgot I skipped the LTS version before 🤣.
I ran into another case. The result of `cluster list` shows all nodes as online.
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-11 | https://10.2.4.11:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-12 | https://10.2.4.10:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-13 | https://10.2.4.9:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-14 | https://10.2.4.8:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-15 | https://10.2.4.7:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-16 | https://10.2.4.5:8443 | database-leader | x86_64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-17 | https://10.2.4.6:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-18 | https://10.2.4.4:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-19 | https://10.2.4.3:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-20 | https://10.2.4.2:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-21 | https://10.2.4.12:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-22 | https://10.2.4.13:8443 | | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ks-23 | https://10.2.4.14:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
However, node ks-16 has lost its listener on port 8443, although existing connections to it are still alive.
I also checked the related code on the master and lxd-5.3 branches: the listener is only closed when `Down` is called or when the address is updated, but there were no logs indicating that `Down` was called.
https://github.com/lxc/lxd/blob/c7950958caebfcb76e612913d107543bba10a739/lxd/endpoints/endpoints.go#L367
https://github.com/lxc/lxd/blob/c7950958caebfcb76e612913d107543bba10a739/lxd/daemon.go#L1791
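To make the point concrete, the pattern I'm describing looks roughly like the sketch below. It is a simplified illustration based on my reading of that code, not the actual LXD implementation; the `Endpoints` struct and the `closeListener` helper here are my own simplification.

```go
// Minimal sketch of the closing pattern described above (not the real LXD code):
// the network listener is only ever closed from Down() or an address update,
// and both paths log before closing.
package endpoints

import (
	"log"
	"net"
	"sync"
)

// Endpoints holds the daemon's listeners, keyed by kind
// (e.g. "network" for the HTTPS listener on :8443).
type Endpoints struct {
	mu        sync.Mutex
	listeners map[string]net.Listener
}

// closeListener is the single place a listener gets closed, and it logs first,
// so a listener silently disappearing would be unexpected.
func (e *Endpoints) closeListener(kind string) {
	if l, ok := e.listeners[kind]; ok {
		log.Printf("Closing %s listener on %s", kind, l.Addr())
		_ = l.Close()
		delete(e.listeners, kind)
	}
}

// Down tears down all listeners (daemon shutdown path).
func (e *Endpoints) Down() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for kind := range e.listeners {
		e.closeListener(kind)
	}
}

// NetworkUpdateAddress swaps the network listener when the HTTPS address
// changes: close the old one, then bind the new address.
func (e *Endpoints) NetworkUpdateAddress(addr string) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.closeListener("network")
	l, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	e.listeners["network"] = l
	return nil
}
```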
Although version 5.3 is no longer supported, I'm still interested in finding out the root cause of this issue. I suspect that this issue may exist in the latest branch.
time="2023-01-03T12:40:12+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35614->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:12+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.376064527s interval=10s
time="2023-01-03T12:40:24+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35692->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:24+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=12.030736317s interval=10s
time="2023-01-03T12:40:35+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35732->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:35+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.656190389s interval=10s
time="2023-01-03T12:40:47+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35778->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:47+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.713733848s interval=10s
time="2023-01-03T12:40:58+08:00" level=warning msg="Failed adding member event listener client" err="read tcp 10.2.4.5:35824->10.2.4.11:8443: i/o timeout" local="10.2.4.5:8443" remote="10.2.4.11:8443"
time="2023-01-03T12:40:58+08:00" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=10.538145053s interval=10s
Most of the log entries were repetitions of the above.
10.2.4.11 is the IP address of node ks-11, which went offline several days ago. I restarted it, but it seems it was unable to connect to ks-16 (the database leader).
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.13:8443: no known leader"
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.14:8443: no known leader"
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.2:8443: reported leader unavailable err=dial: Failed connecting to HTTP endpoint \"10.2.4.5:8443\": dial tcp 10.2.4.5:8443: connect: connection refused"
time="2023-01-03T12:45:01+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.3:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.4:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.5:8443: dial: Failed connecting to HTTP endpoint \"10.2.4.5:8443\": dial tcp 10.2.4.5:8443: connect: connection refused"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.6:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.7:8443: reported leader unavailable err=dial: Failed connecting to HTTP endpoint \"10.2.4.5:8443\": dial tcp 10.2.4.5:8443: connect: connection refused"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.8:8443: no known leader"
time="2023-01-03T12:45:02+08:00" level=warning msg="Dqlite: attempt 2: server 10.2.4.9:8443: no known leader"
10.2.4.5 is the IP address of ks-16. Since it no longer listens on 8443, ks-11 can't connect to it.
Please show `lxc config show` for the member that doesn't listen on any remote port.
This is the result of `lxc config show`:
[root@KS-16 ~]# lxc config show
config:
  cluster.https_address: 10.2.4.5:8443
  core.https_address: 10.2.4.5:8443
[root@KS-16 ~]#
Please show `sudo ss -tlpn | grep lxd` from that host.
> `ss -tlpn | grep lxd`

There is no output.
OK thanks, what about `sudo ss -tlpn | grep 8443`?
That also returns nothing. I can even use `nc` to listen on 8443.
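For example, a tiny probe like the one below (just a sketch, using the same address) can bind the port successfully, which gives the same signal as the `nc` test: nothing on the host is holding 8443.

```go
// Quick probe: if this bind succeeds, nothing else is listening on the
// address/port. Mirrors the nc test; the address is the one from this cluster.
package main

import (
	"fmt"
	"net"
)

func main() {
	l, err := net.Listen("tcp", "10.2.4.5:8443")
	if err != nil {
		fmt.Println("port is already in use:", err)
		return
	}
	defer l.Close()
	fmt.Println("bind succeeded, so nothing else is listening on", l.Addr())
}
```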
I would suggest killing the LXD process (PID 78158) so it's restarted by snap/systemd, or rebooting that host.
Hopefully that will get it to listen again and restore the cluster, and then we'd want to see whether the same issue occurs on LXD 5.9 once the upgrade has been completed.
> I would suggest killing the LXD process (PID 78158) so it's restarted by snap/systemd, or rebooting that host.
Yes, killing it really works, but this issue has occurred many times.
OK great, it'll be interesting to see if this occurs in LXD 5.9. Thanks
Ok, I will upgrade the cluster soon.
Thanks, I'll close this for now; if it re-occurs, please repost and I'll reopen. Thanks.
Required information
`lxc info`:
```go config: cluster.https_address: 10.2.4.8:8443 core.https_address: 10.2.4.8:8443 api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - 
container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - 
container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls environment: addresses: - 10.2.4.8:8443 architectures: - x86_64 - i686 certificate: |
  certificate_fingerprint:
  driver: lxc | qemu
  driver_version: 5.0.0 | 7.0.0
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "false"
    netnsid_getifaddrs: "false"
    seccomp_listener: "false"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "false"
    unpriv_fscaps: "true"
  kernel_version: 3.10.0-957.el7.x86_64
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: CentOS Linux
  os_version: "7"
  project: tispace
  server: lxd
  server_clustered: true
  server_event_mode: full-mesh
  server_name: ks-14
  server_pid: 17468
  server_version: "5.3"
  storage: dir | lvm
  storage_version: 1 | 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.37.1
  storage_supported_drivers:
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.16
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.37.1
    remote: false
  - name: ceph
    version: 15.2.16
    remote: true
```
Issue description
From `lxc cluster ls`, a member (ks-14) had been offline for about 115 hours. I checked the network connection between ks-14 and all other nodes, and the network seems fine. I observed that LXD on ks-14 doesn't listen on 8443, so I guess this is why the cluster thinks it is offline. But the LXD process is still running.
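My mental model of why it shows as offline is roughly the sketch below; it is purely illustrative (not LXD's actual heartbeat logic), but it captures the idea that a member only appears online if its HTTPS address is reachable, regardless of whether the daemon process is running.

```go
// Illustrative sketch only: a member whose HTTPS address can't be dialed
// within the timeout looks "offline" to the rest of the cluster, even if
// its daemon process is still alive.
package main

import (
	"fmt"
	"net"
	"time"
)

func memberOnline(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return false // refused or timed out: member appears offline
	}
	conn.Close()
	return true
}

func main() {
	// ks-14's cluster address from the report above.
	fmt.Println("ks-14 online:", memberOnline("10.2.4.8:8443"))
}
```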
After I killed the LXD process (17468), its state became `ONLINE` again. I wonder why the LXD process was running but not listening on 8443.
Steps to reproduce
Information to attach
- `dmesg`
- `lxc info NAME --show-log`
- `lxc config show NAME --expanded`
- `lxc monitor` (while reproducing the issue)