canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD Cluster crashes and API unresponsiveness likely due to latency #8714

Closed · benaryorg closed this issue 3 years ago

benaryorg commented 3 years ago

Required information

config:
  cluster.https_address: […]:8443
  cluster.images_minimal_replica: "2"
  cluster.max_standby: "3"
  core.https_address: '[::]:8443'
  core.trust_password: true
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - […]:8443
  - '[[…]]:8443'
  - '[[…]]:8443'
  architectures:
  - x86_64
  - i686
  certificate: […]
  certificate_fingerprint: […]
  driver: lxc
  driver_version: 4.0.6
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.11.16
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Gentoo
  os_version: ""
  project: default
  server: lxd
  server_clustered: true
  server_name: lxd2
  server_pid: 2439
  server_version: "4.13"
  storage: btrfs
  storage_version: 5.11.1

Issue description

My cluster, with both 3 and 4 nodes, consistently has the LXD daemon crash without any relevant logs whatsoever¹ other than the failed heartbeats reported whenever one daemon crashes. The API is also unresponsive, i.e. it takes minutes to respond to lxc list --fast even without any containers, though only on some nodes.

Considering the responses in #6119 as well as a forum reply, this might be linked to latency.

All nodes have reasonably low latency to each other, either around 1ms or 80ms depending on the datacentre.

¹: I'm currently running it again with --debug, since I can't be 100% sure I ran with that flag last time.

Steps to reproduce

So far I could only recreate this with this set of servers or subsets of it (as I don't have access to too many resources right now).

The current cluster layout looks vaguely like this:

| name | url | database | state | message | architecture | failure domain |
|------|-----|----------|-------|---------|--------------|----------------|
| lxd2 | https://lxd2.[…]:8443 | YES | ONLINE | fully operational | x86_64 | ovh-gra |
| lxd3 | https://lxd3.[…]:8443 | NO | ONLINE | fully operational | x86_64 | ovh-bhs |
| lxd4 | https://lxd4.[…]:8443 | YES | ONLINE | fully operational | x86_64 | netcup |
| lxd5 | https://lxd5.[…]:8443 | YES | ONLINE | fully operational | x86_64 | ovh-bhs |

No special settings or anything for that matter, since the issue arises even without any containers, so neither storage nor network setup should be required. The nodes each have their own FQDN with both A and AAAA records and run a publicly accessible API there; basically the same setup you'd have in a LAN, but with public IPv6 addresses. Each node also has some additional IPv6 setup (ndppd and radvd), though this impacts neither UDP nor TCP connectivity for any of the hosts' IP addresses.

Note: hard crashes because of latency are a bug.
Note: the API and main cluster functions requiring very low-latency connections is a major issue for some setups and must therefore be documented accordingly; otherwise it is a bug.


stgraber commented 3 years ago

@MathieuBordere what are the requirements as far as latency in dqlite? It crashing completely seems a bit extreme; I'd just expect it to be horrible to use.

MathieuBordere commented 3 years ago

There are no hard requirements wrt latency AFAIK, except for the election timeout. That timeout is 3000ms by default; if the RTT frequently rises above 3000ms, the cluster will have issues electing a leader and making progress. I agree 100% that the cluster should not crash under any circumstances in the case of large latency.

stgraber commented 3 years ago

@benaryorg can you get us the stack trace that's printed when it crashes, that would let us confirm it's an issue with dqlite.

stgraber commented 3 years ago

Also as you're not using the snap package, you should make sure your systems have an up to date libraft and libdqlite as we fixed a lot of bugs in them lately.

benaryorg commented 3 years ago

> can you get us the stack trace that's printed when it crashes, that would let us confirm it's an issue with dqlite.

As far as I can see there is no stack trace in any of the usual logs (messages, dmesg, lxd/lxd.log, etc.); is there any option akin to --print-goroutines-every 5 which would cause LXD to print a stack trace?


> Also as you're not using the snap package, you should make sure your systems have an up to date libraft and libdqlite as we fixed a lot of bugs in them lately.

The ebuild used is this one², pulling in raft and dqlite:

  dev-libs/dqlite-1.6.0 pulled in by:
    app-emulation/lxd-4.13 requires dev-libs/dqlite

  dev-libs/raft-0.10.0 pulled in by:
    app-emulation/lxd-4.13 requires dev-libs/raft
    dev-libs/dqlite-1.6.0 requires dev-libs/raft

These seem to be the most current versions, looking at the GitHub releases?

²: 4.13 is built from the same file, named lxd-4.13.ebuild, which automatically uses the corresponding versions where relevant, as far as I can see.


Also, I've finished ¹ from the first message. Nothing in the logs, and lxc monitor looks like this:

location: lxd3
metadata:
  context:
    name: "0"
  level: dbug
  message: Found cert
timestamp: "2021-04-26T10:20:00.759849201Z"
type: logging

location: lxd3
metadata:
  context: {}
  level: dbug
  message: Replace current raft nodes with [{ID:5 Address:lxd5.[…]:8443
    Role:voter} {ID:1 Address:lxd2.[…]:8443 Role:voter} {ID:2 Address:lxd4.[…]:8443
    Role:voter} {ID:4 Address:lxd3.[…]:8443 Role:stand-by}]
timestamp: "2021-04-26T10:20:00.760174744Z"
type: logging

location: lxd4
metadata:
  context: {}
  level: dbug
  message: Successful heartbeat for lxd5.[…]:8443
timestamp: "2021-04-26T10:20:00.866125432Z"
type: logging

location: lxd4
metadata:
  context: {}
  level: dbug
  message: Successful heartbeat for lxd5.[…]:8443
timestamp: "2021-04-26T10:20:00.866125432Z"
type: logging

location: lxd4
metadata:
  context: {}
  level: dbug
  message: Successful heartbeat for lxd3.[…]:8443
timestamp: "2021-04-26T10:20:00.884889115Z"
type: logging

location: lxd4
metadata:
  context: {}
  level: dbug
  message: Successful heartbeat for lxd3.[…]:8443
timestamp: "2021-04-26T10:20:00.884889115Z"
type: logging

Error: websocket: close 1006 (abnormal closure): unexpected EOF

And the previous ~100 entries look the same, i.e. lots of heartbeats sent and received and some raft updates, but nothing else, or at least nothing that's easily found. I'll copy the output to a file and dissect it a bit more later, but so far it looks like the process just died all of a sudden.

benaryorg commented 3 years ago

It was also pointed out to me that LXD does not expose dqlite_node_set_network_latency() in any way to the user/admin, who could otherwise provide layer-8-derived values, which might help dqlite in such situations:

https://github.com/canonical/dqlite/blob/df96115a8be386a3afe2d8f356a8158b37a4d87b/include/dqlite.h#L114-L126
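
For illustration only, here's a minimal sketch (Go with cgo, against the C API linked above) of what setting that value looks like when creating a dqlite node directly. Nothing like this is wired up in LXD; the node ID, address, data directory, and latency value are made up, and the point is just to show the knob dqlite already offers:

```go
package main

/*
#cgo LDFLAGS: -ldqlite
#include <dqlite.h>
*/
import "C"

import "log"

func main() {
	var node *C.dqlite_node

	// Hypothetical standalone node; the CString allocations are leaked,
	// which is fine for a throwaway sketch.
	if rc := C.dqlite_node_create(1, C.CString("lxd2.example.net:9001"),
		C.CString("/var/lib/dqlite"), &node); rc != 0 {
		log.Fatalf("dqlite_node_create: %d", rc)
	}
	defer C.dqlite_node_destroy(node)

	// The knob linked above: an average network latency hint, in nanoseconds,
	// that dqlite/raft can use to tune its timeouts. 40ms is an illustrative
	// value in the ballpark of the inter-DC latency reported in this issue.
	if rc := C.dqlite_node_set_network_latency(node, 40*1000*1000); rc != 0 {
		log.Fatalf("dqlite_node_set_network_latency: %d", rc)
	}

	if rc := C.dqlite_node_start(node); rc != 0 {
		log.Fatalf("dqlite_node_start: %d", rc)
	}
	defer C.dqlite_node_stop(node)
}
```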

stgraber commented 3 years ago

@benaryorg try running lxd --debug --group lxd directly by hand; something should show up when the process crashes and exits.

benaryorg commented 3 years ago

> try running lxd --debug --group lxd directly by hand

Here are the last few lines of output. The last line is logged by zsh as far as I know, which would explain why my logfiles didn't contain anything either: there just wasn't any output to log.

DBUG[04-27|18:47:08] Heartbeat updating local raft nodes to [{ID:1 Address:lxd2.[…]:8443 Role:voter} {ID:2 Address:lxd4.[…]:8443 Role:stand-by} {ID:4 Address:lxd3.[…]:8443 Role:voter} {ID:5 Address:lxd5.[…]:8443 Role:voter}]
DBUG[04-27|18:47:10] Sending heartbeat to lxd5.[…]:8443
DBUG[04-27|18:47:10] Sending heartbeat request to lxd5.[…]:8443
DBUG[04-27|18:47:10] Sending heartbeat to lxd2.[…]:8443
DBUG[04-27|18:47:10] Sending heartbeat request to lxd2.[…]:8443
DBUG[04-27|18:47:10] Successful heartbeat for lxd5.[…]:8443
DBUG[04-27|18:47:11] Successful heartbeat for lxd2.[…]:8443
DBUG[04-27|18:47:15] Sending heartbeat to lxd4.[…]:8443
DBUG[04-27|18:47:15] Sending heartbeat request to lxd4.[…]:8443
DBUG[04-27|18:47:15] Successful heartbeat for lxd4.[…]:8443
zsh: segmentation fault  sudo lxd --group lxd --debug

This is the output of lxd3, which was the only one to crash this time; other times all nodes or a different subset crashed. The only constant seems to be that something crashes.


I'm running the entire cluster again with gdb in a moment so at least I'll get a stack trace (I'm not too familiar with Go, so I can only hope it'll be a useful trace), in addition to setting GOTRACEBACK=2 (found over here: https://github.com/golang/go/issues/5337).

Bringing the entire command line to:

gdb -q -iex 'set pagination off' -iex "set auto-load safe-path /usr/lib/go/src/runtime/runtime-gdb.py" -ex 'file /usr/sbin/lxd' -ex 'set environment GOTRACEBACK = 2' -ex 'run --debug --group lxd'

Last time it took around 1¾h to crash, so it's not really something I can sit and wait for, which means reports on whether something works or not will usually take some time; just so you know ahead of time that it'll probably take a bit.

stgraber commented 3 years ago

Ok, so that's unlikely to be a Go crash and much more likely to be coming from dqlite directly. Let's see if you can get a backtrace with gdb; that should help @MathieuBordere

benaryorg commented 3 years ago

This one took about 8 hours to crash if I'm not mistaken, but here it is:

DBUG[04-28|17:14:14] Sending heartbeat to lxd5.[…]:8443
DBUG[04-28|17:14:14] Sending heartbeat request to lxd5.[…]:8443
DBUG[04-28|17:14:15] Successful heartbeat for lxd5.[…]:8443
DBUG[04-28|17:14:18] Sending heartbeat to lxd3.[…]:8443
DBUG[04-28|17:14:18] Sending heartbeat request to lxd3.[…]:8443
DBUG[04-28|17:14:19] Successful heartbeat for lxd3.[…]:8443
DBUG[04-28|17:14:20] Sending heartbeat to lxd2.[…]:8443
DBUG[04-28|17:14:20] Sending heartbeat request to lxd2.[…]:8443
DBUG[04-28|17:14:20] Successful heartbeat for lxd2.[…]:8443

Thread 8 "lxd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc929e640 (LWP 22706)]
0x00007ffff7cd5401 in vfsWalFrameGet (frame=<synthetic pointer>, page_size=4096, index=1, w=0x7fffac008ea8) at src/vfs.c:450
450     src/vfs.c: No such file or directory.
(gdb) bt
#0  0x00007ffff7cd5401 in vfsWalFrameGet (frame=<synthetic pointer>, page_size=4096, index=1, w=0x7fffac008ea8) at src/vfs.c:450
#1  vfsWalWrite (offset=32, amount=24, buf=0x7fffc929a230, w=0x7fffac008ea8) at src/vfs.c:1033
#2  vfsFileWrite (file=<optimized out>, buf=0x7fffc929a230, amount=24, offset=32) at src/vfs.c:1082
#3  0x00007ffff7d6313b in walWriteOneFrame.isra.0 () from /usr/lib64/libsqlite3.so.0
#4  0x00007ffff7dca111 in pagerWalFrames () from /usr/lib64/libsqlite3.so.0
#5  0x00007ffff7dd6431 in sqlite3PagerCommitPhaseOne () from /usr/lib64/libsqlite3.so.0
#6  0x00007ffff7dd68d3 in sqlite3BtreeCommitPhaseOne.part.0 () from /usr/lib64/libsqlite3.so.0
#7  0x00007ffff7ddc381 in sqlite3VdbeHalt () from /usr/lib64/libsqlite3.so.0
#8  0x00007ffff7e0c423 in sqlite3VdbeExec () from /usr/lib64/libsqlite3.so.0
#9  0x00007ffff7e136c0 in sqlite3_step () from /usr/lib64/libsqlite3.so.0
#10 0x00007ffff7ccf5fc in leaderExecV2 (req=0x7fffac430988) at src/leader.c:344
#11 execBarrierCb (barrier=<optimized out>, status=<optimized out>) at src/leader.c:381
#12 0x00007ffff7ccfce6 in leader__barrier (l=<optimized out>, barrier=0x7fffac430998, cb=0x7ffff7ccf550 <execBarrierCb>) at src/leader.c:426
#13 0x00007ffff7ccd642 in handle_exec_sql_next (req=req@entry=0x7fffac430ac0, cursor=cursor@entry=0x7fffc929a990) at src/gateway.c:441
#14 0x00007ffff7cce757 in handle_exec_sql (cursor=0x7fffc929a990, req=0x7fffac430ac0) at src/gateway.c:476
#15 gateway__handle (g=g@entry=0x7fffac430950, req=req@entry=0x7fffac430ac0, type=8, cursor=cursor@entry=0x7fffc929a990, buffer=buffer@entry=0x7fffac430a90, cb=cb@entry=0x7ffff7ccb830 <gateway_handle_cb>) at src/gateway.c:974
#16 0x00007ffff7ccbabb in read_request_cb (transport=<optimized out>, status=<optimized out>) at src/conn.c:143
#17 0x00007ffff76932f9 in uv__read () from /usr/lib64/libuv.so.1
#18 0x00007ffff7693f28 in uv__stream_io () from /usr/lib64/libuv.so.1
#19 0x00007ffff7699e93 in uv__io_poll () from /usr/lib64/libuv.so.1
#20 0x00007ffff7688d34 in uv_run () from /usr/lib64/libuv.so.1
#21 0x00007ffff7cd2388 in taskRun (d=0x25f0cb0) at src/server.c:499
#22 taskStart (arg=0x25f0cb0) at src/server.c:518
#23 0x00007ffff7ec2e5e in start_thread () from /lib64/libpthread.so.0
#24 0x00007ffff7be3d1f in clone () from /lib64/libc.so.6

So if I read that correctly, it is segfaulting in sqlite3 during what looks like an iteration over results using a cursor; maybe a race condition with the sync causes a stale cursor (hence it only happening when there's latency)? Again, I don't have the slightest hint of a clue about the inner workings of any of sqlite, dqlite, or LXD.

I'll dig into it some more tomorrow and provide the coredump etc.

There also seems to be an issue (or at least it looks like that to me) with nodes when the client (i.e. the other LXD instance) unexpectedly crashes during the TLS handshake. Normally this wouldn't happen, but in this case a crash of one instance also took down two others, which probably shouldn't happen either?

2021/04/28 21:28:56 http: TLS handshake error from […]:41526: EOF
2021/04/28 21:28:58 http: TLS handshake error from […]:43280: tls: client offered only unsupported versions: []

Thread 16 "lxd" received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fffa5fc7640 (LWP 23780)]
syscall.Syscall () at /usr/lib/go/src/syscall/asm_linux_amd64.s:26
26              CMPQ    AX, $0xfffffffffffff001
(gdb) bt
#0  syscall.Syscall () at /usr/lib/go/src/syscall/asm_linux_amd64.s:26
#1  0x0000000000493cba in syscall.write (fd=60, p=[]uint8, n=<optimized out>, err=...) at /usr/lib/go/src/syscall/zsyscall_linux_amd64.go:914
#2  0x00000000004fecda in syscall.Write (fd=60, p=[]uint8 = {...}, n=<optimized out>, err=<error reading variable: access outside bounds of object referenced via synthetic pointer>) at /usr/lib/go/src/syscall/syscall_unix.go:212
#3  internal/poll.ignoringEINTRIO (fd=60, p=[]uint8 = {...}, fn=<optimized out>) at /usr/lib/go/src/internal/poll/fd_unix.go:581
#4  internal/poll.(*FD).Write (fd=0xc000290600, p=..., ~r1=<optimized out>, ~r2=...) at /usr/lib/go/src/internal/poll/fd_unix.go:274
#5  0x00000000005b4aef in net.(*netFD).Write (fd=0xc000290600, p=..., nn=<optimized out>, err=...) at /usr/lib/go/src/net/fd_posix.go:73
#6  0x00000000005c648e in net.(*conn).Write (c=0xc000010518, b=..., ~r1=<optimized out>, ~r2=...) at /usr/lib/go/src/net/net.go:195
#7  0x0000000000625ec2 in crypto/tls.(*Conn).write (c=0xc000186380, data=..., ~r1=<optimized out>, ~r2=...) at /usr/lib/go/src/crypto/tls/conn.go:911
#8  0x0000000000626372 in crypto/tls.(*Conn).writeRecordLocked (c=0xc000186380, typ=21 '\025', data=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ~r2=<optimized out>, ~r3=...) at /usr/lib/go/src/crypto/tls/conn.go:979
#9  0x00000000006257ac in crypto/tls.(*Conn).sendAlertLocked (c=0xc000186380, err=0 '\000', ~r1=...) at /usr/lib/go/src/crypto/tls/conn.go:812
#10 0x0000000000628eb0 in crypto/tls.(*Conn).closeNotify (c=0xc000186380, ~r0=...) at /usr/lib/go/src/crypto/tls/conn.go:1361
#11 0x0000000000628c37 in crypto/tls.(*Conn).Close (c=0xc000186380, ~r0=...) at /usr/lib/go/src/crypto/tls/conn.go:1330
#12 0x00000000006eaea2 in net/http.(*conn).close (c=0xc0005881e0) at /usr/lib/go/src/net/http/server.go:1705
#13 0x000000000070f794 in net/http.(*conn).serve.func1 (c=0xc0005881e0) at /usr/lib/go/src/net/http/server.go:1828
#14 0x00000000006ec4b6 in net/http.(*conn).serve (c=0xc0005881e0, ctx=<error reading variable: access outside bounds of object referenced via synthetic pointer>) at /usr/lib/go/src/net/http/server.go:1911
#15 0x000000000047e141 in runtime.goexit () at /usr/lib/go/src/runtime/asm_amd64.s:1371
#16 0x000000c0005881e0 in ?? ()
#17 0x00000000017e4728 in timerCtx,context.Context ()
#18 0x000000c000216bc0 in ?? ()
#19 0x0000000000000000 in ?? ()

stgraber commented 3 years ago

@MathieuBordere does that help?

MathieuBordere commented 3 years ago

@stgraber Yes, thanks @benaryorg, looking into it.

MathieuBordere commented 3 years ago

@benaryorg I released new versions of both libdqlite (https://github.com/canonical/dqlite/releases/tag/v1.7.0) & libraft (https://github.com/canonical/raft/releases/tag/v0.10.1) containing various bugfixes. Can you try to reproduce again with these versions please? Much appreciated!

benaryorg commented 3 years ago

Seems I read over only half the message in my morning tiredness and updated just libdqlite to 1.7.0, not raft. So far the cluster had been running for 5h+ without incident; the unresponsiveness persisted though.

Past tense, since I just updated raft to 0.10.1. The unresponsiveness persists; runtimes of lxc list --fast for the cluster (which has no containers) are as follows:

| node | region | database | response time |
|------|--------|----------|---------------|
| lxd2 | ovh-gra (EU) | true | 0.440s |
| lxd3 | ovh-bhs (NA) | false | 6.916s |
| lxd4 | netcup (EU) | true | 0.014s |
| lxd5 | ovh-bhs (NA) | true | 5.539s |

I'll report back as soon as I have more info on the crashes, or if it doesn't crash within around 20h or so.

benaryorg commented 3 years ago

The cluster has been up for 20h straight with no segmentation faults. At this point I also started creating containers, which seems to be working as well.


However the API is still severely unresponsive.

Considering that on a different set of hosts with ~85ms latency, which are not clustered, I can request /1.0/instances/{name}/state and get a response within 0.692s, these times are rather frightening:

| node | region | database | response time |
|------|--------|----------|---------------|
| lxd2 | ovh-gra (EU) | true | 2.338s |
| lxd3 | ovh-bhs (NA) | false | 5.992s |
| lxd4 | netcup (EU) | true | 1.612s |
| lxd5 | ovh-bhs (NA) | true | 7.640s |

The command was as follows, major-toad being a container running on lxd2:

time lxc query "/1.0/instances/major-toad/state"

Not to mention that container creation on a host other than lxd2 or lxd4 takes around 10-20 minutes, which seems a bit excessive.

stgraber commented 3 years ago

Might be interesting to do time lxd sql global "SELECT * FROM instances;" from each of the servers.

stgraber commented 3 years ago

The query above will need to at least do:

This can easily expand to dozens of DB queries, each taking multiple packets, each subject to your 80ms latency. So in the worst-case scenario, where you're querying a server other than the one running the instance AND your database leader is on a server far away, you can end up with hundreds of packets being sent to various cluster nodes, and it's easy to get to high latencies.
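
As a rough illustration with made-up numbers: if a single request fans out into 15 DB queries and each query takes two round trips to a leader 80ms away, that's already 15 × 2 × 80ms = 2.4s spent purely on the network, before LXD or the database do any real work.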

An LXD cluster is generally expected to be run within the same rack, or at least within the same datacenter/region. That's not to say we're not interested in improving performance though; if there are obvious wastes happening, especially at the DB query layer, fixing them will improve performance significantly for you and reduce load for others as well.

benaryorg commented 3 years ago

time lxd sql global "SELECT * FROM instances;" on all nodes (at the same time):

| node | region | database | response time |
|------|--------|----------|---------------|
| lxd2 | ovh-gra (EU) | true | 0.092s |
| lxd3 | ovh-bhs (NA) | false | 0.394s |
| lxd4 | netcup (EU) | true | 0.030s |
| lxd5 | ovh-bhs (NA) | true | 1.658s |

> LXD cluster is generally expected to be run within the same rack

I assumed so when I read various blog posts, but I thought that especially in my case it couldn't possibly perform that badly, since there are barely any write operations (in terms of container creation or migration). I'll elaborate a little on my use case, in case it helps in any way or you have any ideas on how to work around the latency.

Currently I'm building on the idea of having IPv6 networks with SLAAC on every host (each with their own /64) and assigning FQDNs to every container, so that even when they are migrated and their IP changes they stay reachable via DNS. That way I get rid of any software dependency when it comes to the raw networking, i.e. no masquerading, no VPN, and no SDN. However, the DNS part is what's bugging me right now: my original idea was to run a powerdns server on each host node, with NS entries for each of them on a subdomain, and to provide a pipe-command that queries the LXD API for a container's addresses and returns all scope=global and family=inet6 addresses as AAAA records. On top of that, some xinetd, nginx, or relayd magic to proxy IPv4 requests to IPv6 (LXD's forwarding would only attach it to one host node, and I'd want it on all host nodes so migrations don't break anything).
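
To make that pipe-backend idea a bit more concrete, here's a rough sketch of the lookup it would boil down to, using the LXD Go client against the local unix socket. The zone name and TTL are illustrative, and the actual PowerDNS pipe protocol handling is omitted:

```go
package main

import (
	"fmt"
	"log"
	"os"

	lxd "github.com/lxc/lxd/client"
)

func main() {
	name := os.Args[1] // container name, e.g. "major-toad"

	// Talk to the local LXD daemon over its unix socket.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Same data as `lxc query /1.0/instances/<name>/state`.
	state, _, err := c.GetInstanceState(name)
	if err != nil {
		log.Fatal(err)
	}

	// Emit one AAAA record per global-scope IPv6 address on any interface.
	for _, iface := range state.Network {
		for _, addr := range iface.Addresses {
			if addr.Family == "inet6" && addr.Scope == "global" {
				fmt.Printf("%s.lxd.example.net. 300 IN AAAA %s\n", name, addr.Address)
			}
		}
	}
}
```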

So the bulk of what LXD would do is keep track of the containers and occasionally allow migration or execution; thus I thought very few writes would happen on dqlite and performance should hold up well even with high latency. I'd prefer to keep the cluster rather than separate remotes, since it enforces name-uniqueness and I only have to make one request against the cluster, rather than querying every node (possibly in parallel) to determine which node a container resides on and what IPv6 address it has.

Currently my alternatives look very slim, since there's barely anything but LXD that works with system containers rather than applications (as k8s does, for example). I'm thinking of repurposing the nodes as a Ceph cluster for the geo-redundancy, spanning across DCs, and building a small LXD cluster in a single DC, where the cluster is well automated so that in case of a simultaneous failure it can be quickly rebuilt and no data is lost thanks to Ceph's redundancy, but that's a different matter altogether.

stgraber commented 3 years ago

So the issue here is the assumption that only DB writes incur latency. In order to get consistent reads, all DB access, read and write, is sent to the leader. You do save some latency on reads in that they don't cause data to be replicated, but they aren't served from the local server unless you happen to be on the leader.

Based on what you showed above, your DB leader is lxd4, with lxd2 and lxd5 being voters. lxd2 has low-latency access to lxd4, so its query latency was reasonable. The other two need to cross an ocean to make the same query, so they get much higher latency due to the round trips. It's a bit odd that one of the two is getting so much higher latency though, considering they're both coming from bhs.

I'd definitely not recommend an inter-continental cluster, and I wouldn't even recommend a multi-site/multi-region cluster. With the latencies above, though, it looks like you'd get a much more usable experience if you had one cluster for North America and one for Europe. Not sure how easy it would be for you to try something like that, or whether it'd even be useful.

stgraber commented 3 years ago

Closing the issue as I don't think there's anything strictly actionable here.

The dqlite fixes seem to have addressed the stability issues you first described, and now you're pretty much just dealing with latency piling up. As mentioned above, we're unlikely to spend much time fixing latency-related issues when dealing with a cluster across two continents, but issues related to lower-latency scenarios (the 10ms range, I guess) we'd have much more interest in looking at.

That's absolutely not to say we wouldn't be happy to help review and merge improvements for high-latency environments. It's not something I'd expect us to fully support, as the target of LXD cluster is a single rack or a set of instances within a cloud region, but improvements are improvements, and what helps a high-latency environment will also improve a low-latency one.