canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

lxd fails to start after reboot when core.storage_buckets_address is set to a port on the private lxdbr0 ipv4 address #11611

Closed: melato closed this issue 1 year ago

melato commented 1 year ago

Issue description

lxd fails to start after reboot when core.storage_buckets_address is set to a port on the private lxdbr0 ipv4 address.

Steps to reproduce

1. Install LXD on a new server. (I used a Hetzner image snapshot that had ZFS and LXD 5.12, so it was not the latest version.) On top of the defaults, the LXD configuration notably has these values:
       core.https_address: :8443
       ipv6.dhcp.stateful: true
2. lxc config set core.storage_buckets_address 10.91.97.1:8555   # 10.91.97.1 is the IPv4 address of lxdbr0
3. sudo zfs create z/storage/s -p
4. lxc storage create s zfs source=z/storage/s
5. lxc storage bucket create s apple
6. Use minio's mc to perform some operations in the bucket:
       mc --insecure alias set local https://10.91.97.1:8555 <key> <secret>
       date | mc --insecure pipe local/apple/date
       mc --insecure ls local/apple
7. reboot
8. lxc list now fails with:
       Error: Get "http://unix.socket/1.0": read unix @->/var/snap/lxd/common/lxd-user/unix.socket: read: connection reset by peer
9. journalctl -u snap.lxd.daemon -n 300 contains this line (see the check sketched after this list):
       Error: Bind network address: listen tcp 10.91.97.1:8555: bind: cannot assign requested address
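
As a quick check of where this goes wrong, the following can confirm that the daemon tries to bind before lxdbr0 carries its address. This is a small sketch; the interface, address and port are the ones from the reproduction above and would need adjusting for other setups:

    # does lxdbr0 currently carry the address LXD wants to bind to?
    ip -4 addr show dev lxdbr0

    # is anything actually listening on the storage buckets port?
    ss -tlnp | grep ':8555' || echo "nothing listening on 8555"

    # look for the bind error in the daemon's log for this boot
    journalctl -u snap.lxd.daemon -b --no-pager | grep 'Bind network address'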

Fixed it like this:

    sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db
    sqlite> DELETE FROM config WHERE key='core.storage_buckets_address';

    sudo systemctl start snap.lxd.daemon
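
To confirm the recovery, checks along these lines work (a sketch; the paths assume the snap install used above):

    # the offending key should no longer be present in the local database
    sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config;"

    # and the daemon should answer over the unix socket again
    lxc info | head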

This was easily reproducible. After the fix I did it again:

    lxc config set core.storage_buckets_address 10.91.97.1:8555
    reboot

and got the same error at the lxc list step (step 8 above).

Required information

stgraber commented 1 year ago

I've had similar trouble with some of the other listeners. In general, other than the cluster listener, we should be able to have the others issue a warning (both in log and warning API) and keep retrying in the background.

tomponline commented 1 year ago

Reported from https://discuss.linuxcontainers.org/t/lxc-daemon-fails-to-start-if-it-cannot-bind-to-bgp-address/15293/5

melato commented 1 year ago

Could it be that LXD does not wait for lxdbr0 to initialize before attempting to start some listeners on it?

This problem does not happen with core.https_address. I set it to {lxdbr0-ipv4}:8443 and rebooted, and lxc was operational. Using "netstat -nat" I saw that LXD started listening on {lxdbr0-ipv4}:8443 several seconds later. I used the same LXD version as above.
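
For anyone wanting to observe the same thing, a crude way to time when the listener appears after boot, using the address and port from this comment:

    # poll until LXD's HTTPS listener shows up on the bridge address, then print when
    until netstat -nat | grep -q ':8443.*LISTEN'; do sleep 1; done; date
    ip -4 addr show dev lxdbr0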

tomponline commented 1 year ago

Yeah listeners are started before networks. But adding retries would solve this also.
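
Until that retry logic exists, one crude external workaround is to wait for lxdbr0 to get its address after boot and then give the daemon another chance to bind. A rough sketch, using the service name and address from this report (run as root after boot, e.g. from cron's @reboot):

    #!/bin/sh
    # wait up to ~60s for lxdbr0 to carry the address the listener needs
    for i in $(seq 60); do
        ip -4 addr show dev lxdbr0 | grep -q '10.91.97.1' && break
        sleep 1
    done
    # then restart the daemon so it can bind its listeners
    systemctl restart snap.lxd.daemon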

kamzar1 commented 1 year ago

I had it with a proxy device config:

  some_rpc:
    connect: tcp:127.0.0.1:8545
    listen: tcp:192.168.2.40:8545
    type: proxy

After a host reboot, the private IP 192.168.2.40 had vanished, and as a consequence some containers refused to start. I would rather see more forgiving behaviour than having to go through multiple containers' configs and start them manually.
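
For reference, a proxy device like the one above is typically added with something along these lines (the instance name "c1" is a placeholder):

    # forward a host address/port into the container
    lxc config device add c1 some_rpc proxy \
        connect=tcp:127.0.0.1:8545 \
        listen=tcp:192.168.2.40:8545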

stgraber commented 1 year ago

"I would rather see more forgiving behaviour than having to go through multiple containers' configs and start them manually."

We could introduce an optional key on the proxy as is done with many other device types, but that may still cause issues as the container will then be allowed to start with that proxy device missing. LXD doesn't really have a way to keep track of every single little detail on your system, so it's not going to be practical for us to start monitoring IP addresses and port usage on the host system to then start instances.

I believe we already have background retry logic for instances on startup, so if an instance still can't start after the 30s (or whatever) retry delay we have, then your system isn't likely to fix itself without human intervention, at which point the human in question can also start the affected instances.
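
If it does come to manual intervention, starting everything that stayed down can at least be scripted. A small sketch, assuming a client that supports status filters and CSV output (it simply tries to start every stopped instance, so adjust the filter if some are meant to stay stopped):

    # once the host network is back, try to start anything still stopped
    for name in $(lxc list status=stopped --columns n --format csv); do
        lxc start "$name"
    done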