canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Regression: "error: websocket: close 1006 (abnormal closure): unexpected EOF" #3770

Closed niedbalski closed 7 years ago

niedbalski commented 7 years ago

Required information

Issue description

The script appended below works without issues on 2.12, but fails on the 2.17 (xenial-backports) version.

I have bisected the regression and tracked it down to the following offending commit:

And the subsequent commit:

I built LXD/LXC from the previous commit https://github.com/lxc/lxd/commit/665f9013f70ae99a92c6c8c8bd08892b5dfd6cbf and the script works fine there, as it does with any other version prior to the offending commits.
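
For reference, a rough sketch of how such a bisect can be driven with the reproducer below; the release tag names are assumptions and may differ from the actual tags in the repository:

# Sketch only: bisect between the last known-good and first known-bad releases.
git clone https://github.com/lxc/lxd.git && cd lxd
git bisect start
git bisect bad  lxd-2.17    # assumed tag of the first broken release
git bisect good lxd-2.12    # assumed tag of the last good release
# At each step: rebuild lxd/lxc from the checked-out commit, restart the daemon,
# run ./start.sh and watch for the close 1006 errors, then mark the commit with
# "git bisect good" or "git bisect bad" accordingly.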

start.sh

#!/bin/bash
# Launch 20 copies of the reproducer in parallel, one container each.
for x in $(seq 1 20); do
    ./reproducer.sh "test$x" &
done
wait

reproducer.sh

#!/bin/bash
# $1 is the name of the container to create and exercise.
NAME=$1

# Create the container from the ubuntu: image.
lxc init ubuntu: "$NAME"

# Wait until every parallel "lxc init" has finished before starting anything.
while pgrep -f 'lxc init' >/dev/null; do
    echo "$NAME: wait for all lxc done"
    sleep 2
done

lxc start "$NAME"

# Keep hammering the container with short exec sessions.
while true; do
    lxc exec "$NAME" -- sleep 5
done

Please see the full reproducer description below.

Steps to reproduce

  1. Run lxd init and create a zfs pool as follows
    +---------+-------------+--------+--------------------------------+---------+
    |  NAME   | DESCRIPTION | DRIVER |             SOURCE             | USED BY |
    +---------+-------------+--------+--------------------------------+---------+
    | default |             | zfs    | /var/lib/lxd/disks/default.img | 19      |
    +---------+-------------+--------+--------------------------------+---------+
  2. Run ./start.sh (described previously)
  3. Check for the output of the previous command and wait for the multiple occurrences of the error to appear:

2017-09-05 20:23:57 Creating test3
2017-09-05 20:23:57 Creating test15
2017-09-05 20:23:57 Creating test13
[...snip]
2017-09-05 20:23:58 Creating test4
2017-09-05 20:23:58 Creating test14
test11: wait for all lxc done
error: websocket: close 1006 (abnormal closure): unexpected EOF
2017-09-05 20:24:15 test8: wait for all lxc done
2017-09-05 20:24:17 test11: wait for all lxc done
2017-09-05 20:24:17 test8: wait for all lxc done
error: websocket: close 1006 (abnormal closure): unexpected EOF
test13: wait for all lxc done
[...]

Information to attach

strace-lxd-monitor.log.txt
lxd-debug.log.txt
lxc-client-debug.log.txt
lsof-debug.log.txt

stgraber commented 7 years ago

Unable to reproduce this on my local system. Will try on a clean Ubuntu 16.04 next.

stgraber commented 7 years ago

Same result on a fresh system:

stgraber@castiana:~/data/code/lxc/lxd-pkg-snap (latest-candidate)$ rssh ubuntu@vm01.maas.mtl
Warning: Permanently added 'vm01.maas.mtl' (ECDSA) to the list of known hosts.
root@vm01:~# apt update
Hit:1 http://us.archive.ubuntu.com/ubuntu xenial InRelease
Hit:2 http://us.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:3 http://us.archive.ubuntu.com/ubuntu xenial-backports InRelease
Hit:4 http://us.archive.ubuntu.com/ubuntu xenial-security InRelease
Reading package lists... Done                     
Building dependency tree       
Reading state information... Done
2 packages can be upgraded. Run 'apt list --upgradable' to see them.
root@vm01:~# apt dist-upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  snapd ubuntu-core-launcher
2 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 10.7 MB of archives.
After this operation, 4,466 kB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 ubuntu-core-launcher amd64 2.27.5 [1,562 B]
Get:2 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 snapd amd64 2.27.5 [10.7 MB]
Fetched 10.7 MB in 0s (45.8 MB/s)
(Reading database ... 59856 files and directories currently installed.)
Preparing to unpack .../ubuntu-core-launcher_2.27.5_amd64.deb ...
Unpacking ubuntu-core-launcher (2.27.5) over (2.26.10) ...
Preparing to unpack .../snapd_2.27.5_amd64.deb ...
Warning: Stopping snapd.service, but it can still be activated by:
  snapd.socket
Unpacking snapd (2.27.5) over (2.26.10) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up snapd (2.27.5) ...
Installing new version of config file /etc/apparmor.d/usr.lib.snapd.snap-confine.real ...
Setting up ubuntu-core-launcher (2.27.5) ...
root@vm01:~# apt install -t xenial-backports lxd lxd-client
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ebtables xdelta3
Suggested packages:
  criu lxd-tools
The following NEW packages will be installed:
  ebtables xdelta3
The following packages will be upgraded:
  lxd lxd-client
2 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 5,123 kB of archives.
After this operation, 1,945 kB disk space will be freed.
Do you want to continue? [Y/n] 
Get:1 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 ebtables amd64 2.0.10.4-3.4ubuntu2 [79.4 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 lxd amd64 2.17-0ubuntu2~ubuntu16.04.1 [3,000 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 lxd-client amd64 2.17-0ubuntu2~ubuntu16.04.1 [1,976 kB]
Get:4 http://us.archive.ubuntu.com/ubuntu xenial/universe amd64 xdelta3 amd64 3.0.8-dfsg-1ubuntu2 [67.1 kB]
Fetched 5,123 kB in 0s (14.8 MB/s) 
Selecting previously unselected package ebtables.
(Reading database ... 59859 files and directories currently installed.)
Preparing to unpack .../ebtables_2.0.10.4-3.4ubuntu2_amd64.deb ...
Unpacking ebtables (2.0.10.4-3.4ubuntu2) ...
Preparing to unpack .../lxd_2.17-0ubuntu2~ubuntu16.04.1_amd64.deb ...
Warning: Stopping lxd.service, but it can still be activated by:
  lxd.socket
Unpacking lxd (2.17-0ubuntu2~ubuntu16.04.1) over (2.0.10-0ubuntu1~16.04.2) ...
Warning: Stopping lxd.service, but it can still be activated by:
  lxd.socket
Preparing to unpack .../lxd-client_2.17-0ubuntu2~ubuntu16.04.1_amd64.deb ...
Unpacking lxd-client (2.17-0ubuntu2~ubuntu16.04.1) over (2.0.10-0ubuntu1~16.04.2) ...
Selecting previously unselected package xdelta3.
Preparing to unpack .../xdelta3_3.0.8-dfsg-1ubuntu2_amd64.deb ...
Unpacking xdelta3 (3.0.8-dfsg-1ubuntu2) ...
Processing triggers for systemd (229-4ubuntu19) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up ebtables (2.0.10.4-3.4ubuntu2) ...
update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults
Setting up lxd-client (2.17-0ubuntu2~ubuntu16.04.1) ...
Setting up xdelta3 (3.0.8-dfsg-1ubuntu2) ...
Setting up lxd (2.17-0ubuntu2~ubuntu16.04.1) ...
Installing new version of config file /etc/init.d/lxd ...
Removing obsolete conffile /etc/init/lxd.conf ...
Setting up lxd dnsmasq configuration.

Old bridge configuration detected in /etc/default/lxd-bridge, upgrading
Unsetting deprecated profile options
Attempting to kill current lxd-bridge
Stock LXD installation detected, resetting to new defaults
Cleaning up lxd-bridge state files

Processing triggers for systemd (229-4ubuntu19) ...
Processing triggers for ureadahead (0.100.0-19) ...
root@vm01:~# lxd init
Do you want to configure a new storage pool (yes/no) [default=yes]? 
Name of the new storage pool [default=default]: 
Name of the storage backend to use (dir, btrfs, lvm) [default=dir]: 
Would you like LXD to be available over the network (yes/no) [default=no]? 
Would you like stale cached images to be updated automatically (yes/no) [default=yes]? 
Would you like to create a new network bridge (yes/no) [default=yes]? 
What should the new bridge be called [default=lxdbr0]? 
What IPv4 address should be used (CIDR subnet notation, “auto” or “none”) [default=auto]? 
What IPv6 address should be used (CIDR subnet notation, “auto” or “none”) [default=auto]? 
LXD has been successfully configured.
root@vm01:~# mkdir a
root@vm01:~# rmdir a
root@vm01:~# cd /home/ubuntu/
root@vm01:/home/ubuntu# ls -lh
total 8.0K
-rwxr-xr-x 1 ubuntu ubuntu 209 Sep  6 04:04 reproducer.sh
-rwxr-xr-x 1 ubuntu ubuntu  77 Sep  6 04:04 start.sh
root@vm01:/home/ubuntu# ./start.sh 
If this is your first time using LXD, you should also run: lxd init
To start your first container, try: lxc launch ubuntu:16.04

Creating test1
Creating test2
Creating test10
Creating test16
Creating test13
Creating test19
Creating test12
Creating test8
Creating test3
Creating test6
Creating test20
Creating test14
Creating test17
Creating test18
Creating test4
Creating test9
Creating test11
Creating test7
Creating test5
Creating test15
test9: wait for all lxc done0% (8.54MB/s)   
test19: wait for all lxc done
test18: wait for all lxc done
test1: wait for all lxc done
test13: wait for all lxc done
test14: wait for all lxc done
test3: wait for all lxc done
test16: wait for all lxc done
test11: wait for all lxc done
test8: wait for all lxc done
test10: wait for all lxc done
test2: wait for all lxc done
test15: wait for all lxc done
test5: wait for all lxc done
test6: wait for all lxc done
test20: wait for all lxc done
test7: wait for all lxc done
test4: wait for all lxc done
test12: wait for all lxc done   

Been sitting there for 5 minutes or so, running "lxc exec" against the various containers.

stgraber commented 7 years ago

What kind of load is your system under when those errors show up?

I've managed to get LXD to produce some of those when interacting with over 100 containers in parallel. At first glance they're caused by the system being so loaded that the websocket client isn't scheduled enough time to react to events before the daemon closes the session as a result of the process exiting.
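
If it helps to quantify that, something like the following can be run alongside the reproducer to record the load over time (the log file name is arbitrary):

# Sample the load average every 5 seconds while ./start.sh is running.
while true; do date; cat /proc/loadavg; sleep 5; done >> load.log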

stgraber commented 7 years ago

It looks like such errors were silently dropped in the past; the new client code is much stricter about catching websocket errors and surfacing them back to the user.

My guess is that the old code would suffer from the exact same issue but rather than showing you an error, it would simply exit 0 and never report that some of the data got lost.

stgraber commented 7 years ago

Increasing parallelism even further, we eventually also hit a bunch of "error: not found", which is what you'd expect if the client gets so busy that it fails to retrieve the operation data within 5s of the operation completing (the client attempts to fetch the state of an operation that has already been garbage collected).
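
For anyone digging into this, the operations in question can be inspected directly over the local REST API; this is only an illustrative query, and the unix socket path shown is the one used by the deb package (the snap uses /var/snap/lxd/common/lxd/unix.socket):

# List the background operations the daemon is currently tracking.
curl -s --unix-socket /var/lib/lxd/unix.socket lxd/1.0/operations
# Fetch the state of a single operation by UUID; once it has been garbage
# collected this returns "not found", matching the second error above.
curl -s --unix-socket /var/lib/lxd/unix.socket lxd/1.0/operations/<uuid>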

niedbalski commented 7 years ago

@stgraber

Thanks for looking into this.

OK, I can't reliably reproduce the problem on a newly deployed xenial virtual machine, but on a production system with moderate load (56 cores, load average around 22 over the 5-minute window) it's fairly easy to get the "error: websocket: close 1006 (abnormal closure): unexpected EOF" errors during the lxc exec loop on 2.17. Maybe as part of the reproducer we should stress the system and induce some CPU load.
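
As a sketch of what that could look like (stress-ng is just one way to do it, and the worker count and duration are arbitrary):

#!/bin/bash
# Generate background CPU load so the websocket client has to compete for
# scheduling time, then run the existing reproducer.
stress-ng --cpu "$(nproc)" --timeout 10m &
./start.sh
wait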

As you already mentioned these errors were probably dismissed in earlier versions (before https://github.com/lxc/lxd/commit/e4bd65bbff29675a8e0e24c623c7038ab89e0bbe) and this was getting unnoticed by the tools that make use of lxc exec in parallel.

Also, I'm not yet versed in the LXD internals, but assuming the idea of the system "being so loaded that it's not scheduling enough time for the websocket client to react to events before the daemon closes the session" is true, why is this happening? Shouldn't the writer wait for all the messages/data to be flushed/read on the socket before ending the session?

stgraber commented 7 years ago

The LXD daemon is pretty aggressive about websocket handling: if the client doesn't handle events in a timely manner, the connection gets dropped. That's needed so that we don't end up with a bunch of open connections on the server when a network problem occurs. It's obviously less relevant for clients using the unix socket, but the LXD daemon doesn't know how someone connected to the API.

niedbalski commented 7 years ago

@stgraber Could you please point me to the current timeouts/deadlines on the server and client side? I don't think my system was that overloaded; I'd expect latencies in the milliseconds rather than seconds, since no swap was in use at the time of the problem and the load average was high but not extreme.

Also, looking at the specific websocket error, 1006 (abnormal closure) seems to indicate that the connection was dropped without a proper close frame being sent [0], and according to the documentation this is typically caused by proxies that don't handle the termination properly.

"Used to indicate that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected."

My understanding is that at that point the client was either reading or terminating the connection when the server decided to abruptly close the connection without sending the close frame. Is that behavior expected?

[0] https://developer.mozilla.org/en-US/docs/Web/API/CloseEvent

stgraber commented 7 years ago

Sent a pull request which fixes a race in the server-side code. On my test systems this fixes at least one cause of the message you've been getting.

stgraber commented 7 years ago

@niedbalski LXD never sends a close frame because it never closes that connection. A /1.0/events API call is never-ending. The only way it can get disconnected is the server going down or a network failure, both of which cause a network disconnection without a websocket close frame.
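
As a side note, the event stream discussed here can be watched from the client side; this is only an illustration, not part of any fix:

# Stream the same /1.0/events websocket that lxc exec sessions rely on; useful
# for checking whether the client keeps up with events while the system is loaded.
lxc monitor
# Most client versions can also filter by event type, e.g.:
# lxc monitor --type=operation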

niedbalski commented 7 years ago

@stgraber

I just tested a fresh build of 2.17 with the patch for #3771 applied on top, and it seems that the reported message (close 1006) isn't being returned anymore.

Thank you!

array42-zz commented 6 years ago

Closed? I still see that message with lxd --version 3.0.0: Error: websocket: close 1006 (abnormal closure): unexpected EOF

I installed lxd using snap, the container is running, and I connected to it with exec bash and left the session open. Some time later the error occurs and the open terminal sessions close.

cyberwillis commented 6 years ago

I suspect that when the snap refreshes to an updated version, LXD kicks out users with an open session.

I was doing a job using two containers on LXD when, after some minutes, I was kicked out of both... Then, for no particular reason, I checked which version and build I was running, and saw that it had been updated less than a minute earlier.

I don't know exactly how snapd manages this, but for now I've just set a scheduled update for it.

mpontillo commented 6 years ago

Yes, I just noticed the same issue; it seems I get this error whenever the snap updates itself.

cyberwillis commented 6 years ago

@mpontillo the only thing you can do about it is to schedule your updates for a specific day of the month, or else compile your own lxd server.

https://snapdocs.labix.org/configuration-options/87

mpontillo commented 6 years ago

Yes, my workaround for now is sudo snap set lxd refresh.timer=fri,23:00-01:00; this way I'm not interrupted with updates during the work week.

cyberwillis commented 6 years ago

Great... just noticed that you can set fri1 or fri5 to pin it to the first or last Friday of the month.
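
Putting that together, a sketch of the scheduling approach discussed above; the system-wide form is shown here and the exact day/window is only an example:

# Defer automatic refreshes to the last Friday of the month, 23:00-01:00.
sudo snap set system refresh.timer=fri5,23:00-01:00
# Confirm the timer and when the next refresh is due.
snap refresh --time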

charlie-ht commented 5 years ago

I saw this today, on an unloaded system running lxd 3.16. It's not clear how to reproduce. Are there logs that can be attached?

cyberwillis commented 5 years ago

I think one way to reproduce this is to keep a session open inside a container (lxc exec ... bash) and then let the lxd snap refresh itself, or trigger a refresh manually.

Then you will see it stop your container to do the update.

ilyaevseev commented 4 years ago

The same problem on many servers with lxd 3.0.3-0ubuntu1~18.04.1 on Ubuntu 18.04.3. Happens occasionally, difficult to reproduce.

danielhz commented 3 years ago

I got this error message when running a bulk data loader for a database inside a container. The job has to load the data and create indexes, so it takes several hours and overloads the system. At some point I got "Error: websocket: close 1006 (abnormal closure): unexpected EOF." With datasets of similar size it sometimes succeeds and sometimes fails. I was doing this work from a console opened with lxc exec container bash; after the error, that bash terminal is closed. I am using LXD 4.11 with Ubuntu 18.04 on the host and Ubuntu 20.04 in the container.

stgraber commented 3 years ago

That's usually the sign of a crash from the daemon.

Look at 'journalctl -u snap.lxd.daemon' for a crash or maybe 'dmesg' for an indication of your system running out of memory and LXD being killed because of it.
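
Roughly the kind of checks meant here (the grep patterns are only examples):

# Recent LXD daemon log, including any crash or panic output.
journalctl -u snap.lxd.daemon -n 200 --no-pager
# Kernel log entries that would indicate the OOM killer hit LXD.
dmesg -T | grep -iE 'out of memory|killed process'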

cyberwillis commented 3 years ago

Hi @danielhz ,

if you were connected to the terminal at the moment snap updated LXD to a new build version, you will be kicked out until the update or upgrade process finishes.

danielhz commented 3 years ago

I have this message several times at the end of journalctl -u snap.lxd.daemon:

Feb 09 21:56:23 container lxd.daemon[869]: t=2021-02-09T21:56:23+0000 lvl=warn msg="Detected poll(POLLNVAL) event."

cyberwillis commented 3 years ago

Hi @danielhz ,

try checking the date and time your lxd snap was refreshed and see if it matches the time this happened:

snap info lxd
or snap changes

danielhz commented 3 years ago

Hi @cyberwillis,

In the host, I opened a tmux terminal. Then, I opened a connection to the container with lxc exec container bash. Inside the container I started the job. Then, I detached the tmux terminal. After some hours, I attached the terminal again, and I see that the connection with the container was lost.

@cyberwillis, I think your hypothesis that the connection was kicked out because snap was updating to a new version is very likely. Indeed, I have three machines running jobs in a tmux terminal, and the LXD connection was lost on all of them. On one of them the job continues even though the LXD connection was lost; the container is still running the job I started. I checked and I have two LXD revisions installed, 19300 and 19325. I guess that revision 19325 was installed automatically while I was running the job.

There are the following messages in the host:

$ journalctl -u snap.lxd.daemon
...
Feb 10 03:15:17 hostmachine systemd[1]: Stopping Service for snap application lxd.daemon...
Feb 10 03:15:18 hostmachine lxd.daemon[57298]: => Stop reason is: snap refresh
Feb 10 03:15:18 hostmachine lxd.daemon[57298]: => Stopping LXD
Feb 10 03:20:19 hostmachine lxd.daemon[57298]: ==> Stopped LXD
Feb 10 03:20:19 hostmachine systemd[1]: Stopped Service for snap application lxd.daemon.
Feb 10 03:20:23 hostmachine systemd[1]: Started Service for snap application lxd.daemon.
Feb 10 03:20:23 hostmachine lxd.daemon[58060]: => Preparing the system (19325)
Feb 10 03:20:23 hostmachine lxd.daemon[58060]: ==> Loading snap configuration

And I have the following information for the lxd snap:

$ snap info lxd
...
refresh-date: today at 03:20 CET

cyberwillis commented 3 years ago

Hi @danielhz ,

If you execute snap changes you will see the ID of the change that refreshed LXD; you can then run snap tasks <ID> to verify it too.

This happened to me before, so in my case it was the update that kicked me out.

There is one solution for that, if that is the case: sudo snap set system refresh.hold="2999-01-01T01:00:00-01:00", then start updating your packages manually on a regular basis.
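
Put together, the sequence looks roughly like this (the <ID> placeholder comes from snap changes, and the hold date is just a far-future placeholder as above):

# See which change refreshed the lxd snap, and inspect its tasks.
snap changes
snap tasks <ID>
# Hold automatic refreshes (far-future date), then refresh manually when convenient.
sudo snap set system refresh.hold="2999-01-01T01:00:00-01:00"
sudo snap refresh lxd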

danielhz commented 3 years ago

Hi @cyberwillis, your hypothesis is very likely right. It also helps to know how I can delay snap updates and switch to updating packages manually.

cpaelzer commented 2 years ago

Hi, this seems to be a long-running issue that comes up every now and then but stays hard to recreate or debug. I see it break in roughly 1 out of 5 of my test runs (each about 5-8 hours), which heavily utilize LXD on a 12-core system. So far I had also just assumed that it might be the snap update running in the background, and a few times that was even true - at least the timing matched. But last night I had another case where a snap update was not even close. I'm providing the info from that run for the sake of helping to find and fix the remaining underlying issue.

First of all, what I see is the same Error: websocket: close 1006 (abnormal closure): unexpected EOF that everyone else has reported so far, which is just the message for an incomplete close in the Go websocket implementation.

But this time I had a rather clean lxd snap, as I re-installed 4.0/stable yesterday to clean up some old crap.

$ snap changes
ID   Status  Spawn                   Ready                   Summary
29   Done    yesterday at 08:31 UTC  yesterday at 08:46 UTC  Remove "lxd" snap
30   Done    yesterday at 08:48 UTC  yesterday at 08:48 UTC  Install "lxd" snap from "4.0/stable" channel

For a more detailed version, I'm on 4.0/stable: 4.0.9-8e2046b 2022-03-26 (22753) 71MB -

There also hasn't been any further change since then. And for more cleanup I rebooted the machine right after that LXD re-install. The websocket issue occurred at ~22:20, about 14h after the last install/update.

In the journal of the daemon I saw some transaction issues that weren't reported here before, and timing-wise they match exactly when I found the issue in my logs.

$ journalctl -u snap.lxd.daemon -n 8
Jun 21 09:56:54 node-horsea lxd.daemon[3334]: - pidfds
Jun 21 09:56:54 node-horsea lxd.daemon[3334]: Reloaded LXCFS
Jun 21 09:57:22 node-horsea lxd.daemon[128359]: => LXD is ready
Jun 21 22:19:57 node-horsea lxd.daemon[128505]: t=2022-06-21T22:19:57+0000 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" me>
Jun 21 22:19:58 node-horsea lxd.daemon[128505]: t=2022-06-21T22:19:58+0000 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" me>
Jun 21 22:19:58 node-horsea lxd.daemon[128505]: t=2022-06-21T22:19:58+0000 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: contex>
Jun 21 22:19:59 node-horsea lxd.daemon[128505]: t=2022-06-21T22:19:59+0000 lvl=warn msg="Failed to rollback transaction after error (Local node name: sql: transaction has already been commit>
Jun 21 22:19:59 node-horsea lxd.daemon[128505]: t=2022-06-21T22:19:59+0000 lvl=eror msg="Failed to load instances for snapshot expiry" err="Local node name: sql: transaction has already been>

I checked further into what, if anything, happened around that time, but there was nothing obvious: a few ssh logins from another source and livepatch being restarted - none of it seemed to correlate. BTW - past occurrences of the websocket issue also did not come with the transaction errors; this was the first time I saw them.