Closed: MaximMonin closed this issue 2 years ago.
Some of those errors are likely coming because we're now running with websocket heartbeats enabled, so if the client or server doesn't respond for a while, it will sever the connection, potentially causing one of those `bad handshake` or `unexpected EOF` errors.
@tomponline may have some ideas for improvements in those areas.
When LXD is misbehaving, there are a few things you could do or grab which may be useful:

- `dmesg` (looking for any hung CPU or task, OOM and the like)
- `ps fauxww` (looking for processes stuck in `D` state, especially if some of those are LXD or LXC related)

When `lxc list` hangs but `?recursion=1` doesn't, this normally indicates that one or more instances are not responding when LXD fetches their status. You could try to iterate through all of your instances, running `lxc info` against them and see which ones are hanging.
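The iteration described above can be scripted. This is only a sketch, assuming the `lxc` CLI is on `PATH`; the `check_instances` name and the 5-second timeout are my own choices, not LXD defaults:

```shell
# Minimal sketch: flag instances whose status query hangs or errors.
# A per-instance timeout turns a hang into a reportable failure.
check_instances() {
    for name in $(lxc list --format csv -c n); do
        if ! timeout 5 lxc info "$name" > /dev/null 2>&1; then
            echo "hung or erroring: $name"
        fi
    done
}

# usage: check_instances
```

Any instance it prints is a good candidate for closer inspection with `lxc info` run by hand.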
Another dump that can be useful is the dump of all ongoing goroutines in LXD. You can get that with:
@stgraber I have similar problems but in a clustered environment.
```
# lxc cluster list
Error: context deadline exceeded
```
Also, commands will fail more often, like a restart when it's not done within 30s. Especially in large clusters it sometimes just takes a little bit longer... any chance to fix this?
There are no websockets (and thus no heartbeats) involved with the `lxc cluster list` command, so if you are getting `context deadline exceeded` there, it's most likely due to locking on the database. This can be caused by I/O saturation on the leader member and/or network packet loss or latency between the members. It can also be caused if the cluster has lost quorum and is no longer operational due to too many members becoming offline. I would suggest consulting the log file `/var/snap/lxd/common/lxd/logs/lxd.log` for each member and seeing if there are any more errors that could help identify the issue.
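To pick the relevant lines out of those member logs quickly, a small helper can surface the most recent database-related errors. A minimal sketch; the `scan_log` name and the grep pattern are my own choices, and the path shown is just the default snap location mentioned above:

```shell
# Sketch: print the last few database/quorum-related errors from an lxd.log.
scan_log() {
    grep -E 'context deadline exceeded|Transaction timed out|quorum' "$1" | tail -n 5
}

# usage: scan_log /var/snap/lxd/common/lxd/logs/lxd.log
```

Running this on each member makes it easier to spot whether one member (for example the leader) is the one timing out.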
> We have been using lxd 4.21 for the last month and slowness has increased with the new version. Now we are getting new lxd errors:
>
> * failed to begin transaction: context deadline exceeded
> * websocket: bad handshake
> * websocket: close 1006 (abnormal closure): unexpected EOF
The first of those errors is to do with the database and is not to do with websockets (or heartbeats), see my post https://github.com/lxc/lxd/issues/9806#issuecomment-1033521647 on how to investigate that more.
The second of those errors suggests that the websocket has never successfully been connected (as opposed to timing out due to inactivity). This can happen (at least with `lxc copy`; I'm not sure if it's relevant to `lxc exec` though) when the target member of a copy fails to connect back to the source member within 10s, and the source member gives up. Similarly to the first error, this can suggest that the target member is not performing well or that the network between it and the leader is experiencing packet loss or high latency (LXD cluster expects to run in a LAN network environment rather than over a WAN).
See https://github.com/lxc/lxd/issues/9861 for a potentially similar issue.
The last error could be to do with a close caused by a timeout, although for `lxc exec` there is no websocket timeout currently implemented (beyond the standard TCP keepalive timeout that has been there for years).
My suspicion is that this is due to database issues caused either by I/O, network or excessive traffic on the leader member.
Please can you provide the specific commands along with the errors you get (rather than just the errors as now), and any associated logs from `/var/snap/lxd/common/lxd/logs/lxd.log` on the affected members?
> Some of those errors are likely coming because we're now running with websocket heartbeats
@stgraber I checked just now and, for clarity: right now, only the events websocket connections have websocket-level heartbeats implemented (although exec sessions will have an associated events listener), and only the dqlite proxy and the migration source websocket have TCP user_timeouts (send) applied to them.
- events listener: https://github.com/lxc/lxd/blob/49f90c047ced40b933e167016197b0d6fc9fb660/lxd/events/common.go#L43-L56
- migrate source: https://github.com/lxc/lxd/blob/42bd6dcf91aa3fb0ae4a2ece1ff9c0c1fadd6a98/lxd/migrate.go#L185
- dqlite proxy: https://github.com/lxc/lxd/blob/42bd6dcf91aa3fb0ae4a2ece1ff9c0c1fadd6a98/lxd/cluster/gateway.go#L1074
> My suspicion is that this is due to database issues caused either by I/O, network or excessive traffic on the leader member. Please can you provide the specific commands along with the errors you get (rather than just the errors as now), and any associated logs from `/var/snap/lxd/common/lxd/logs/lxd.log` on the affected members?
All I see in the lxd logs are:
```
t=2022-02-08T11:35:18+0200 lvl=warn msg="Unexpected read on stdout websocket, killing command" PID=442238 err="websocket: close 1006 (abnormal closure): unexpected EOF" instance=ct4874 interactive=false number=1 project=default
t=2022-02-08T11:37:16+0200 lvl=warn msg="Unexpected read on stdout websocket, killing command" PID=448205 err="websocket: close 1006 (abnormal closure): unexpected EOF" instance=ct4874 interactive=false number=1 project=default
t=2022-02-08T11:44:41+0200 lvl=warn msg="Unexpected read on stdout websocket, killing command" PID=476069 err="websocket: close 1006 (abnormal closure): unexpected EOF" instance=ct4874 interactive=false number=1 project=default
```
> Unexpected read on stdout websocket
OK, this is more useful.
The reason LXD is closing the connection is that it's receiving data on its stdout channel going to the client (which it should not, as that channel should only be used to send stdout data from LXD to the client).
See:
This "data" can include the stdout channel being disconnected, which then causes LXD to kill the command and end the exec session to avoid leaving goroutines and processes around.
Can you give more information/examples on your usage that is triggering this scenario please?
Also, what version of the `lxc` command are you using? (`lxc --version`)
> Can you give more information/examples on your usage that is triggering this scenario please?
I checked our host's Go client log and it shows that there aren't any websocket errors in this time frame. All commands successfully returned results. But I found this kind of error in a different time frame... trying to investigate more...
> Also, what version of the `lxc` command are you using? (`lxc --version`)
```
# lxc --version
4.22
```
> go client log
Ah, so are you using a client different than the `lxc` CLI client tool?
Yes, we are using https://github.com/lxc/lxd/tree/master/client version 4.21.
OK, investigation results:
```
{"errors":"websocket: bad handshake","timestamp":"2022-02-06T01:38:41.870Z"}...
```

```
t=2022-02-06T03:38:36+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:38:39+0200 lvl=eror msg="Error getting disk usage" err="failed to begin transaction: context deadline exceeded" instance=ct3914 instanceType=container project=default
t=2022-02-06T03:38:39+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:38:39+0200 lvl=eror msg="Error getting disk usage" err="failed to begin transaction: context deadline exceeded" instance=ct9293 instanceType=container project=default
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=eror msg="Error loading storage pool" err="failed to begin transaction: context deadline exceeded" instance=ct9285 instanceType=container project=default
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:13+0200 lvl=eror msg="Error loading storage pool" err="failed to begin transaction: context deadline exceeded" instance=ct9354 instanceType=container project=default
t=2022-02-06T03:39:14+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:16+0200 lvl=warn msg="Transaction timed out. Retrying once" err="failed to begin transaction: context deadline exceeded" member=1
t=2022-02-06T03:39:33+0200 lvl=eror msg="Error loading storage pool" err="failed to begin transaction: context deadline exceeded" instance=ct9303 instanceType=container project=default
```
It seems to be the nightly backup use case.
Yeah, so these are issues accessing the database rather than anything to do with websockets; it looks like the machine may be suffering I/O issues at that time.
Closing for inactivity, and because all previous posts point towards other I/O related issues which were causing the database to time out.
Required information
Issue description
We are using the `lxc exec` command to do some container management. The average reply from an `lxc exec` call is about 120-150 ms on a system with 1 container and no load. We have servers with 300+ lxd containers and the average reply increases to 150-250 ms. But at some moments lxd hangs and the reply can take 5 or 10 seconds, or even minutes. We have been using lxd 4.21 for the last month and slowness has increased with the new version. Now we are getting new lxd errors:
During the last month we had 2 occurrences where lxd completely hung and all containers were unavailable (all `lxc` commands did not produce any output).
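The latency figures above can be reproduced with a quick loop. A rough sketch; `measure_exec`, the container name and the run count are illustrative choices, not anything prescribed by LXD:

```shell
# Rough sketch: run N no-op execs so `time` can report the total wall time;
# divide by N for the average per-call exec latency.
measure_exec() {
    container="$1"
    runs="$2"
    i=0
    while [ "$i" -lt "$runs" ]; do
        lxc exec "$container" -- /bin/true
        i=$((i + 1))
    done
}

# usage: time measure_exec ct4874 20
```

Running this periodically (e.g. during the nightly backup window) would show whether the slowdowns correlate with specific times of day.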
Steps to reproduce
Information to attach