canonical / pylxd

Python module for LXD
https://pylxd.readthedocs.io/en/latest/

Problem with container.execute() timeout/connection reset #376

Open · CluelessTechnologist opened this issue 5 years ago

CluelessTechnologist commented 5 years ago

Hi,

We are having some issues with timeouts/connection resets. We are building a script that deploys a MongoDB cluster, and everything appears to work, but partway through the run pylxd times out: the script pauses, then continues, but prints the error below. After entering the containers, MongoDB has in fact started successfully and everything looks fine, yet we still get this error when we run the script.

Versions: LXD 3.0.3, pylxd 2.2.10


Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ws4py/websocket.py", line 394, in once
    b = self.sock.recv(self.reading_buffer_size)
  File "/usr/lib64/python3.7/ssl.py", line 1056, in recv
    return self.read(buflen)
  File "/usr/lib64/python3.7/ssl.py", line 931, in read
    return self._sslobj.read(len)
ConnectionResetError: [Errno 104] Connection reset by peer
Failed to receive data
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ws4py/websocket.py", line 394, in once
    b = self.sock.recv(self.reading_buffer_size)
  File "/usr/lib64/python3.7/ssl.py", line 1056, in recv
    return self.read(buflen)
  File "/usr/lib64/python3.7/ssl.py", line 931, in read
    return self._sslobj.read(len)
ConnectionResetError: [Errno 104] Connection reset by peer
ContainerExecuteResult(exit_code=0, stdout='Success: MongoDB started sucessfully\n', stderr='')

Excerpt from script that deals with container.execute():

def push_file(conName, impPath, expPath):
    # Read the local file and push it to the given path inside the container.
    with open(impPath) as f:
        filedata = f.read()
    container = client.containers.get(conName)
    container.files.put(expPath, filedata)

def start_mongodb(conName):
    if 'query' not in conName:
        container = client.containers.get(conName)
        container.execute(["systemctl", "start", "mongod"])
        print(container.execute(["systemctl", "status", "mongod"]))
    else:
        container = client.containers.get(conName)
        container.execute(["systemctl", "start", "mongos"])
        print(container.execute(["systemctl", "is-active", "mongos"]))
ajkavanagh commented 5 years ago

It looks like lxd is closing the connection (Connection reset by peer). 3.0.3 is quite an old (but still maintained) version of lxd (I'm guessing the bionic packaged version). Does it happen with the snap version (3.17)?

There have been a range of issues around the container.execute() method; I had thought we'd ironed them out, but it may still be a pylxd issue. It would be great if you could collect some timings on how long the script runs between calls, to see whether it's (maybe) a keepalive issue or something similar; something like the sketch below would do.
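
A minimal timing wrapper (just a sketch; `client`, the containers, and the commands come from your script):

    import time

    def timed_execute(container, cmd):
        # Log how long each execute() call takes, to see whether the
        # resets correlate with long gaps or long-running commands.
        start = time.monotonic()
        result = container.execute(cmd)
        print(f"{cmd!r} took {time.monotonic() - start:.2f}s, exit={result.exit_code}")
        return result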

Thanks.

CluelessTechnologist commented 5 years ago

> It looks like lxd is closing the connection (Connection reset by peer). 3.0.3 is quite an old (but still maintained) version of lxd (I'm guessing the bionic packaged version). Does it happen with the snap version (3.17)?

It's better with 3.18, but we seem to always get a timeout on the first execute when running this block:

    for entry in containerList:
        if 'query' not in entry:
            container = client.containers.get(entry)
            if 'Active: active (running)' in str(container.execute(["systemctl", "status", "mongod"])):
                print('MongoDB Service (mongod) is active [', entry, ']')
            else:
                print('MongoDB Service (mongod) is not active [', entry, ']')
        else:
            container = client.containers.get(entry)
            if 'Active: active (running)' in str(container.execute(["systemctl", "status", "mongos"])):
                print('MongoDB Service (mongos) is active [', entry, ']')
            else:
                print('MongoDB Service (mongos) is not active [', entry, ']')

In total we do 7 executes, and about half of them time out. We can't really find a pattern: sometimes multiple executes are OK in a row, and sometimes there are multiple timeouts in a row. However, we have not seen any timeouts when running this block (also 7 times):

        container = client.containers.get(conName)
        container.execute(["systemctl", "start", "mongod"])

> There have been a range of issues around the container.execute() method; I had thought we'd ironed them out, but it may still be a pylxd issue. It would be great if you could collect some timings on how long the script runs between calls, to see whether it's (maybe) a keepalive issue or something similar.

This script takes quite a long time to run (5+ minutes), but it's only the last part of the script (the systemctl status output) that causes timeouts.
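
One thing we could try (just an idea, not a confirmed fix): re-create the client right before the final status loop, so the last executes don't reuse a connection that has already been open for several minutes:

    from pylxd import Client

    # Diagnostic for the keepalive theory: a fresh client right before the
    # final status checks means the last executes don't go over a connection
    # that has been open for the whole 5+ minute run.
    client = Client()  # local LXD socket, as elsewhere in the script
    container = client.containers.get(conName)
    print(container.execute(["systemctl", "status", "mongod"]))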

Also after upgrading to 3.18 we now receive these warnings:

/home/admin/.local/lib/python3.7/site-packages/pylxd/models/operation.py:79: UserWarning: Attempted to set unknown attribute "location" on instance of "Operation"
  .format(key, self.__class__.__name__))

/home/admin/.local/lib/python3.7/site-packages/pylxd/models/_model.py:137: UserWarning: Attempted to set unknown attribute "type" on instance of "Container"
  key, self.__class__.__name__
ajkavanagh commented 4 years ago

Ah, so this is some variation of the long-running-connection issue that has come back to bite us several times now. It seems that the websocket code is particularly sensitive to changes in how lxd handles and closes its websockets during long-running processes.

Is it possible to isolate what's happening into a test (like those in the integration-tests or contrib-testing directories)? It's not clear whether it's an idle connection being reset (and whether pylxd or lxd does that), an overflowing buffer/too much data causing a reset, or something else; a sketch along the lines below might separate those cases. If you can't isolate it into a 'test', can you get any logs or stack traces?
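
A rough isolation sketch (my assumptions: a local client and a running container named "test"), separating the idle-reset case from the too-much-output case:

    import time
    from pylxd import Client

    client = Client()
    container = client.containers.get("test")

    # Case 1: increasing idle gaps between short commands.
    for delay in (0, 30, 120):
        time.sleep(delay)
        print("after", delay, "s idle:", container.execute(["true"]).exit_code)

    # Case 2: a single command that produces a lot of output.
    result = container.execute(["sh", "-c", "seq 1 200000"])
    print("stdout length:", len(result.stdout))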

Incidentally, master with lxd 3.18 is also currently broken on container.execute() due to a broken pipe error: see #379 for more details. Is it the same thing?

hatkidchan commented 1 year ago

Still happens with LXD 5.8 and pylxd 2.3.1. Minimal reproducible code on my system:

from pylxd import Client
client = Client()
container = client.containers.get("test")
print(container.execute(["ls", "/"]).stdout)
""" =>
Failed to receive data
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/ws4py/websocket.py", line 394, in once
    b = self.sock.recv(self.reading_buffer_size)
ConnectionResetError: [Errno 104] Connection reset by peer
bin
dev
etc
home
lib
...
"""

So it does return the result of the execution, but prints that error in the middle of it. Is there any way to at least handle that silently?
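
For now I can hide the message by raising the ws4py logger's threshold (assuming that's where it comes from, which the traceback path suggests):

    import logging

    # The "Failed to receive data" traceback is logged by ws4py rather than
    # raised to the caller, so raising the logger's level suppresses it.
    logging.getLogger('ws4py').setLevel(logging.CRITICAL)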

sirkadirov commented 1 year ago

Hi! I have a similar issue when executing any command on an instance with LXD 5.13 on Ubuntu 22.04 LTS:

/usr/local/lib/python3.10/dist-packages/pylxd/models/_model.py:146: UserWarning: Attempted to set unknown attribute "project" on instance of "Instance"
  warnings.warn(
Failed to receive data
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ws4py/websocket.py", line 394, in once
    b = self.sock.recv(self.reading_buffer_size)
ConnectionResetError: [Errno 104] Connection reset by peer
2023-04-20 12:52:28.2129 [ERROR from Sirkadirov.Overtest.Daemon.ScriptingPlatform.ScriptingPlatform] Python.Runtime.PythonException: __enter__
  File "/home/sirkadirov/projects/overtest/overtest-daemon/bin/Debug/net7.0/scripts/lxc_template_gen.py", line 113, in generate_template
    with template_container.execute(['apt-get', '-y', '-q', 'install', 'python3', 'python3-pip']) as exec_result:

Now I'm trying to figure out how to work around or resolve it.
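
(The `__enter__` error at the bottom looks like a bug on my side, I think: `execute()` returns a plain result tuple, not a context manager, so it should be used directly rather than in a `with` block:)

    # execute() returns (exit_code, stdout, stderr) as a named tuple.
    exec_result = template_container.execute(
        ['apt-get', '-y', '-q', 'install', 'python3', 'python3-pip'])
    print(exec_result.exit_code, exec_result.stdout)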

sirkadirov commented 1 year ago

@CluelessTechnologist @hatkidchan I solved the errors during command execution inside LXC containers by using the latest pylxd package built from source. You can install it with this command:

pip install git+https://github.com/lxc/pylxd

For more details, see this discussion on the Linux Containers forum.

@rockstar I think it's time to close this issue. Thanks for your work!