Rahix / tbot

Automation/Testing tool for Embedded Linux Development
https://tbot.tools
GNU General Public License v3.0

Memory leak in SubprocessChannelIO #87

Closed mebel-christ closed 1 year ago

mebel-christ commented 1 year ago

Lately I did some experiments with tbot power-cycling a DUT via SSH. It worked like a charm at first but kept failing after a few hundred cycles. After some investigation I found that this happens because tbot does not properly close some file descriptors. To simplify the problem (and hopefully make it reproducible) I ran tbot with the following setup:

config/eval.py:

from tbot import role
from tbot.machine import board, connector

class EvalBoard(connector.SubprocessConnector, board.Board):
    pass

def register_machines(ctx):
    ctx.register(EvalBoard, role.Board)

tc/test_fd.py:

import os
import tbot
import time

from tbot import role

@tbot.testcase
def test_fds():
    print(os.getpid())
    time.sleep(5)

    for i in range(2000):
        with tbot.ctx.request(role.Board) as b:
            tbot.log.message(f"RUN {i}")
            b.exec("echo", "Hello tbot!")

I execute the setup without any additional configuration like this: newbot -c config.eval tc.test_fd.test_fds. At some point (I guess this depends on your machine) this fails with an error message like the following (I saw different kinds of errors, depending on whether I use SubprocessConnector or SSHConnector):

│   ├─RUN 1015
│   ├─RUN 1016
│   └─Fail. (22.903s)
├─Exception:
│   Traceback (most recent call last):
│     File "/home/mebel/Environments/envtb/lib/python3.8/site-packages/tbot/machine/machine.py", line 158, in __enter__
│       self.ch = self._cx.enter_context(self._connect())
│     File "/home/mebel/Environments/envtb/lib/python3.8/site-packages/tbot/machine/connector/common.py", line 62, in _connect
│       return channel.SubprocessChannel()
│     File "/home/mebel/Environments/envtb/lib/python3.8/site-packages/tbot/machine/channel/subprocess.py", line 142, in __init__
│       super().__init__(SubprocessChannelIO())
│     File "/home/mebel/Environments/envtb/lib/python3.8/site-packages/tbot/machine/channel/subprocess.py", line 41, in __init__
│       self.p = subprocess.Popen(
│     File "/home/mebel/Environments/envtb/lib/python3.8/subprocess.py", line 858, in __init__
│       self._execute_child(args, executable, preexec_fn, close_fds,
│     File "/home/mebel/Environments/envtb/lib/python3.8/subprocess.py", line 1605, in _execute_child
│       errpipe_read, errpipe_write = os.pipe()
│   OSError: [Errno 24] Too many open files
├─────────────────────────────────────────
└─FAILURE (22.936s)

I modified the test like this to "freeze" execution at some point in time:

@tbot.testcase
def test_fds():
    print(os.getpid())
    time.sleep(5)

    for i in range(2000):
        with tbot.ctx.request(role.Board) as b:
            tbot.log.message(f"RUN {i}")
            b.exec("echo", "Hello tbot!")
            if i >= 5:
                b.interactive()

I get this output:

tbot starting ...
├─Calling test_fds ...
61006
│   ├─RUN 0
│   ├─RUN 1
│   ├─RUN 2
│   ├─RUN 3
│   ├─RUN 4
│   ├─RUN 5
│   ├─Entering interactive shell...
│   ├─Press CTRL+] three times within 1 second to exit.
bash-5.0$ Hello tbot!
bash-5.0$

When I inspect the open file-descriptors belonging to the tbot process with lsof -p 61006 I see the following:

...
newbot  61006 mebel    0u   CHR  136,6      0t0        9 /dev/pts/6
newbot  61006 mebel    1u   CHR  136,6      0t0        9 /dev/pts/6
newbot  61006 mebel    2u   CHR  136,6      0t0        9 /dev/pts/6
newbot  61006 mebel    3r   CHR    1,9      0t0       10 /dev/urandom
newbot  61006 mebel    4u   CHR    5,2      0t0       88 /dev/ptmx
newbot  61006 mebel    5u   CHR 136,10      0t0       13 /dev/pts/10 (deleted)
newbot  61006 mebel    6u   CHR 136,11      0t0       14 /dev/pts/11 (deleted)
newbot  61006 mebel    7u   CHR 136,12      0t0       15 /dev/pts/12 (deleted)
newbot  61006 mebel    8u   CHR 136,13      0t0       16 /dev/pts/13 (deleted)
newbot  61006 mebel    9u   CHR 136,14      0t0       17 /dev/pts/14 (deleted)
newbot  61006 mebel   10u   CHR 136,15      0t0       18 /dev/pts/15

It seems like the pseudo-terminals themselves are closed properly (since they show up as deleted) but the file descriptors remain open. @Rahix I already found a fix for the described behavior and will happily supply a PR, but could you first try to reproduce my issue and confirm it as a bug? Maybe I am doing something wrong or unintended in my test as well?
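The leak pattern described above can be illustrated with a minimal sketch (hypothetical code, not tbot's actual implementation): opening pty pairs without closing both ends leaves descriptors behind even after the terminal itself is gone. On Linux, the open descriptors of the current process can be counted via /proc/self/fd, which is the programmatic equivalent of the lsof inspection:

```python
import os
import pty

def open_fd_count() -> int:
    # On Linux, /proc/self/fd lists every open descriptor of this process.
    return len(os.listdir("/proc/self/fd"))

baseline = open_fd_count()

# Simulate the leak: open pty pairs and "forget" to close them.
leaked = [pty.openpty() for _ in range(5)]
assert open_fd_count() == baseline + 10  # two descriptors per pty pair

# The fix is to close *both* ends when the channel is torn down.
for master, slave in leaked:
    os.close(master)
    os.close(slave)
assert open_fd_count() == baseline
```

In a loop like the testcase above, each iteration that skips the cleanup step adds descriptors until RLIMIT_NOFILE is exhausted and os.pipe() fails with EMFILE.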

Rahix commented 1 year ago

Good catch, thanks for reporting!

Please just feel free to send pull-requests as soon as you have any code! It's always easier to talk about things when there is already some code to reference, even if it isn't the final version or the prettiest solution ;)

In any case, I will try reproducing this. Keep in mind that hitting the bug depends on your local system's file descriptor limit, which on many systems isn't as tight as yours (1024?). That's probably why I have never seen it so far.
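The limit in question is the per-process RLIMIT_NOFILE soft limit, which can be queried and lowered from the Python standard library to hit the leak sooner without touching the shell's ulimit (a sketch; the value 256 is just an example):

```python
import resource

# Query the current file-descriptor limits (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Lower the soft limit for this process only; an unprivileged process
# may set any soft limit up to the hard limit.
new_soft = 256
if hard != resource.RLIM_INFINITY:
    new_soft = min(new_soft, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

assert resource.getrlimit(resource.RLIMIT_NOFILE)[0] == new_soft
```

With the soft limit lowered like this, a leaking loop fails after far fewer iterations, which makes the reproduction much faster.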

Rahix commented 1 year ago

Reproduces well for me!

ulimit -H -n 1024
python3 bug.py
# bug.py
import os
import tbot
from tbot.machine import board, connector

class EvalBoard(connector.SubprocessConnector, board.Board):
    pass

@tbot.testcase
def test_fds():
    tbot.log.message(os.getpid())
    for i in range(2000):
        with EvalBoard() as b:
            tbot.log.message(f"RUN {i}")
            b.exec("echo", "Hello tbot!")

if __name__ == "__main__":
    test_fds()
Output
[...]
├─RUN 1014
├─RUN 1015
├─RUN 1016
├─RUN 1017
└─Fail. (21.718s)
Traceback (most recent call last):
  File "/run/user/1000/Temporary/bug.py", line 21, in <module>
    test_fds()
  File "/home/rahix/Documents/Development/tbot/tbot/decorators.py", line 62, in wrapped
    return tc(*args, **kwargs)
  File "/run/user/1000/Temporary/bug.py", line 15, in test_fds
    with EvalBoard() as b:
  File "/home/rahix/Documents/Development/tbot/tbot/machine/machine.py", line 158, in __enter__
    self.ch = self._cx.enter_context(self._connect())
  File "/home/rahix/Documents/Development/tbot/tbot/machine/connector/common.py", line 62, in _connect
    return channel.SubprocessChannel()
  File "/home/rahix/Documents/Development/tbot/tbot/machine/channel/subprocess.py", line 142, in __init__
    super().__init__(SubprocessChannelIO())
  File "/home/rahix/Documents/Development/tbot/tbot/machine/channel/subprocess.py", line 41, in __init__
    self.p = subprocess.Popen(
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1746, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

mebel-christ commented 1 year ago

Great, I will post the PR ASAP.

Keep in mind that hitting the bug depends on your local system's file descriptor limit which isn't as tight as yours (1024?) in many systems.

TBH I really wonder why this never became a problem until now, since you use select() in the SubprocessChannelIO, which in turn passes down to the select syscall that (according to this man page) can only ever handle file descriptors smaller than 1024.
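As a side note on that limitation: the standard library's selectors module can use a poll()- or epoll()-based implementation, which is not subject to select()'s FD_SETSIZE cap of 1024. A minimal sketch of watching a pipe for readability this way (illustrative only, not tbot's actual channel code):

```python
import os
import selectors

# DefaultSelector prefers epoll/poll on Linux, which accept
# arbitrarily large descriptor numbers, unlike select().
sel = selectors.DefaultSelector()

r, w = os.pipe()
sel.register(r, selectors.EVENT_READ)

os.write(w, b"ping")
ready = sel.select(timeout=1.0)  # list of (SelectorKey, events)
assert ready and ready[0][0].fd == r
assert os.read(r, 4) == b"ping"

# Clean up: unregister and close everything so no descriptors leak.
sel.unregister(r)
sel.close()
os.close(r)
os.close(w)
```

So as long as the process stays under the select()-based I/O path, any descriptor that ends up numbered 1024 or higher would be a problem even without the leak.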