Rahix / tbot

Automation/Testing tool for Embedded Linux Development
https://tbot.tools
GNU General Public License v3.0
84 stars 21 forks source link

Processes not being terminated after SSH connection #89

Closed mebel-christ closed 1 year ago

mebel-christ commented 1 year ago

Even after #88 there are still some file-descriptors that appear to remain open (even though they should be closed) when I connect to a DUT with the SSHConnector. I use this setup:

config/eval.py:

from tbot import role
from tbot.machine import board, connector, linux
from tbot.machine.linux import auth

class EvalBoard_Linux(connector.SSHConnector, linux.Bash):

    @property
    def authenticator(self) -> auth.Authenticator:
        return auth.PasswordAuthenticator("<password>")

    @property
    def hostname(self) -> str:
        return "<ip-address>"

    @property
    def username(self) -> str:
        return "<username>"

def register_machines(ctx):
    ctx.register(EvalBoard_Linux, role.BoardLinux)

tc/test_fd.py:

import os
import tbot
import time

from tbot import role, log

@tbot.testcase
def test_fds():
    log.message(f"tbot PID: {os.getpid()}")
    time.sleep(5)

    for i in range(5):
        log.message(f"RUN {i}")
        with tbot.ctx.request(role.BoardLinux) as lnx:
            lnx.exec0("echo", "Hello Linux!")
        log.message(f'Open FDs: {sorted(set(os.listdir("/proc/self/fd")))}')

I execute the setup without any additional configuration like this: newbot -c config.eval tc.test_fd.test_fds This is the output:

tbot starting ...
├─Calling test_fds ...
│   ├─tbot PID: 6006
│   ├─RUN 0
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '30', '39', '4', '6', '7']
│   ├─RUN 1
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '30', '39', '4', '6', '7', '8', '9']
│   ├─RUN 2
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '10', '103', '11', '12', '2', '26', '27', '28', '29', '3', '30', '39', '4', '6', '7', '8', '9']
│   ├─RUN 3
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '10', '103', '11', '12', '13', '14', '2', '26', '27', '28', '29', '3', '30', '39', '4', '6', '7', '8', '9']
│   ├─RUN 4
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '10', '103', '11', '12', '13', '14', '15', '16', '2', '26', '27', '28', '29', '3', '30', '39', '4', '6', '7', '8', '9']
│   └─Done. (8.901s)
├─────────────────────────────────────────
└─SUCCESS (8.934s)

This time it seems that some processes remain active, after the connection is closed. htop shows this: tbot

@Rahix can you confirm this happens as well on your end?

Proposed solution

Change the yield step in the current _connect() method of SSHConnector like this:

_ch = h.ch.take()
yield _ch
_ch.close()

This properly closes all remaining channels / processes and results in the following test run:

tbot starting ...
├─Calling test_fds ...
│   ├─tbot PID: 9308
│   ├─RUN 0
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '31', '33', '35', '39', '4']
│   ├─RUN 1
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '31', '33', '35', '39', '4']
│   ├─RUN 2
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '31', '33', '35', '39', '4']
│   ├─RUN 3
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '31', '33', '35', '39', '4']
│   ├─RUN 4
│   ├─[local] sshpass -p <password> ssh -p 22 <username>@<ip-address>
│   ├─[eval-board_-linux] echo 'Hello Linux!'
│   │    ## Hello Linux!
│   ├─Open FDs: ['0', '1', '103', '11', '2', '26', '27', '28', '29', '3', '31', '33', '35', '39', '4']
│   └─Done. (8.902s)
├─────────────────────────────────────────
└─SUCCESS (8.937s)
Rahix commented 1 year ago

Ah, yeah, after "taking" a channel, you are responsible for its lifecycle. And SSHConnector doesn't properly take care of its channel here. Your suggestion is close, but what we should really do is this:

with h.ch.take() as ch:
    yield ch

this has the benefit that it also closes the channel in case of an exception. Your code would still leak the channel in such situations.

That said, looking at the code, I wonder why it isn't using the h.open_channel() method. It looks almost identical to the code in ssh.py but comes with a few more features/safeguards. I think this change should work:

with h.open_channel(
    *cmd,
    *hk_disable,
    *multiplexing,
    *["-p", str(self.port)],
    *[arg for opt in self.ssh_config for arg in ["-o", opt]],
    f"{self.username}@{self.hostname}",
) as ch:
    yield ch
mebel-christ commented 1 year ago

Ah, great solution, I totally missed that. I just tried the open_channel method and it seems to integrate very well.

Additionally a quick question about this: https://github.com/Rahix/tbot/blob/6fbc41003b159286b983107ea60477918fda6470/tbot/machine/connector/ssh.py#L166-L170

The "modern" solution would be this right?

with tbot.ctx() as cx:
    if self.host is None:
        self.host = cx.request(tbot.role.LabHost)
    h = cx.enter_context(self.host.clone())

Is it possible to modernize this? I would supply both changes in one PR since they change the same function or should this remain unchanged?

Rahix commented 1 year ago

Should be tbot.role.LocalHost, but I think you are right. I'm always a bit scared about anything involving the old tbot.selectable interface as I don't know how many people still rely on it in some way and it is super fragile... But we can't support legacy stuff forever so if this does cause a breakage, people either need to speak up or update to the new interface ^^ So I'd say go ahead and do this change, too :)

mebel-christ commented 1 year ago

Just curious, why would you go with LocalHost instead of LabHost? By default it should be the same anyway and if one uses a remote Lab it would automagically work as well right? Or is this to fragile?

Rahix commented 1 year ago

Just curious, why would you go with LocalHost instead of LabHost? By default it should be the same anyway and if one uses a remote Lab it would automagically work as well right? Or is this to fragile?

Okay, this entire situation is a big mess. Actually, the from_context() constructor is supposed to pass along the lab-host in 99% of cases so your suggested automagically correct host is exactly what I had in mind there. The local-host here is just the fall back for sitations where a machine is instantiated outside of the context and as such a lab-host may not exist or is not wanted. However, I believe this "choose lab-host" codepath is currently broken - still need to verify.

Moving a bit further back, this plays into the bigger issue of seamless integration between machines and the context. This is one of the warts that still need to be solved, the decision of what host an ssh connection should originate from. I think, in the end, the solution is to give the SSHConnector a new configuration attribute which holds a role to use, defaulting to LabHost + magic (see f9f2fe8a2512). Still don't feel entirely comfortable with it yet, though, as I first want to more clearly define the interaction of machines with the context...

Rahix commented 1 year ago

Also see #92.