Rahix / tbot

Automation/Testing tool for Embedded Linux Development
https://tbot.tools
GNU General Public License v3.0
84 stars 21 forks source link

Test case with picocom channel fails with "some subprocess(es) did not stop" #116

Closed VictorEmelin closed 1 month ago

VictorEmelin commented 2 months ago

Very simple test case fails with error:

$ newbot ts.hello_world.hi -c config.temp_config tbot starting ... ├─Calling hi ... │ ├─[local] picocom -b 115200 /dev/ttyUSB1 │ ├─[dm-board-linux] echo Hello World │ │ ## Hello World │ ├─test case finished │ └─Fail. (1.717s) ├─Warning: Found dangling <class 'config.temp_config.DmBoardLinux'> instance in this context ├─Warning: Found dangling <class 'config.temp_config.DmBoard'> instance in this context ├─Warning: Teardown went wrong! A <class 'config.temp_config.DmBoardLinux'> instance is still alive. │ Please report this to https://github.com/rahix/tbot/issues! ├─Warning: Teardown went wrong! A <class 'config.temp_config.DmBoard'> instance is still alive. │ Please report this to https://github.com/rahix/tbot/issues! ├─Exception: │ Traceback (most recent call last): │ File "/usr/lib/python3.8/contextlib.py", line 510, in exit │ if cb(*exc_details): │ File "/usr/lib/python3.8/contextlib.py", line 120, in exit │ next(self.gen) │ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/connector/common.py", line 132, in _connect │ yield ch │ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/channel/channel.py", line 561, in exit │ self.close() │ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/channel/channel.py", line 538, in close │ self._c.close() │ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/channel/subprocess.py", line 130, in close │ raise tbot.error.TbotException("some subprocess(es) did not stop") │ tbot.error.TbotException: some subprocess(es) did not stop ├───────────────────────────────────────── └─FAILURE (1.789s)

code hello_world.py:

import tbot

@tbot.testcase
def hi():
    with tbot.ctx.request(tbot.role.BoardLinux) as lnx:
        lnx.exec0("echo",
                  "Hello",
                  "World")
        tbot.log.message(f"test case finished")

code temp_config.py:

import tbot
from tbot.machine import connector, board, linux

class DmBoard(connector.ConsoleConnector, board.Board):
    baudrate = 115200
    serial_port = "/dev/ttyUSB1"

    def connect(self, mach):
        return mach.open_channel("picocom", "-b", str(self.baudrate), self.serial_port)

class DmBoardLinux(board.Connector, linux.Ash):
    pass

def register_machines(ctx: tbot.Context):
    ctx.register(DmBoard, tbot.role.Board)
    ctx.register(DmBoardLinux, tbot.role.BoardLinux)
VictorEmelin commented 2 months ago

You need to fix it by increasing wait time in tbot/tbot/machine/channel/subprocess.py line 118 Why did you limit exponential backoff to 1 sec? It could be greater without affecting anything else For me for t in range(11): works perfect. range(7) not enough.

Rahix commented 2 months ago

Hi,

the backoff limit was intentionally chosen to be quite low because usually, when subprocesses don't stop, the correct solution is to add code that cleanly terminates the offending processes. In your case, the solution is probably to send a termination escape sequence to picocom:

    def connect(self, mach):
        ch = mach.open_channel("picocom", "-b", str(self.baudrate), self.serial_port)
        try:
          yield ch
        finally:
          ch.sendcontrol("A")
          ch.sendcontrol("Q")

That said, I'd be open to extending the backoff time a bit but displaying a warning after one second that something isn't right.

VictorEmelin commented 2 months ago

@Rahix thanks for your suggestion (just contextlib.contextmanager decorator is needed to add). It works, but it works even this way:

    @contextlib.contextmanager
    def connect(self, mach):
        ch = mach.open_channel("picocom", "-b", str(self.baudrate), self.serial_port)
        yield ch

Didn't dive deep to it. And in both cases it can raise tbot.error.ChannelClosedError if spam this test very often. Maybe it's not important but increasing backoff time doesn't have this problem (for t in range(8): was enough for me)

So maybe worth it to extend backoff time or to add the trick above to documentation. Up to you.

Rahix commented 1 month ago

but it works even this way:

Yeah, it is a timing race between tbot's timeout and your picocom process stopping. It will work some days and some days it will fail.

To avoid ugly situations like this, I have taken great care to not make use of timeouts in tbot whereever possible. You ran into one of the few remaining ones...

I have now created #120 to hopefully improve on this. Please let me know if this works for you.

or to add the trick above to documentation

So I think the real solution here is to add a dedicated picocom connector to tbot, which implements this behavior instead of just documenting what people need to do.

And actually, this picocom connector already exists - in a way: The AutoConsoleConnector from tbot_contrib includes support for picocom, using the exact code I showed above. But as I just noticed, this connector never made it into the documentation :face_with_peeking_eye: Need to fix that asap...

Rahix commented 1 month ago

AutoConsoleConnector is documented now: https://tbot.tools/contrib/connector.html#tbot_contrib.connector.auto.AutoConsoleConnector