M-Welsch / backup-server

Backup Server (BaSe)
Apache License 2.0

/dev/ttyBASEPCU not always available when needed #37

Closed M-Welsch closed 7 months ago

M-Welsch commented 7 months ago

Describe the bug

see below

Expected behavior

/dev/ttyBASEPCU should be available when needed

Actual behavior

The bcu software sometimes crashes when trying to open the serial terminal on /dev/ttyBASEPCU. This happens roughly every 34th run.

What happens if we don't solve it (aka why is it important)

The bcu cannot engage the HDD when the PCU is not available, so no backup can be made.

To Reproduce

Steps to reproduce the behavior:

Run a test that just starts the bcu, tries the handshake, shuts down again, sleeps for a minute, and repeats.

Additional context, Environment

Observed during #31.


M-Welsch commented 7 months ago

Couldn't reproduce in 28 runs with the following code:

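import argparse
from datetime import timedelta
from pathlib import Path

# load_config, init, set_wakeup_time, shutdown and LOG are defined elsewhere in the bcu code base

async def main() -> None:
    parser = argparse.ArgumentParser()
    # store_true makes this a real flag; without it argparse expects a value after --no-shutdown
    parser.add_argument("--no-shutdown", action="store_true", default=False, required=False)
    parser.add_argument("--config", default="config.yaml", type=str, required=False)
    args = parser.parse_args()
    LOG.info(f"loading config file {args.config}")
    cfg = load_config(Path(args.config))
    await init(cfg["logger"])  # init() performs the PCU handshake
    # engage, backup and disengage are disabled for this reproduction test
    # await engage()
    # await backup(cfg["backup"])
    # await disengage()
    # await wait_before_shutdown(cfg)
    await set_wakeup_time(timedelta(minutes=1))
    if not args.no_shutdown:
        await shutdown()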

M-Welsch commented 7 months ago

Happened again on 24.1.24:

Jan 24 03:41:15 basehw4sn2 sudo[880]:     base : PWD=/home/base/backup-server/software/bcu ; USER=root ; COMMAND=/sbin/shutdown -h now
Jan 24 03:41:15 basehw4sn2 sudo[880]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1001)
Jan 24 03:41:15 basehw4sn2 sudo[880]: pam_unix(sudo:session): session closed for user root
-- Boot f0ce1e227d6d4d019ab2e95155c4143f --
Jan 24 03:41:21 basehw4sn2 python3[439]: Traceback (most recent call last):
Jan 24 03:41:21 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/venv/lib/python3.11/site-packages/serial/serialposix.py", line 322, in open
Jan 24 03:41:22 basehw4sn2 python3[439]:     self.fd = os.open(self.portstr, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
Jan 24 03:41:22 basehw4sn2 python3[439]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]: FileNotFoundError: [Errno 2] No such file or directory: '/dev/ttyBASEPCU'
Jan 24 03:41:22 basehw4sn2 python3[439]: During handling of the above exception, another exception occurred:
Jan 24 03:41:22 basehw4sn2 python3[439]: Traceback (most recent call last):
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "<frozen runpy>", line 198, in _run_module_as_main
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "<frozen runpy>", line 88, in _run_code
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/__main__.py", line 181, in <module>
Jan 24 03:41:22 basehw4sn2 python3[439]:     asyncio.run(main())
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
Jan 24 03:41:22 basehw4sn2 python3[439]:     return runner.run(main)
Jan 24 03:41:22 basehw4sn2 python3[439]:            ^^^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
Jan 24 03:41:22 basehw4sn2 python3[439]:     return self._loop.run_until_complete(task)
Jan 24 03:41:22 basehw4sn2 python3[439]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
Jan 24 03:41:22 basehw4sn2 python3[439]:     return future.result()
Jan 24 03:41:22 basehw4sn2 python3[439]:            ^^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/__main__.py", line 170, in main
Jan 24 03:41:22 basehw4sn2 python3[439]:     await init(cfg["logger"])
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/__main__.py", line 44, in init
Jan 24 03:41:22 basehw4sn2 python3[439]:     await pcu.handshake()
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/pcu.py", line 212, in handshake
Jan 24 03:41:22 basehw4sn2 python3[439]:     while not (response := await _probe()) == 'Echo':
Jan 24 03:41:22 basehw4sn2 python3[439]:                            ^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/pcu.py", line 205, in _probe
Jan 24 03:41:22 basehw4sn2 python3[439]:     return await call_pcu("probe")
Jan 24 03:41:22 basehw4sn2 python3[439]:            ^^^^^^^^^^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/pcu.py", line 193, in call_pcu
Jan 24 03:41:22 basehw4sn2 python3[439]:     with Serial("/dev/ttyBASEPCU", baudrate=38400, timeout=1) as ser:  # timeout is critical
Jan 24 03:41:22 basehw4sn2 python3[439]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/venv/lib/python3.11/site-packages/serial/serialutil.py", line 244, in __init__
Jan 24 03:41:22 basehw4sn2 python3[439]:     self.open()
Jan 24 03:41:22 basehw4sn2 python3[439]:   File "/home/base/backup-server/software/bcu/venv/lib/python3.11/site-packages/serial/serialposix.py", line 325, in open
Jan 24 03:41:22 basehw4sn2 python3[439]:     raise SerialException(msg.errno, "could not open port {}: {}".format(self._port, msg))
Jan 24 03:41:22 basehw4sn2 python3[439]: serial.serialutil.SerialException: [Errno 2] could not open port /dev/ttyBASEPCU: [Errno 2] No such file or directory: '/dev/ttyBASEPCU'

M-Welsch commented 7 months ago

https://github.com/M-Welsch/backup-server/blob/511d455c000a05693782568a211b4c451cc8cc38/software/bcu/pcu.py#L211C1-L220C1

Add a check for the device node and log a warning if it is missing.

Modified config.yaml (the production file, because that was easier) to use backup_testdata_source and sleep for only 1 minute.

Results

In 206 runs, the following line appeared twice:

/dev/ttyBASEPCU not found, retrying ...

Since the backup server kept working, the retry approach works.
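
For readers who want to see the idea in isolation, here is a minimal sketch of such a device-node check with retries. It is not the code from the linked pcu.py: the function name wait_for_device_node and the parameters retries and delay_s are hypothetical, and the default values are arbitrary. Only the warning text is taken from the results above.

```python
import asyncio
import logging
from pathlib import Path

LOG = logging.getLogger(__name__)


async def wait_for_device_node(
    device: Path,
    retries: int = 10,      # hypothetical default, not taken from the repository
    delay_s: float = 0.5,   # hypothetical default, not taken from the repository
) -> bool:
    """Poll until `device` exists; return False if it never shows up."""
    for _ in range(retries):
        if device.exists():
            return True
        # Same wording as the warning quoted in the results above
        LOG.warning("%s not found, retrying ...", device)
        await asyncio.sleep(delay_s)
    # One last look after the final sleep
    return device.exists()
```

A caller could await wait_for_device_node(Path("/dev/ttyBASEPCU")) before opening the serial port and only proceed with the handshake when it returns True. Using asyncio.sleep rather than time.sleep keeps the event loop free while waiting.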