ARM-software / devlib

Library for interaction with and instrumentation of remote devices.
Apache License 2.0
47 stars 78 forks source link

utils.ssh.SshConnectionBase issues caused by garbage collection #684

Closed ep1cman closed 3 months ago

ep1cman commented 5 months ago

I am hitting a rather odd issue where at the end of my scripts I am seeing a flood of the following messages:

Exception ignored in: <function ConnectionBase.__del__ at 0x1062e1a20>
Traceback (most recent call last):
  File "/Users/sebgos01/Library/Caches/pypoetry/virtualenvs/appalib-JvmmYHe4-py3.10/lib/python3.10/site-packages/devlib/utils/misc.py", line 961, in wrapper
  File "/Users/sebgos01/Library/Caches/pypoetry/virtualenvs/appalib-JvmmYHe4-py3.10/lib/python3.10/site-packages/devlib/connection.py", line 121, in __del__
  File "/Users/sebgos01/Library/Caches/pypoetry/virtualenvs/appalib-JvmmYHe4-py3.10/lib/python3.10/site-packages/devlib/utils/misc.py", line 961, in wrapper
  File "/Users/sebgos01/Library/Caches/pypoetry/virtualenvs/appalib-JvmmYHe4-py3.10/lib/python3.10/site-packages/devlib/connection.py", line 111, in close
  File "/Users/sebgos01/Library/Caches/pypoetry/virtualenvs/appalib-JvmmYHe4-py3.10/lib/python3.10/site-packages/devlib/utils/misc.py", line 961, in wrapper
  File "/Users/sebgos01/Library/Caches/pypoetry/virtualenvs/appalib-JvmmYHe4-py3.10/lib/python3.10/site-packages/devlib/utils/ssh.py", line 725, in _close
AttributeError: 'NoneType' object has no attribute 'debug'

Both logger and _handle_paramiko_exceptions no longer seem to exist.

What I think is happening is that during interpreter shut down these have already been garbage collected, and when it comes time to garbage collect ConnectionBase, __del__ is called but ends up being unable to run.

marcbonnici commented 5 months ago

@douglas-raillard-arm Have you seen this behavior on your setup?

douglas-raillard-arm commented 5 months ago

I haven't seen it at that specific location but in general yes, that makes sense. That's partly why relying on __del__ is a bad idea and context managers should be used.

What we should do is :

  1. Provide the context manager
  2. Maybe silently ignore all exceptions in del. It's not ideal as it might hide other bugs, but not much other choice. We could still log something at debug level

EDIT: I lied apparently and did hit the exact same problem: https://gitlab.arm.com/tooling/lisa/-/blob/main/lisa/target.py?ref_type=heads#L1309 https://gitlab.arm.com/tooling/lisa/-/blob/main/lisa/_cli_tools/lisa_load_kmod.py?ref_type=heads#L45

douglas-raillard-arm commented 5 months ago

Actually there might be a 3rd way out to ensure __del__ is ran before the namespace starts being torn down. I'll give that a shot today and report back.

douglas-raillard-arm commented 5 months ago

@ep1cman Could you give a go at that PR ? It introduces a context manager API for Target so that you can use:

with Target(...) as target:
   ...

And the errors should be gone.

If it works for you I'll remove the draft status.

EDIT: PR is updated to also avoid the problem when the context manager API is not used by using an atexit handler.

ep1cman commented 5 months ago

Unfortunately I am unable to currently re-create the race condition with the garbage collector that caused this error

douglas-raillard-arm commented 5 months ago

@ep1cman were you using a version of devlib published before or after March 29th, 1:04 AM GMT ?

ep1cman commented 5 months ago

Currently pip freeze shows: devlib @ git+https://github.com/ARM-software/devlib.git@7276097d4e12ff2b3cfa1bb0ba40cee24ae3372b (The latest commit). I believe this is what I was using when reporting this but I couldn't say for sure if I had tried to update in the process of debugging this.

douglas-raillard-arm commented 5 months ago

Ok, I'd say we can merge https://github.com/ARM-software/devlib/pull/685 and it's likely to fix the issue, but maybe what you experienced came from something not covered, as this PR should have fixed the issue for SshConnection specifically.

https://github.com/ARM-software/devlib/pull/685 should fix it for every connection type, and for any other resource held by Target so it's more robust.

Does you script use multithreading explicitly ? (i.e. more than just what devlib does internally)

ep1cman commented 5 months ago

The code I am developing mixes devlib with something called "labgrid" than fires off its own asyncio loop, multiple sub processes too so it wouldn't surprise me if that's what made the race condition pop up

douglas-raillard-arm commented 5 months ago

Ok maybe it's partially related then, if anything would prevent the atexit handler to execute. According to the Python doc:

Note: The functions registered via this module are not called when the program is killed by a signal not handled by Python, when a Python fatal internal error is detected, or when os._exit() is called.

I can't really imagine that situation being your case though. In any case, the context manager API should provide a way to deal with that cleanly, and in current devlib codebase, calling Target.disconnect().