ADTRAN / netconf_client

A NETCONF client for Python 3.8+
https://adtran.github.io/netconf_client/

Exceptions in Socket _recv_loop may cause thread to die silently #28

Open APCBoston opened 1 year ago

APCBoston commented 1 year ago

See Paramiko Transport constructor at:

https://github.com/paramiko/paramiko/blob/main/paramiko/transport.py#LL454-L457

As a result, netconf_client SSH connections that should time out can instead hang forever.

Working on further investigation/resolution, putting this here for situational awareness.

APCBoston commented 1 year ago

We do pass a timeout to paramiko's transport.open_session(), but I'm still seeing it not time out when it should...

JennToo commented 1 year ago

The code you linked seems to do the opposite: it sets the timeout very low.

One point of confusion here: where you're expecting to see something time out, is that because the TCP session dies, or something else? I would assume that setting timeouts on the socket only affects TCP-level stuckness. But if the SSH server or NETCONF server itself took a long time to reply, that wouldn't be caught by socket-level timeouts.

I could be wrong there, I haven't done socket programming in a while. But that's how I'd expect socket-level timeouts to work.

JennToo commented 1 year ago

To elaborate further: if it's e.g. the NETCONF server getting stuck at the NC protocol level, the TCP stream will still get ACK'd by the kernel, and as far as TCP is concerned everything is fine. Eventually, if the underlying service isn't pulling data off the socket, the socket's buffer might fill up and the kernel will stop ACK'ing, but that might never happen on a low-volume connection.
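To make that distinction concrete, here is a rough, illustrative sketch (placeholder host and port, not netconf_client code): a socket-level timeout only fires if no bytes arrive at all within the window, so a server that keeps the TCP session alive but never finishes its reply sails right past it.

```python
import socket

# Placeholder host/port; NETCONF-over-SSH servers typically listen on 830.
sock = socket.create_connection(("device.example.com", 830))
sock.settimeout(10)

try:
    data = sock.recv(4096)  # raises socket.timeout only if *nothing* arrives in 10s
except socket.timeout:
    print("no bytes at all for 10 seconds: TCP-level stuckness")

# A server that keeps TCP alive (ACKs, keepalives, partial data) but never
# sends a complete reply will not trip this timeout; that kind of stall has
# to be bounded at a higher layer, e.g. on the future awaiting the reply.
```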

APCBoston commented 1 year ago

@JennToo the operative bit that caught my eye in the linked code is that the socket timeout exception is never propagated.

What I'm seeing in my testing is that whether I kill the NETCONF server while keeping the TCP socket open or take the device down entirely, subsequent NETCONF RPCs hang indefinitely.

It looks like this is because (and I'm sort of live-blogging my troubleshooting in this thread) Transport.open_channel() accepts a timeout but only uses it for the initial connection. After that, its default behavior is a blocking socket with no timeout. (See here and here.)
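A hedged sketch of that behavior against the paramiko API (placeholder host and credentials; this is not the library's connect_ssh): the timeout passed to open_session() bounds only the channel-open handshake, so later reads block forever unless the channel is also given its own timeout.

```python
import paramiko

transport = paramiko.Transport(("device.example.com", 830))  # placeholder host
transport.connect(username="admin", password="admin")        # placeholder creds

# The 30s timeout here applies only to opening the channel itself.
chan = transport.open_session(timeout=30)
chan.invoke_subsystem("netconf")

# Reads are still fully blocking at this point; without the settimeout()
# call below, recv() can hang forever if the server goes silent.
chan.settimeout(30)
data = chan.recv(4096)
```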

JennToo commented 1 year ago

Oh that's interesting. Are you using the ncclient-adapter manager from https://github.com/ADTRAN/netconf_client/blob/main/netconf_client/ncclient.py ?

We're setting a timeout there on the futures returned by this library's session handler, and that should actually be catching this too. That one should be catching a timeout regardless of which protocol layer things get stuck at. There might be a bug with the timeout logic in this library though.
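As an editorial sketch of that kind of future-based deadline (the names below are illustrative, not the manager's actual internals): the caller blocks on a concurrent.futures.Future with a timeout, so a stall at any protocol layer surfaces as a timeout error rather than an indefinite hang.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeoutError

def wait_for_reply(reply_future: Future, timeout: float = 120.0):
    try:
        return reply_future.result(timeout=timeout)
    except FutureTimeoutError:
        raise TimeoutError(f"no RPC reply within {timeout} seconds")

# A reader thread would call reply_future.set_result(...) once a complete
# <rpc-reply> arrives; if that thread dies without completing the future,
# this deadline is the only thing that unblocks the caller.
```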

APCBoston commented 1 year ago

Yes we are using the manager, but it's not catching it. I believe the reason is related to the fact that paramiko is spawning threads rather than using asyncio futures, but I haven't looked at that bit of the code in detail (yet). I think I should be able to have a patch out later today regardless.

JennToo commented 1 year ago

I did a re-read of how we do the timeout logic in the manager object and I didn't spot any obvious bugs. It is a little complicated though, so there could certainly be something wrong with it.

JennToo commented 1 year ago

We actually spawn a thread too: all the NC-level protocol stuff happens on that thread, and the future is what the client thread waits on.

I guess it is possible, though, that something (possibly paramiko) is making a blocking call into some C thing with the GIL held. That would prevent the other Python threads from running.

APCBoston commented 1 year ago

My hypothesis is that there's a simple solution of calling channel.settimeout(general_timeout) in connect_ssh. Testing that now...

APCBoston commented 1 year ago

Quick update: channel.settimeout() didn't hurt anything, but I ultimately discovered that a timeout error was not my friction point. Instead, I have a much odder problem, one that none of the Python experts in my own network can make heads or tails of, and that (I say this advisedly) could be a bug in the Python interpreter itself (CPython 3.10, CPython 3.11).

The thing I ran into is described below. It is not *this* issue and may not be a netconf_client issue at all, but I'm including it here for reference and in case anybody coming across it has the answer. I'm inclined to leave this issue open until I've generated a socket timeout and watched it get handled correctly or incorrectly, which is in my near-term (4-8 week) backlog.

Weird error details below the cut:

I had an `asyncio` coroutine calling `netconf_client`'s `Manager.dispatch()` method to send an RPC. This in turn filters down through a series of `netconf_client` and `paramiko` calls to a `paramiko` `Transport` object, which does the socket-level work in a separate thread (`Transport` in fact subclasses `threading.Thread`). `netconf_client` creates a `concurrent.futures.Future` (which I have learned is similar to, but not the same as, an AIO Future) to await the response from the device. I found that when the device was dead, the following were true:

1. Paramiko instantly raised a `socket.error` exception and passed it back to `netconf_client`.
2. `netconf_client` caught the exception, logged it, and re-raised.
3. If I provoked the behavior through the REPL or a pdb debugging session, that same exception would immediately propagate up to that REPL or debugger session, where I'd see it in the terminal. You know, like you expect exceptions to do. Stack trace and all. (I have since learned that `asyncio` provides a _separate_ async REPL. Haven't tried this yet.)
4. HOWEVER, if the same code was called from an aio coroutine, that coroutine would hang forever. If I subsequently broke in with the debugger, I could see an event loop churning forever, doing nothing.
5. BUT WAIT, THERE'S MORE: when I added an explicit handler for that specific exception (`except socket.error:` ...) in the coroutine function, the behavior returned to normal; the exception was caught and handled per that code block, etc.

Thanks to #5, I got my code to work, but it bugs me that I don't understand in the slightest why 3-5 happen the way they do.
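A minimal sketch of the workaround in item 5 (the `poll_device` coroutine and the RPC payload are hypothetical; only `Manager.dispatch()` comes from the description above, and its exact argument shape is an assumption):

```python
import socket

async def poll_device(manager, rpc_xml):
    # dispatch() is a blocking call, so (as in the setup described above) it
    # runs on the event loop thread for its full duration.
    try:
        return manager.dispatch(rpc_xml)
    except socket.error as exc:
        # With this explicit handler present, the failure is caught and
        # handled here instead of the coroutine hanging forever.
        print(f"device unreachable: {exc}")
        return None
```
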
JennToo commented 1 year ago

Wow! Yeah that's very strange indeed.

Some day I'd like for this library to be properly AIO-aware and compatible. It was written well before that stuff got standardized, but long term it'd be good to just make it async-native. I guess we'd also need (or at least want) an async-friendly SSH library too though or it'd be a bit moot.

Realistically it'd be nearly a rewrite though, and at least for the way our company uses this library (mainly just for integration testing), it probably won't get priority any time soon.

I'll change this bug's title to reflect what you found and leave it open, just in case anyone else ever stumbles into this. But it sounds like there's not much we can do to fix it within this library.

APCBoston commented 1 year ago

Okay, this came back up for me earlier this month and after a lot of digging I believe I have found the ur-source of the problem at https://github.com/ADTRAN/netconf_client/blob/280d9d6e19828ae7c96d359ee3e2729b44e63a48/netconf_client/session.py#L107C1-L115C1.
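As an illustrative sketch of the failure mode in this issue's title (this is not the library's actual `_recv_loop`, just the general shape of the problem): if an exception escapes a daemon reader thread's loop, the thread exits silently and nothing ever completes the futures that callers are waiting on.

```python
import threading


class Reader(threading.Thread):
    def __init__(self, sock):
        super().__init__(daemon=True)
        self.sock = sock
        self.pending = []  # futures awaiting replies

    def run(self):
        while True:
            # If recv() raises (e.g. OSError on a dead connection) and nothing
            # catches it here, run() returns, the thread dies silently, and
            # every future in self.pending stays unresolved forever.
            data = self.sock.recv(4096)
            ...  # parse frames and complete the matching futures
```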

Things I have learned:

- In my case, I found that an adequate workaround was to call `Session.thread.is_alive()` before making a periodic RPC to check whether the device is still there.
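A minimal sketch of that workaround (`safe_dispatch`, the attribute layout of `session`, and the RPC payload are assumptions based on the comment above, not documented API):

```python
def safe_dispatch(manager, session, rpc_xml):
    # If the reader thread has died (e.g. an exception escaped _recv_loop),
    # any RPC we send will never have its future completed, so bail out
    # early and reconnect instead of hanging.
    if not session.thread.is_alive():
        raise ConnectionError("NETCONF session reader thread is dead; reconnect")
    return manager.dispatch(rpc_xml)
```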