APCBoston opened 1 year ago
We do pass a timeout to paramiko's `transport.open_session()`, but I'm still seeing it not time out when it should...
The code you linked there seems to be the opposite, it sets the timeout very low.
One point of confusion here: where you're expecting to see something time out, is it because the TCP session dies, or something else? I would assume that setting timeouts on the socket only affects TCP-level stuckness. But if the SSH server or NETCONF server itself took a long time to reply, that wouldn't get caught by socket-level timeouts.
I could be wrong there, I haven't done socket programming in a while. But that's how I'd expect socket-level timeouts to work.
To elaborate further: if it's e.g. the NETCONF server getting stuck at the NC protocol level, the TCP stream will still get ACK'd by the kernel and as far as TCP is concerned everything is fine. Eventually if the underlying service isn't grabbing stuff off the socket, the socket's buffer might fill up and then the kernel will stop ACK'ing stuff, but that might not happen if it's a low-volume connection.
@JennToo the operative bit that caught my eye in the linked code is that the socket timeout exception is never propagated.
What I'm seeing in my testing is that whether I kill the NETCONF server while keeping the TCP socket open or take the device down entirely, subsequent NETCONF RPCs hang indefinitely.
It looks like this is because (and I'm sort of live-blogging my troubleshooting in this thread) `Transport.open_channel()` accepts a timeout, but only uses it for the initial connection. After that, its default behavior is a blocking socket with no timeout. (See here, and here.)
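As an illustration of that default (using a plain socket rather than a paramiko `Channel`, though `Channel` exposes the same `gettimeout()`/`settimeout()` interface): a freshly created socket has no timeout set at all, meaning reads block indefinitely.

```python
import socket

# A fresh socket reports no timeout, i.e. fully blocking reads -- the same
# state a paramiko channel is left in after the initial connection.
sock = socket.socket()
default = sock.gettimeout()
print(default)  # → None
sock.close()
```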
Oh that's interesting. Are you using the ncclient-adapter manager from https://github.com/ADTRAN/netconf_client/blob/main/netconf_client/ncclient.py ?
We're setting a timeout there on the futures returned by this library's session handler, and that should actually be catching this too: a timeout on the future fires regardless of which protocol layer things get stuck at. There might be a bug with the timeout logic in this library though.
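To sketch why a future-level timeout should be layer-agnostic (hypothetical stand-in code, not the manager's actual implementation): if nothing ever resolves the future, `result(timeout=...)` raises no matter why the reply never came.

```python
from concurrent.futures import Future
import concurrent.futures

# A Future that is never completed, like an RPC whose reply never arrives
# because something -- TCP, SSH, or NETCONF -- is stuck.
fut = Future()

try:
    fut.result(timeout=0.2)
    caught = False
except concurrent.futures.TimeoutError:
    caught = True

print(caught)  # → True
```

The catch, as noted below, is that this only works if the thread waiting on the future can actually run; it doesn't help if the whole interpreter is wedged.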
Yes, we are using the manager, but it's not catching it. I believe the reason is related to the fact that `paramiko` is spawning threads rather than using `asyncio` futures, but I haven't looked at that bit of the code in detail (yet). I think I should be able to have a patch out later today regardless.
I did a re-read of how we do the timeout logic in the manager object and I didn't spot any obvious bugs. It is a little complicated though, so there could certainly be something wrong with it.
We actually spawn a thread too, all the NC-level protocol stuff happens on that thread and the future is what the client thread is using.
I guess it is possible, though, that something (possibly paramiko) is making a blocking call into some C code with the GIL held. That would prevent the other Python threads from running.
My hypothesis is that there's a simple solution: calling `channel.settimeout(general_timeout)` in `connect_ssh`. Testing that now...
Quick update: `channel.settimeout()` didn't hurt anything, but I ultimately discovered that a timeout error was not my friction point. Instead, I have a much odder problem, one that none of the Python experts in my own network can make heads or tails of, and that (I say this advisedly) could be a bug in the Python interpreter itself (CPython 3.10, CPython 3.11).
The thing I ran into is described below. Though it is not -this- issue and may not be a `netconf_client` issue at all, I'm including it here for reference and in case anybody coming across this has the answer. I'm inclined to leave this issue open until I've generated a socket timeout and watched it get handled correctly or incorrectly, which is in my near-term (4-8 week) backlog.
Wow! Yeah that's very strange indeed.
Some day I'd like for this library to be properly AIO-aware and compatible. It was written well before that stuff got standardized, but long term it'd be good to just make it async-native. I guess we'd also need (or at least want) an async-friendly SSH library too though or it'd be a bit moot.
Realistically it'd be nearly a rewrite though, and at least for the way our company uses this library (mainly just for integration testing), it probably won't get priority any time soon.
I'll change this bug's title to reflect what you found and leave it open, just in case anyone else ever stumbles into this. But it sounds like there's not much we can do to fix it within this library.
Okay, this came back up for me earlier this month and after a lot of digging I believe I have found the ur-source of the problem at https://github.com/ADTRAN/netconf_client/blob/280d9d6e19828ae7c96d359ee3e2729b44e63a48/netconf_client/session.py#L107C1-L115C1.
Things I have learned:

- When `Session._recv_loop()` attempts to read from a socket that is associated with a dead device, that socket raises an exception. The `Session` emits an `info`-level log and the thread managing the recv loop exits, but the exception is not propagated by `Session` or `paramiko`.
- Because the `Session` object still exists, it will continue accepting new RPC requests (and other method calls), which then hang in the `sendall` call if the TCP socket attached to the dead device times out repeatedly.

In my case, I found that an adequate workaround was to call `Session.thread.is_alive()` before making a periodic RPC to check if the device is still there.
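That workaround can be sketched roughly as follows (all names except the `thread.is_alive()` check itself are hypothetical stand-ins, not `netconf_client` code): before issuing a periodic RPC, verify that the receive-loop thread is still running, and fail fast if it isn't.

```python
import threading

class FakeSession:
    """Hypothetical stand-in for netconf_client's Session: it owns a
    recv-loop thread that silently dies when the socket raises."""
    def __init__(self):
        self.thread = threading.Thread(target=lambda: None)
        self.thread.start()

def safe_rpc(session, do_rpc):
    # If the recv loop is dead, raise immediately instead of queueing an
    # RPC that would hang in sendall against a dead device.
    if not session.thread.is_alive():
        raise ConnectionError("receive loop is dead; reconnect before RPC")
    return do_rpc()

session = FakeSession()
session.thread.join()  # the stand-in thread exits immediately, so it is dead
try:
    safe_rpc(session, lambda: "<rpc-reply/>")
    ok = True
except ConnectionError:
    ok = False
print(ok)  # → False
```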
See Paramiko Transport constructor at:
https://github.com/paramiko/paramiko/blob/main/paramiko/transport.py#LL454-L457
The result of this is that `netconf_client` SSH connections that should time out can instead hang forever.
Working on further investigation/resolution, putting this here for situational awareness.