esitarski / CrossMgr

Cyclo Cross Management Application
MIT License
40 stars 20 forks source link

CrossMgrImpinj connection to CrossMgr can freeze when restarting RIFD test too quicky #95

Closed kimble4 closed 1 year ago

kimble4 commented 1 year ago

Description Very occasionally, if you stop and re-start the RFID test too quickly, CrossMgrImpinj gets into a wedged state where it stops communicating with CrossMgr. (Communication with the tag reader continues normally.) The frozen connection does not time out, even after CrossMgr is closed. If you restart CrossMgrImpinj, everything works normally.

Steps to reproduce the behavior Start CrossMgrImpinj and CrossMgr. Perform an RFID test. Repeatedly stop and restart the RFID test until CrossMjrImpinj's right hand window shows a frozen handshake with CrossMgr. Note that nothing you can do from the CrossMgr side restores communication with CrossMgrImpinj.

Expected behavior CrossMgrImpinj should be able to determine the connection is frozen, close and re-attempt to establish the connection to CrossMgr. Failing that, CrossMgr should not allow you to re-start the RFID test immediately after stopping one.

Screenshots See video capture here: https://youtu.be/ATKShKZLIzc I manage to get it to freeze at 32 seconds in.

Desktop (please complete the following information): Windows 11 CrossMgrImpinj 3.0.16 CrossMgr 3.1.05

Additional context This occurred at last weekend's BHPC event, we suspect due to an accidental double-click on the "Stop RFID Test" button. Due to Murphy's Law, as everyone who understands the system was in the race, an unskilled volunteer started the race in CrossMgr, and did not realise there was a problem for about 20 minutes.

(We were running the previous version of CrossMgr/CrossMgrImpinj, but I've upgraded and confirmed the bug persists in the current release.)

esitarski commented 1 year ago

Thanks for this. After re-reading the documentation, I came across this:

Timeouts and the connect method The connect() operation is also subject to the timeout setting, and in general it is recommended to call settimeout() before calling connect() or pass a timeout parameter to create_connection(). However, the system network stack may also return a connection timeout error of its own regardless of any Python socket timeout setting.

Sure enough, I was setting the timeout after the connect call, not before. This caused the connect call to block, which is what I believe is causing the freeze. The timeout is now set before the connect call. That way, if CrossMgr fails to respond to the connection, CrossMgrImpinj will try again in 2 sec. Yeesh. Socket programming is hard.

On Fri, Sep 23, 2022 at 11:37 AM Kim Wall @.***> wrote:

Description Very occasionally, if you stop and re-start the RFID test too quickly, CrossMgrImpinj gets into a wedged state where it stops communicating with CrossMgr. (Communication with the tag reader continues normally.) The frozen connection does not time out, even after CrossMgr is closed. If you restart CrossMgrImpinj, everything works normally.

Steps to reproduce the behavior Start CrossMgrImpinj and CrossMgr. Perform an RFID test. Repeatedly stop and restart the RFID test until CrossMjrImpinj's right hand window shows a frozen handshake with CrossMgr. Note that nothing you can do from the CrossMgr side restores communication with CrossMgrImpinj.

Expected behavior CrossMgrImpinj should be able to determine the connection is frozen, close and re-attempt to establish the connection to CrossMgr. Failing that, CrossMgr should not allow you to re-start the RFID test immediately after stopping one.

Screenshots See video capture here: https://youtu.be/ATKShKZLIzc I manage to get it to freeze at 32 seconds in.

Desktop (please complete the following information): Windows 11 CrossMgrImpinj 3.0.16 CrossMgr 3.1.05

Additional context This occurred at last weekend's BHPC event, we suspect due to an accidental double-click on the "Stop RFID Test" button. Due to Murphy's Law, as everyone who understands the system was in the race, an unskilled volunteer started the race in CrossMgr, and did not realise there was a problem for about 20 minutes.

(We were running the previous version of CrossMgr/CrossMgrImpinj, but I've upgraded and confirmed the bug persists in the current release.)

— Reply to this email directly, view it on GitHub https://github.com/esitarski/CrossMgr/issues/95, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGXKMDUAOXBXCFFXQ6SV3V7XFFLANCNFSM6AAAAAAQUC6YJQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

--

Edward Sitarski

kimble4 commented 1 year ago

I've done a quck test (without a tag reader connected) using CrossMgrImpinj_3_0_16-20220927155532.exe and it seems to think it's still connected to CrossMgr even after CrossMgr is closed, which doesn't seem right.

I'll test more thoroughly when I've got time to dig the tag reader out and set it up.

kimble4 commented 1 year ago

Okay, tried CrossMgrImpinj 3.0.16-20220927155532 with the tag reader connected. Again, repeated restarting of the RIFD test causes CrossMgrImpinj to freeze while the connection is initialised, which then does not time out and reconnect:

https://www.youtube.com/watch?v=qjPKpJApzLo

In the absence of this freeze condition, it correctly times out the connection if CrossMgr goes away, but I note that it only seems to notice the connection is dead when a tag read comes in:

https://www.youtube.com/watch?v=W1kF2lZf7Ug

(If no tag reads arrive, as when I posted the comment above, it appears that the connection never times out.)

Could this be a threading issue, perhaps?

esitarski commented 1 year ago

Case 1:

I enhanced the code to handle all possible exceptions with the CrossMgr socket, not just timeouts.

Perhaps the turn/on/off behavior caused an unhandled exception, crashing the IO thread in CrossMgrImpinj.

I am finding this very difficult to reproduce as the CrossMgr socket has to be shut down during the handshake (my computer is very fast?).

The update should be available in about 10 minutes.

Case 2:

Yes, CrossMgrImpinj only realizes CrossMgr is no longer alive when a tag send fails. There isn't a heartbeat check. I will give it some thought.

On Thu, Sep 29, 2022 at 1:28 PM Kim Wall @.***> wrote:

Okay, tried CrossMgrImpinj 3.0.16-20220927155532 with the tag reader connected. Again, repeated restarting of the RIFD test causes CrossMgrImpinj to freeze while the connection is initialised, which then does not time out and reconnect:

https://www.youtube.com/watch?v=qjPKpJApzLo

In the absence of this freeze condition, it correctly times out the connection if CrossMgr goes away, but I note that it only seems to notice the connection is dead when a tag read comes in:

https://www.youtube.com/watch?v=W1kF2lZf7Ug

Could this be a threading issue, perhaps?

— Reply to this email directly, view it on GitHub https://github.com/esitarski/CrossMgr/issues/95#issuecomment-1262587887, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGXKM37K4UCJCHK4TLRN3WAXGTHANCNFSM6AAAAAAQUC6YJQ . You are receiving this because you commented.Message ID: @.***>

--

Edward Sitarski

kimble4 commented 1 year ago

This seems a lot better, when the socket closes during the handshake:

CrossMgr Connection succeeded!
Sending identifier "BHPC-Timing-11584"...
CrossMgr error: [WinError 10054] An existing connection was forcibly closed by the remote host.
Trying to connect to CrossMgr at 127.0.0.1:53135 as "BHPC-Timing-11584"...

And:

CrossMgr Connection succeeded!
Sending identifier "BHPC-Timing-11584"...
Waiting for GT (get time) command from CrossMgr...
CrossMgr error: [WinError 10054] An existing connection was forcibly closed by the remote host.
Trying to connect to CrossMgr at 127.0.0.1:53135 as "BHPC-Timing-11584"...

Whereas a graceful shutdown gets:

CrossMgr error: [WinError 10053] An established connection was aborted by the software in your host machine.

rather than the 10054 error above, for reasons that probably make sense to Windows (I assume depending on whether the socket closure handshake completes properly), so it would appear that it was an unhandled exception.

I've been clicking the start/stop button like a maniac for about 15 minutes now, successfully interrupting the handshake several times, and haven't had it freeze at all, so that looks like it's fixed.

The lack of a heartbeat is mostly a UI issue at this point, as it will reconnect when tag reads arrive. I suppose that might be more important if you're running CrossMgrImpinj on a different machine to CrossMgr (for multiple start/finish lines or whatever). I work on the principle that you can usually trust localhost to work :)

esitarski commented 1 year ago

Clearly, Windows throws at least two exceptions (10053 and 10054) when a connection is dropped. I was only handling one, which caused the thread to exit when it got the other one. Handling all exceptions is the way to go - it's anyone's guess what other exceptions Windows is capable of ;)

Thanks for all your help to get this resolved and making the software more robust.

On Fri, Sep 30, 2022 at 10:34 AM Kim Wall @.***> wrote:

This seems a lot better, when the socket closes during the handshake:

CrossMgr Connection succeeded! Sending identifier "BHPC-Timing-11584"... CrossMgr error: [WinError 10054] An existing connection was forcibly closed by the remote host. Trying to connect to CrossMgr at 127.0.0.1:53135 as "BHPC-Timing-11584"...

And:

CrossMgr Connection succeeded! Sending identifier "BHPC-Timing-11584"... Waiting for GT (get time) command from CrossMgr... CrossMgr error: [WinError 10054] An existing connection was forcibly closed by the remote host. Trying to connect to CrossMgr at 127.0.0.1:53135 as "BHPC-Timing-11584"...

Whereas a graceful shutdown gets:

CrossMgr error: [WinError 10053] An established connection was aborted by the software in your host machine.

rather than the 10054 error above, for reasons that probably make sense to Windows (I assume depending on whether the socket closure handshake completes properly), so it would appear that it was an unhandled exception.

I've been clicking the start/stop button like a maniac for about 15 minutes now, successfully interrupting the handshake several times, and haven't had it freeze at all, so that looks like it's fixed.

The lack of a heartbeat is mostly a UI issue at this point, as it will reconnect when tag reads arrive.

— Reply to this email directly, view it on GitHub https://github.com/esitarski/CrossMgr/issues/95#issuecomment-1263656317, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGXKIUZOPDKKKDJIPMBSDWA327DANCNFSM6AAAAAAQUC6YJQ . You are receiving this because you commented.Message ID: @.***>

--

Edward Sitarski