RobotWebTools / rosbridge_suite

Server Implementations of the rosbridge v2 Protocol
https://robotwebtools.github.io
BSD 3-Clause "New" or "Revised" License
897 stars, 514 forks

RosbridgeProtocol instance clean-up hangs when client disconnects under specific conditions. #891

Open ramlab-jose opened 10 months ago

ramlab-jose commented 10 months ago

Description

Under specific conditions that I have not yet been able to pin down, a client disconnection is not handled gracefully: the server keeps trying to forward messages to that client's (closed) websocket and spams errors. This also leaks resources and eventually locks up the rosbridge process.

The problem seems to come from this last part of the Protocol.incoming function. After adding a bunch of logging, it looks like this part blocks the IncomingQueue.run loop, so protocol.finish is never triggered for the affected client.
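The failure mode described above can be sketched in isolation. This is a minimal, hypothetical model and not the rosbridge source: `WorkerQueue`, `handle`, and `on_finish` are made-up names standing in for a per-client queue thread, the message handler, and the clean-up hook. It only illustrates why clean-up that runs after the consumer loop can never happen while the handler itself does not return.

```python
import threading
from collections import deque


class WorkerQueue(threading.Thread):
    """Hypothetical per-client consumer thread (not the rosbridge class)."""

    def __init__(self, handle, on_finish):
        super().__init__(daemon=True)
        self._cond = threading.Condition()
        self._queue = deque()
        self._finished = False
        self._handle = handle        # called once per queued message
        self._on_finish = on_finish  # clean-up hook, runs after the loop exits

    def push(self, msg):
        with self._cond:
            self._queue.append(msg)
            self._cond.notify()

    def finish(self):
        # Only sets a flag; the loop notices it between messages.
        with self._cond:
            self._finished = True
            self._cond.notify()

    def run(self):
        while True:
            with self._cond:
                while not self._queue and not self._finished:
                    self._cond.wait()
                if self._finished:
                    break
                msg = self._queue.popleft()
            # If this call never returns, the finished flag is never
            # re-checked and the clean-up below is never reached.
            self._handle(msg)
        self._on_finish()
```

In this model, calling `finish()` while `handle` is stuck has no visible effect: the thread stays alive and `on_finish` is not called until the handler returns. Any fix therefore has to make the handler itself return, for example by bounding its internal loop or having it check a shutdown flag.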

As mentioned I have yet to find a minimal way to reproduce the problem, but we have frequently encountered this when there are rapid connections/disconnections happening and the rosbridge instance is under load.

I was able to "fix" the problem by improving the behavior regarding the remaining message that is kept in self.buffer here but would like some input on why this is here and how we can fix it properly.

Thanks in advance, and I believe this could explain some of the other issues that went stale in the past.

Steps To Reproduce

I have yet to find a reliable way of reproducing the problem, but from my experience the following conditions seem to trigger the problem:

Expected Behavior

A client disconnecting should always result in the respective RosbridgeProtocol instance (and respective Capabilities) being cleaned up.

Actual Behavior

Sometimes the clean-up (.finish) seems to hang, and resources remain in use against a closed websocket.

daisukes commented 3 months ago

I encountered the same issue and might have found the cause. This loop can keep sending the first element of the queued messages.

https://github.com/RobotWebTools/rosbridge_suite/blob/7d78af16d30d0ffe232abcc65d0928ce90bd61f7/rosbridge_library/src/rosbridge_library/internal/subscription_modifiers.py#L164-L168
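As a hedged illustration of that kind of loop (a generic sketch, not a copy of the linked lines): a drain loop that reads the head of a deque without removing it keeps handling the same message for as long as the queue is non-empty, whereas a version that pops each message terminates even when every send fails. The function names and the `max_iterations` guard are hypothetical; the guard exists only so the buggy pattern can be run safely.

```python
from collections import deque


def drain_peek(queue, send, max_iterations=10):
    """Buggy pattern: reads queue[0] but never removes it.

    Without the max_iterations guard this would loop forever
    whenever the queue is non-empty.
    """
    iterations = 0
    while len(queue) > 0 and iterations < max_iterations:
        try:
            send(queue[0])  # same element every time
        except Exception:
            pass  # a closed socket raises here on every pass
        iterations += 1
    return iterations


def drain_pop(queue, send):
    """Fixed pattern: each message is removed before it is sent."""
    iterations = 0
    while len(queue) > 0:
        msg = queue.popleft()
        try:
            send(msg)
        except Exception:
            pass  # the message is dropped, but the loop still makes progress
        iterations += 1
    return iterations
```

With a `send` that always raises (as it would against a closed websocket), `drain_peek` runs until the guard stops it and leaves the queue untouched, while `drain_pop` runs once per queued message and leaves the queue empty.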

ramlab-jose commented 3 months ago

Hi @daisukes, at the time I managed to reduce the occurrence of this problem, although I never found a way to consistently reproduce it. See this commit for my approach (admittedly not the cleanest).

daisukes commented 3 months ago

Hi @ramlab-jose

Thanks, I will try it as well!

I had very similar issues with my application, which shows a map, laser scans, and the robot's position using TF. The symptoms match what you wrote in the description of this issue:

"Under specific conditions that I have not yet been able to pin down, a client disconnection is not handled gracefully: the server keeps trying to forward messages to that client's (closed) websocket and spams errors. This also leaks resources and eventually locks up the rosbridge process."

I could reproduce it by quickly refreshing my app, but I did not always get the exact symptoms.

I found that when it happens, Python threads get stuck in infinite loops, in two places so far.