RobotWebTools / rosbridge_suite

Server Implementations of the rosbridge v2 Protocol
https://robotwebtools.github.io
BSD 3-Clause "New" or "Revised" License
866 stars 506 forks source link

Timeout in service call_service if no response within server_timeout_time #905

Closed michael-cmt closed 5 months ago

michael-cmt commented 5 months ago

Public API Changes N/A

Description Currently, if you attempt to call a service that has no active servers, then the client call will hang indefinitely blocking other calls. This does not appear to follow the intent of the timeout available.

You can test it by creating a client to a random service, e.g.: ros2 service call /Foo std_srvs/SetBool '{data: True}'

Which will result in this "service" being listed even though there are no servers: ros2 service list returns: /Foo

Then you call the websocket on the /Foo service (python)

#!/usr/bin/env python    

from websockets.sync.client import connect    

def hello():    
    with connect("ws://localhost:9090") as websocket:    
        websocket.send("{ \"op\": \"call_service\",\"service\": \"/Foo\"}")          
        print("SENT");    
        message = websocket.recv()       
        print(f"Received: {message}")    

hello()

Before the change, this call will hang indefinitely. After the change, this will return after the timeout period has elapsed.

sea-bass commented 5 months ago

Oh and also, there is a launch parameter in the webserver named call_services_in_new_thread to prevent such hanging. See https://github.com/RobotWebTools/rosbridge_suite/pull/847

Maybe you want to enable this flag to unblock you from your core issue?

michael-cmt commented 5 months ago

So the reason it shows up as listed but is not available is that for whatever reason in ros2 if you make a client to a service, regardless of if any server is actually hosting the service, then it becomes a listed service. So in my case I had a node running that created a client to a well known path, and the node for the service wasn't online. You can simulate this behavior as described above:

in one terminal, run this to create a client to a non-existant service: ros2 service call /Foo std_srvs/SetBool '{data: True}'

in another terminal you can see the service listed: ros2 service list will show /Foo

My first attempt at fixing this involved checking if any servers are available for the listed service, but that requires a function in rclpy (Node::count_services) that is only available in the rolling branch: https://github.com/ros2/rclpy/commit/e3d37c5ac2f4f6e8fe5b63d780cc43c7c9867c58

But really, it makes no sense to me that call_service would be allowed to hang indefinitely, this blocks calls to other services. So I think this is probably a better fix regardless.

It seems odd to me to just put it in another thread as well, because then when a service appears some time later down the road, the intent of that service call is probably no longer valid. This could cause unsafe behavior. For instance, let's imagine a node has crashed and is restarting, which takes some time. A user tries to send a command, but it doesn't go through, so the bridge sits on it queued up. The user goes over to inspect the robot which isn't responding. Then a non-trivial amount of time later, the service appears, and suddenly the robot actuates the command. This doesn't seem like the intended behavior.

michael-cmt commented 5 months ago

Actually, to your point about wait_for_service, the return value isn't being checked. Let me see what that does...

sea-bass commented 5 months ago

Actually, to your point about wait_for_service, the return value isn't being checked. Let me see what that does...

oh wow, yeah, if all it takes is checking whether that is False and breaking out of the function, that would be great (and a hilarious bug)

michael-cmt commented 5 months ago

Yep, I think that will fix it, standby...

michael-cmt commented 5 months ago

OK, this seems to fix it, and I removed my other timeout logic.

However, I'm a belts & suspenders kind of guy, and we're dealing with asynchronous systems, so I'd probably vote to include that timeout logic I just removed regardless assuming I can convince you :D. The reason is that if in between the wait_for_service check and the actual call the service goes down, then you could end up with an edge case of waiting eternally. What are your thoughts?

sea-bass commented 5 months ago

Fantastic -- great find!

I agree it would be nice to time out if the server goes down mid-execution, but we probably need to do our due diligence in ensuring we aren't timing people out when they actually want to wait for a long-running service to respond.

I think minimally to get that working, we'll want a configurable parameter for service response timeouts, and defaulting to something way bigger than 1 second. Meaning, a separate parameter from the server timeout.

Would probably be better suited for a follow-on PR, if you're amenable to try that out?

michael-cmt commented 5 months ago

Yes, that makes sense to me! I like the idea of the user having the ability to configure a call timeout separately.