gazebosim / gz-transport

Transport library for component communication based on publication/subscription and service calls.
https://gazebosim.org
Apache License 2.0
29 stars 44 forks source link

"Host unreachable" error message in service server when the client times out #243

Open peci1 opened 3 years ago

peci1 commented 3 years ago

Environment

Description

I get spurious errors when calling services:

error sending response: Host unreachable

The whole simulation runs on localhost, so it's weird it says host unreachable. Maybe the network card gets micro-resets and when ZMQ hits exactly the micro-reset, it fails sending? (dmesg doesn't show any micro-resets)

I think the service replier should try harder, with e.g. 3 attempts, around the lines

https://github.com/ignitionrobotics/ign-transport/blob/133dcf0379e1d6231e30dc6cb0bf1692fe13c4f1/src/NodeShared.cc#L787-L850

peci1 commented 3 years ago

From time to time, I see this in dmesg:

TCP: request_sock_TCP: Possible SYN flooding on port 43893. Sending cookies.  Check SNMP counters.

However, it's not too frequent and I'm unable to say whether it is from Gazebo or some other traffic...

peci1 commented 3 years ago

I tried the suggestions from https://access.redhat.com/solutions/30453 and they do not help preventing these SYN flood messages.

peci1 commented 3 years ago

I added some debug prints when it happens.

My test application is SubT artifact validator with some custom nodes running around:

https://github.com/osrf/subt/blob/5023446a033ab6710e966a3ce0582288ed143b11/subt_ign/src/ArtifactValidator.cc#L382

This happens (sometimes) when I call the move_to service:

[Msg] Moving to: backpack_2
        * Connected to [tcp://127.0.0.1:40005] for sending a response
0
err try 0 res 0
err try 1 res 0
err try 2 res 0
err try 3 res 0
err try 4 res 0
1
NodeShared::RecvSrvRequest() error sending response: Host unreachable host tcp://127.0.0.1:40005

You can see I even ran this instance with IGN_IP=127.0.0.1 to rule out any network adapter issues. It did not help. You can also see in the log that it executed 5 retries of the first send (my custom debugging attempt), each with a 1 second delay. It apparently does not help. This is really weird as other networking services are not interrupted (I'm doing it over SSH and no problem). Can it be some ZeroMQ resource exhaustion?

peci1 commented 3 years ago

Ahh, I think I got it. The host unreachable error on the service server side means that the client timed out before the service finished and closed the connection. It is kind of expected that the server will have no endpoint to send the response to, but the error message is quite confusing, as the host is pretty normally reachable, but the port is not. The error text however seems to be coming from ZeroMQ, which, in turn, translates it from the EHOSTUNREACH code returned from socket syscall.

I'm not really sure if something can be done about this, but a better error message would definitely be welcome.

peci1 commented 3 years ago

Steps to reproduce:

  1. add std::this_thread::sleep_for(std::chrono::milliseconds(10000)); to https://github.com/ignitionrobotics/ign-transport/blob/3ae2925d2a5c6c2b57662d37a64b678ff6af722c/example/responser.cc#L29 .
  2. Build examples
  3. run ./responser
  4. run ./requester

Requester has timeout 5 secs, responser will take 10 secs with the change from step 1.

peci1 commented 3 years ago

Oh no, the spurious errors are back now that I've increased all timeouts to 100 seconds (which is way more than should be needed). I've seen it happen that the service finishes in about 1 second (I have a log print right before return true in the callback) and then I get the "host unreachable" error even though the client is still there waiting for 99 more seconds... So back to the beginning :(

peci1 commented 3 years ago

I've tried to workaround the service call by a pair of topics, and I had a problem even with those. A lot of messages get lost. So I forced them to repeat (both request messages and response messages) and now it works with a few re-sends on some service calls.

What's interesting is that some subscribers get into an invalid state after some time. I'm spawning ign topic -e -t ... -n 1 via subprocess.call from Python. Sometimes, this gets into an invalid state, so that the call does not exit even though messages are published to the topic and another ign topic -e echoes the messages correctly. Either I'm doing Gazebo something really unpleasant, or there is something really wrong in the transport library. This might be the same problem as SubT sees with breadcrumb dropping commands being lost sometimes.

peci1 commented 3 years ago

Here's an example of topic statistics when I substituted the service with repeated topics:

stats.txt

I see lost messages being reported there.

The publishers in this case are ign topic --pub calls from Python via subprocess.check_output().

caguero commented 3 years ago

Out of curiosity, are you publishing messages very fast? It's expected to loose messages if there's a fast publisher or a slow subscriber because their buffers will overflow. That's a parameter that we can set in Ignition via environment variables. See https://ignitionrobotics.org/api/transport/10.0/envvars.html (IGN_TRANSPORT_RCVHWM and IGN_TRANSPORT_SNDHWM). Maybe for this test both parameters should be set to 0 (infinite) to rule out buffer overflow situations.

peci1 commented 3 years ago

It depends on what you mean by fast. I have 20 hi-res cameras running, each publishing RGB and Depth at about 0.3 Hz (so it's rather large than fast). Simulation is running almost realtime because I'm using a triggered sensors system instead of the continuously producing one. It is the trigger commands for the customized sensor system that get lost often, but I send these only at 0.3 Hz (sometimes also set_pose service calls get lost). So the only fast topics should be transforms (probably at 250 Hz).

Thanks for pointing out the watermarks, I haven't known about them. I'll give them a try.

peci1 commented 3 years ago

I still see dropped messages even with infinite buffers (in the topic-based workaround).

peci1 commented 3 years ago

I even tried lowering the RTF of the simulation, but even that did not help (though the part calling the triggered sensors system is RTF-independent and goes as fast as the sensors render). Dropped messages are still there.

How is it with ign-transport's threading model? Is it like in ROS, that each subscriber can only process one task at a time? And the per-subscriber queue size from ROS is substituted by the per-process queue configured by watermarks? My trigger-controlling callbacks do not block (as well as set_pose if setting a value to a ECM component is not blocking), so it should be no problem, but maybe detailed knowledge of the threading model could help here...

caguero commented 3 years ago

There's a dedicated thread in NodeShared.cc running the RunReceptionTask() function in an infinite loop. This thread is pretty much waiting on a poll() call that wakes up each time there's activity in some of the relevant zmq sockets. Depending on the socket, we'll decide if this is a msg update, a service request or an incoming service response. As an example, if there's activity on the msg socket, we parse all the fields from the socket, unserialize the message, and execute all the subscriber callbacks. This is all happening in this dedicated thread, not in a user thread.

peci1 commented 3 years ago

Thanks for the explanation, you could turn it into a tutorial =) Or even better - there should really be a tutorial explaining the differences in topic/service concepts between Ignition and ROS, because this isn't the first time I was confused that something that seems like a ROS concept basically is very different.

If I got it correctly, NodeShared is a per-process singleton. So there's no way of getting multiple independent callback queues like in ROS? Or is it just the deserialization that happens in a single thread and the callbacks are processed on some queues?

If there's really only a single queue for everything, then I see that my original concept with a service call triggering a rendering loop which takes about 2 seconds wasn't really the best idea... Is there a proper way around it? Does it mean blocking services are basically forbidden? I really don't like solutions of type "let's reimplement TCP functionality on a UDP layer"... (which is what I tried with the topic-based service call workaround).

caguero commented 3 years ago

We're mixing a lot of issues in this ticket :) but is there any way you could share with me an IgnitionTransport-only example where the messages are dropped?

peci1 commented 3 years ago

Yes, we are, but I thought these information could help me understand the problem better.

Do you have an idea on how to convert the whole Gazebo launch file to a transport-only example? Would Transport Log help here, or does it not record service calls?

caguero commented 3 years ago

Logs don't record service calls and pretty much anything that let us narrow the issue without the extra complexity of SubT or Gazebo in our way will help.