eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds

ddsi_udp_conn_write to udp failed with retcode -5 #2031

Closed nabetetsu closed 3 months ago

nabetetsu commented 3 months ago

Hi, when running a ROS Humble application with Cyclone DDS, I get the following error.

<process name>: ddsi_udp_conn_write to udp/192.168.10.21:59746 failed with retcode -5
<process name>: ddsi_udp_conn_write to udp/192.168.10.21:59746 failed with retcode -5
<process name>: ddsi_udp_conn_write to udp/192.168.10.21:59746 failed with retcode -5
....

Based on this error code, I suspected that the send buffer was too small, so I changed its size from "default" to 10MB in the configuration file, and the error stopped occurring.

Note: When I monitored the usage of the send/receive buffers with the ss -pm command, neither Send-Q nor Recv-Q increased from 0.

Question: I would like to tune the send buffer size, as 10MB is too large. Are there any best practices for setting up an appropriate send buffer size ? (and also receive buffer size)
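
One way to check how large a buffer the kernel actually grants is to read the socket option back after setting it. The sketch below is not Cyclone DDS code: the function name `effective_buffer_sizes` is hypothetical, and reading 10MB as 10 * 1024 * 1024 bytes is an assumption made for illustration. On Linux, a request above the net.core.wmem_max or net.core.rmem_max limit is silently capped, and the value read back is typically double the granted size because the kernel includes its own bookkeeping overhead.

```python
import socket


def effective_buffer_sizes(requested_bytes):
    """Request UDP send/receive buffer sizes and report what the kernel granted.

    Hypothetical helper for illustration; it does not touch Cyclone DDS.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sizes = {
            "default_send": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            "default_recv": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        }
        for option in (socket.SO_SNDBUF, socket.SO_RCVBUF):
            try:
                sock.setsockopt(socket.SOL_SOCKET, option, requested_bytes)
            except OSError:
                # Some platforms reject oversized requests instead of capping them.
                pass
        sizes["granted_send"] = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        sizes["granted_recv"] = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sizes


if __name__ == "__main__":
    # Assumption: 10MB is taken to mean 10 * 1024 * 1024 bytes here.
    for name, value in effective_buffer_sizes(10 * 1024 * 1024).items():
        print(f"{name}: {value} bytes")
```

If the granted value is far below the requested one, the system-wide limit is probably what needs raising, not the application setting.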

eboasson commented 3 months ago

Hi @nabetetsu, I guess you reasoned that -5 is DDS_RETCODE_OUT_OF_RESOURCES, that in this case it means sendmsg returned either ENOBUFS or ENOMEM, and that it therefore looked like the send buffer was too small. Somehow.
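
The chain of reasoning above (errno from sendmsg, then a DDS return code, then a guess about the send buffer) can be sketched as a small lookup. This is an illustration only, not the mapping code used inside Cyclone DDS: the function `retcode_from_errno` is hypothetical, the value -5 is the one quoted in this thread, and the generic fallback of -1 is an assumption.

```python
import errno

# Value quoted in this thread: -5 is DDS_RETCODE_OUT_OF_RESOURCES.
DDS_RETCODE_OUT_OF_RESOURCES = -5
# Assumption for illustration: any other failure maps to a generic error code.
DDS_RETCODE_ERROR = -1
DDS_RETCODE_OK = 0


def retcode_from_errno(err):
    """Hypothetical mapping from a sendmsg errno to a DDS-style return code."""
    if err == 0:
        return DDS_RETCODE_OK
    if err in (errno.ENOBUFS, errno.ENOMEM):
        # Both mean the kernel could not find memory or buffer space for the send.
        return DDS_RETCODE_OUT_OF_RESOURCES
    return DDS_RETCODE_ERROR


if __name__ == "__main__":
    for err in (errno.ENOBUFS, errno.ENOMEM):
        print(errno.errorcode[err], "->", retcode_from_errno(err))
```

Note that neither errno says which resource ran out, which is why "the send buffer is too small" remains a guess.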

The thing is that I can't remember actually having seen this error myself on Linux (I think I have seen it on macOS once or twice). The default socket send buffer is small, and by default the sendmsg call blocks until the network has accepted the data, so the mere fact that you're trying to send some data doesn't usually cause this error. Maybe it happens when there are many processes writing at the same time; from what I remember, that seemed to play a role on macOS. For me, that time, it was all on one machine and I suspected the many-to-one packets over the loopback interface to be the trigger, but it is really just a guess.

If ss -pm doesn't tell you anything, then I guess one has to fly blind.

For the send buffer, I always think it is better if it is kept small, otherwise all you're getting is extra latency. Cyclone can handle the packet drops, so it should work fine despite these errors. That means it might actually be best to not print anything when it happens. It already suppresses some error codes that I learnt I have to expect at https://github.com/eclipse-cyclonedds/cyclonedds/blob/15b3a09b6346c5adc3599bc19aad269f6eb5fd6f/src/core/ddsi/src/ddsi_udp.c#L280 (it would be better to record the fact that it happened and make it visible via some interface, but that takes some work).
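
The idea of recording these errors instead of printing them could look roughly like the sketch below. It is not how Cyclone DDS handles this: the class `CountingSender` and its `dropped` counter are hypothetical, and the choice to treat only ENOBUFS and ENOMEM as expected is an assumption taken from this discussion.

```python
import errno
import socket
from collections import Counter


class CountingSender:
    """Hypothetical wrapper that counts expected send errors instead of logging them."""

    # Assumption: these are the errors worth suppressing, per the discussion above.
    EXPECTED = frozenset({errno.ENOBUFS, errno.ENOMEM})

    def __init__(self, sock):
        self.sock = sock
        self.dropped = Counter()  # errno -> number of datagrams not sent

    def send(self, payload, address):
        """Return True if the datagram was handed to the kernel, False if it was dropped."""
        try:
            self.sock.sendto(payload, address)
            return True
        except OSError as exc:
            if exc.errno in self.EXPECTED:
                # Leave recovery to the reliability protocol; just keep a count.
                self.dropped[exc.errno] += 1
                return False
            raise  # anything unexpected is still reported to the caller


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sender = CountingSender(sock)
        sender.send(b"hello", ("127.0.0.1", 9))
        print("dropped so far:", dict(sender.dropped))
```

A counter like this can be exposed through whatever statistics interface is available, so the drops stay visible without flooding the log several times per second.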

For the receive buffer, it depends. If you want to receive large samples without packet drops, then I would recommend making it large enough to hold the largest sample (plus some "reasonable" overhead). If you have multiple sources sending them at the same time, you might have to make it a bit larger still.
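
As a rough sketch of that rule of thumb, the receive buffer could be estimated from the largest sample, the number of sources expected to send at the same time, and an overhead factor. The function name, the multiplication by the number of sources, and the 1.25 overhead factor are all assumptions for illustration; none of them comes from Cyclone DDS.

```python
def suggested_receive_buffer(largest_sample_bytes, concurrent_sources=1, overhead_factor=1.25):
    """Estimate a receive buffer size that can hold one largest sample per source.

    Hypothetical helper: overhead_factor stands in for the "reasonable" overhead
    mentioned above and is an arbitrary placeholder, not a measured value.
    """
    if largest_sample_bytes <= 0 or concurrent_sources <= 0 or overhead_factor < 1.0:
        raise ValueError("sizes and counts must be positive, overhead_factor >= 1.0")
    return int(largest_sample_bytes * concurrent_sources * overhead_factor)


if __name__ == "__main__":
    # Example: two sources each sending samples of up to 4,000,000 bytes.
    print(suggested_receive_buffer(4_000_000, concurrent_sources=2), "bytes")
```

Scaling linearly with the number of sources is the pessimistic case where every source sends its largest sample at once; a smaller buffer may be enough if that rarely happens.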

If you can afford losing some packets and recovering via retransmits, then Cyclone generally does a decent job even when the receive buffer is much smaller, because it will take the actual receive buffer size into account in deciding how much to retransmit in one go and so it generally avoids overflowing the buffer with the retransmits. Still, you lose bandwidth (to packets lost in the initial transmission) and you incur the latency of some round-trips and some scheduling/processing overhead.

Most of the time, a large receive buffer is fine. If the receive processing can keep up nearly all the time, you won't introduce excessive latency in the buffer, and Linux (and macOS and Windows) only allocates memory that you actually use. If you have 100 processes and 10MB socket receive buffers, you have a worst case of 4GB or so (4 sockets per process), which is still manageable on many machines, but in practice 3 of those sockets will not be handling that much traffic and so won't be using all that memory anyway.
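
The worst-case figure is simple multiplication, spelled out here as a small sketch. The function name is hypothetical and decimal units (1 MB = 1,000,000 bytes) are assumed to keep the arithmetic round.

```python
MB = 1_000_000  # assumption: decimal megabytes
GB = 1_000 * MB


def worst_case_receive_memory(processes, sockets_per_process, buffer_bytes):
    """Upper bound on memory if every receive buffer were completely full."""
    return processes * sockets_per_process * buffer_bytes


if __name__ == "__main__":
    total = worst_case_receive_memory(processes=100, sockets_per_process=4, buffer_bytes=10 * MB)
    print(f"{total / GB:.1f} GB")  # prints "4.0 GB"
```

This is only an upper bound: memory is allocated as the buffers fill, so actual usage depends on how far the receivers fall behind.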

> Are there any best practices for setting up an appropriate send buffer size ? (and also receive buffer size)

Not really ... but perhaps the above helps you a bit in deciding.

nabetetsu commented 3 months ago

Hi @eboasson, it was very helpful to hear your insights. As you guessed, I assumed from the DDS_RETCODE_OUT_OF_RESOURCES error code that sendmsg() had actually failed with an ENOBUFS or ENOMEM error.

I guessed that sendmsg() was running in non-blocking mode and that the error occurred because the send buffer was overflowing, but it seems my assumption was wrong.

> For the send buffer, I always think it is better if it is kept small, otherwise all you're getting is extra latency. Cyclone can handle the packet drops, so it should work fine despite these errors. That means it might actually be best to not print anything when it happens. It already suppresses some error codes that I learnt I have to expect at

I agree with your opinion. If an error is going to be output, it might as well be when the retransmission process at the Cyclone layer finally fails and discards the packets. Because of Cyclone's retransmission process, it seems that I do not have to watch for retcode -5 errors, but since they occur several times per second in my case, I will consider what send buffer size is needed to suppress the error output.

So far, in my case, the problem has not occurred on the receiving side because I have set the receive buffer large enough. However, as you said, if a problem does occur on the receiving side, I understand that the likely cause would be the receiving process not keeping up with the incoming data, and that this should be addressed rather than revisiting the buffer size.