DDS loses packets when transferring large data locally, and sometimes the subscriber stops receiving data entirely

Gummum commented 6 months ago

I am on an embedded device and use DDS_RELIABILITY_BEST_EFFORT. The data I transmit is similar to raw image data. When there are more subscribers, packets get lost, and sometimes a subscriber never receives data again. I checked with gdb: the receiving thread is stuck in the select call.

Gummum commented 6 months ago

I may not have the ability to find out the problem, so I came to ask for some advice.

eboasson commented 6 months ago

Hi @Gummum, yes packet loss is the big issue with "best-effort"[^1].

I am not sure what you mean by "when the subscriber is more", it can happen always and is more likely to happen for bigger samples. Perhaps you meant when you create more subscribers? That could be: that usually causes Cyclone to switch to multicast, and there are many network switches that are more likely to drop multicast than unicast.

So what can you do? Not so much ...

What you can do is try to find out where the packet loss occurs. Is it in the switch, the physical network (multicast on WiFi is notorious!) or in the socket receive buffer? Wireshark can help, and on Linux netstat -s (look for UDP errors) is great for spotting socket receive buffer overruns. If it is the use of multicast, you can disable that in the Cyclone configuration. (Even for specific topics.)
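As a rough illustration of the socket-buffer check, here is a small Python sketch that reads the same UDP counters netstat -s reports. It assumes a Linux system where /proc/net/snmp contains a "Udp:" header row followed by a "Udp:" value row; the function names are made up for this example.

```python
from typing import Dict


def parse_udp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the two 'Udp:' rows (names, then values) of /proc/net/snmp."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}  # unexpected layout: report nothing rather than guess
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def read_udp_counters(path: str = "/proc/net/snmp") -> Dict[str, int]:
    """Read the current system-wide UDP counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


if __name__ == "__main__":
    counters = read_udp_counters()
    # RcvbufErrors counts datagrams dropped because a receive buffer was full.
    print("RcvbufErrors:", counters.get("RcvbufErrors"))
    print("InErrors:", counters.get("InErrors"))
```

Taking one reading before and one after a test run and comparing RcvbufErrors should show whether datagrams were dropped at the socket receive buffer during that run. The counters are system-wide, so other UDP traffic on the device can also contribute.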

If you see socket buffer overruns, then increasing the size of them will probably help. It is a known problem when the data size is similar to or larger than the socket receive buffer. The many packets that make up the data of a large sample are sent in a quick burst, and so if the receiving thread can't keep up (or is a bit late in starting) you can overrun a small socket buffer. (Linux has a default maximum of about 400kB, Cyclone by default asks for 1MB, but by default accepts whatever it gets.)
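To see what "asks for 1MB but accepts whatever it gets" means in practice, the following Python sketch requests a receive buffer size on a plain UDP socket and reads back what the kernel actually granted. It is only an illustration of the operating system behaviour, not Cyclone's own code, and the function name is hypothetical.

```python
import socket


def granted_rcvbuf(requested: int) -> int:
    """Request a UDP receive buffer size and return what the OS reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        # On Linux the kernel caps the request at net.core.rmem_max and then
        # doubles it for bookkeeping, so the reported value can differ from
        # the requested one in either direction.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    requested = 1024 * 1024  # 1 MiB, the size mentioned above
    print("requested:", requested, "granted:", granted_rcvbuf(requested))
```

If the granted value is well below the size of one sample, a burst of fragments can overrun the buffer before the receiving thread gets scheduled, which would match the overruns described above.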

Hopefully this helps a bit. If the loss can't be fixed, then there are still interesting options but they involve distributing the image data over many small samples, and building the processing in such a way that it treats the missing pixels as something like noise. That works nicely if you make sure it is always some other bunch of pixels that are missing ... But that's a very different subject.
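A minimal sketch of that idea, assuming one byte per pixel and hypothetical helper names: each small sample carries every n-th pixel, and the starting offset rotates with the frame number so that a sample lost at the same position hits different pixels in each frame. How the chunks are published and how the frame number travels with them is left out.

```python
from typing import Dict, List


def split_interleaved(pixels: bytes, n_chunks: int, frame_no: int = 0) -> List[bytes]:
    """Chunk k carries every n_chunks-th pixel, starting at a rotating offset."""
    return [pixels[(k + frame_no) % n_chunks::n_chunks] for k in range(n_chunks)]


def reassemble(chunks: Dict[int, bytes], n_chunks: int, length: int,
               frame_no: int = 0, fill: int = 0) -> bytes:
    """Rebuild a frame from the chunks that arrived; lost pixels become `fill`."""
    out = bytearray([fill]) * length
    for k, data in chunks.items():
        start = (k + frame_no) % n_chunks
        out[start::n_chunks] = data  # same stride as in split_interleaved
    return bytes(out)


if __name__ == "__main__":
    frame = bytes(range(10))
    parts = split_interleaved(frame, n_chunks=3, frame_no=1)
    received = {k: p for k, p in enumerate(parts) if k != 1}  # chunk 1 lost
    print(list(reassemble(received, n_chunks=3, length=len(frame), frame_no=1)))
```

With a lost chunk, the gaps are spread evenly over the image instead of forming one missing block, which is what lets the processing treat them like noise.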

[^1]: I don't understand why it is called "best-effort", unless a marketing department got involved at some point. It is "unreliable" or "send-and-forget", definitely not what I consider "best-effort" 🙂

Gummum commented 6 months ago

Thank you very much for your answer. By "when the subscriber is more" I mean that I start multiple subscriber processes that receive the same topic. I use DDS for local inter-process communication, so I don't think it should lose packets or stop receiving data entirely. I set a 2 MB buffer for Cyclone DDS, which I think should be large enough. I guess it's partly due to CPU scheduling, because after I killed the main process on the device, I could start many processes subscribing to the same topic without any problems.

Scout22 commented 2 months ago

Try increasing the receive buffer size in the kernel. These are the lines to use, from the ROS documentation: https://docs.ros.org/en/iron/How-To-Guides/DDS-tuning.html

echo 'net.core.rmem_max=2147483647' | sudo tee -a /etc/sysctl.d/10-dds.conf
echo 'net.ipv4.ipfrag_high_thresh=134217728' | sudo tee -a /etc/sysctl.d/10-dds.conf
echo 'net.ipv4.ipfrag_time=3' | sudo tee -a /etc/sysctl.d/10-dds.conf
sudo sysctl -p /etc/sysctl.d/10-dds.conf

eclipse-cyclonedds / cyclonedds

DDS loses packets when transferring large data locally, and sometimes the subscriber stops receiving data entirely #1993