eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds

Communication problems between two dds demo running on two computers #1699

Closed aohwang closed 1 year ago

aohwang commented 1 year ago

Hi friends,

I have been using CycloneDDS in several projects, and it behaves excellently. Recently I tried release version 0.10.2 in my testing environment, but I couldn't make it work.

There are two nodes in my testing environment: one Win 10 desktop (192.168.4.23) and an embedded development board (192.168.4.127).

The testing programs are simple: the program running on the Win 10 desktop sends samples repeatedly, and the program running on the embedded development board just receives them.

I enabled the finest log level in the two programs and captured the network traffic using Wireshark. After walking through the logs and the traffic, I found some strange behaviours.

The program running on node 192.168.4.127 is trying to write SEDP messages, yet no corresponding packets appear on the network. You can see it in the picture. image

Most of the captured packets go from 192.168.4.23 to 192.168.4.127. The only packets sent from 192.168.4.127 are SPDP packets and some Heartbeat and ACKNACK messages.

Could you give me some suggestions about how to fix the problem? Anything would be appreciated.

Please remove the ".txt" suffix from rtps_capture.pcapng.txt and open it with Wireshark.

node1_192.168.4.127.txt node2_192.168.4.23.txt rtps_capture.pcapng.txt

eboasson commented 1 year ago

Something strange is going on indeed. If I am not mistaken, the following packets in the wireshark capture:

image

correspond to the ones logged by the recv thread in the trace on 192.168.4.127:

grep 'recv: HDR' ~/Downloads/node1_192.168.4.127.txt
        49.116666 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 408 from udp/192.168.4.23:50851
        49.500000 [0]       recv: HDR(11071ce:22a6170:c777bba1 vendor 1.16) len 620 from udp/192.168.4.127:58979
        49.733333 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 184 from udp/192.168.4.23:50851
        50.533333 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 68 from udp/192.168.4.23:50851
        50.816666 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 544 from udp/192.168.4.23:50851
        51.133333 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 100 from udp/192.168.4.23:50851
        51.300000 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 68 from udp/192.168.4.23:50851
        51.433333 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 100 from udp/192.168.4.23:50851
        51.650000 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 68 from udp/192.168.4.23:50851
        51.800000 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 100 from udp/192.168.4.23:50851

(Look at the sizes of the UDP payloads.) The odd thing is that the timing according to 192.168.4.127 is very different from that in the Wireshark capture, so I am tempted to think that something fishy is going on with the clock.
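One rough way to line up trace lines with capture packets is to match on payload length and source address. A small sketch, assuming the `recv: HDR(...) len N from udp/IP:port` line format shown in the excerpt above (the regex is a best-effort guess at that format):

```python
import re

# Matches trace lines like:
#   49.116666 [0]  recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) len 408 from udp/192.168.4.23:50851
LINE_RE = re.compile(
    r"(?P<ts>\d+\.\d+)\s+\[\d+\]\s+recv: HDR\(.*?\)\s+"
    r"len (?P<len>\d+) from udp/(?P<src>[\d.]+):(?P<port>\d+)"
)

def parse_recv_lines(text):
    """Extract (timestamp, payload length, source address) tuples from a trace."""
    return [
        (float(m.group("ts")), int(m.group("len")), m.group("src"))
        for m in LINE_RE.finditer(text)
    ]

sample = ("49.116666 [0]       recv: HDR(1103f97:947c09cb:79f7726d vendor 1.16) "
          "len 408 from udp/192.168.4.23:50851")
print(parse_recv_lines(sample))  # → [(49.116666, 408, '192.168.4.23')]
```

Feeding both files through something like this makes the length-based correlation (and the clock skew between the two timelines) easy to see side by side.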

Something else is strange about that SEDP message as well. There are HEARTBEAT messages informing 192.168.4.23 that it exists (packet 49 in the capture, for example). This immediately results in a request for a retransmit (packet 50):

image

But the retransmit is nowhere to be seen in the log file. 192.168.4.23 does try again, but gets no response.

I'm wondering if it has something to do with the apparent inability to keep up. There is a failure mode where the tev thread is completely occupied handling timed events and never gets around to sending messages. I've only seen it in a case where events were rescheduled 1 ns into what used to be the future (https://github.com/eclipse-cyclonedds/cyclonedds/pull/1611), but I suppose that if the thread can't keep up at all, it could happen here too.
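That starvation mode can be illustrated with a toy event loop. This is an illustration only, not Cyclone's actual tev implementation: a handler reschedules itself an interval that the clock has already outrun, so the event is always due again and the send path never runs.

```python
import heapq

def run_event_loop(max_iters=1000):
    """Toy timed-event loop illustrating starvation: an event rescheduled
    into what is effectively the past monopolizes the loop, so the
    lower-priority work (sending queued messages) never gets a turn."""
    now = 0
    events = [(0, "spdp")]   # (due time, event name) min-heap
    handled = 0
    sends_done = 0
    for _ in range(max_iters):
        if events and events[0][0] <= now:
            due, name = heapq.heappop(events)
            handled += 1
            # Buggy reschedule: +1 tick, but the clock advances by 2,
            # so the event is immediately due again on every iteration.
            heapq.heappush(events, (due + 1, name))
            now = due + 2
        else:
            sends_done += 1  # only reached when no event is due
    return handled, sends_done

handled, sends_done = run_event_loop()
print(handled, sends_done)  # → 1000 0: the event handler starves the send path
```

With the reschedule interval fixed (or the loop bounded per wakeup), `sends_done` would climb instead of staying at zero, which is essentially what the fix in the PR above addresses.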

If so, disabling tracing and using a release build might help. (The tev loop could/should also be rewritten to avoid starvation.)
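For reference, tracing can be turned off through the standard CycloneDDS XML configuration; a minimal fragment along these lines (element names per the Cyclone config schema, worth double-checking against your version's documentation):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <Tracing>
      <!-- "none" disables trace output entirely; "finest" is what produced
           the logs attached to this issue -->
      <Verbosity>none</Verbosity>
    </Tracing>
  </Domain>
</CycloneDDS>
```

Pointing the `CYCLONEDDS_URI` environment variable at a file like this applies it to both test programs without rebuilding.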

But it really is just guesswork on my side.

aohwang commented 1 year ago

Thanks @eboasson , I have solved the problem.

The first odd thing you mentioned is the time inconsistency: time synchronization on the embedded development board was not working.

You reminded me that the tev thread may not be able to keep up. The operating system running on the board is VxWorks. I checked the CPU usage and found it fully occupied by one task, so the tev thread got no chance to run. After I stopped that task, everything worked. Thanks for your detailed response. :)

eboasson commented 1 year ago

Thanks for letting me know @aohwang, happy that it got solved so easily.

And thank you for confirming that it (still) works on VxWorks as well ☺️