unexpected behavior - Githubissues

eclipse-cyclonedds / cyclonedds-cxx

Other

95 stars 75 forks source link

unexpected behavior #190

Open zephod77 opened 2 years ago

zephod77 commented 2 years ago

I took the helloworld example and modified it to use my own idl and then changed the publisher to loop to publish 5 messages and the subscriber to listen forever. I start the subscriber in one window and it waits for messages. I start the publisher in another window and it publishes 5 messages, including passing writer.publication_matched_status().current_count() > 0 but the subscriber never sees any data.

Then I modified my publisher to add a 50ms wait at the end of the loop. This time it publishes 1 message and then waits in the current_count() > 0 loop forever the seconds time around. The subscriber now sees one message and prints out the expected values.

Can someone explain what is happening here? Is the subscriber code not set up to receive more that 1 message and why does adding a delay to the publisher loop make a difference to the subscriber?

eboasson commented 2 years ago

The subscriber has the default QoS, which means it has a "keep last (with depth) 1" history. "Keep last N" means that it will never store more than N samples for a given key value (= "instance", which is a term you will find in DDS all the time). Once there are N samples in there and another one arrives, the oldest is discarded to make room. (And yes, this is DDS so of course there are exceptions ... for example when notifying the application that there are no writers for that instance anymore.)

If you publish 5 samples in quick succession, there is a good chance that the 5 arrive in between two take operations, and especially so if you poll the reader once every 20ms. If instead of polling you use a waitset or a listener to get a trigger, the likelihood of seeing all of them increases, but you still don't get a guarantee. If you increase the history depth, this changes, and if you must see all samples that are delivered to the reader history cache, you need to use a "keep all" setting. And beware: I write "all samples delivered to the reader history cache", rather simply "all samples" because the writer also has a history cache with a history QoS.

Furthermore, and again because the subscriber has the default QoS, the reader is a best-effort reader. The "keep last 1" default is certainly understandable for global data space and for real-time control systems; the choice to make readers best-effort by default is, I think, an inane choice. But it is the spec, and deviating from the spec for default QoS settings that all DDS implementations have adhered to for over 15 years seems a recipe for confusion as well ...

Now, you're on loopback (so no packet loss with a tiny amount of data that easily fits in the socket buffers), and besides you waited with publishing data until the writer had discovered the reader. In other words, why would the reader not receive the data?

The reason is that "discovery" is not an indivisible process. Each process publishes which readers and writers it has using two built-in data writers and subscribes to the other processes' readers and writers using two built-in data readers. These built-in endpoints are all transient-local endpoints, so that the data describing the already existing endpoints are also communicated to newly discovered participants. But all 4 discovery flows (R,W from A to B; and R,W from B to A; you only have 2 in effect because you only have a reader in one and a writer in the other) operate independently and so it is quite possible that the publishing application discovers the reader before the subscribing application discovers the writer. If the data is sent during this window, the receiver will drop the data because it comes from an unknown writer and therefore can't be interpreted. If the reader is reliable, retransmits will take care of it for the data still present in the writer history cache (which is keep-last-1 by default!), but if the reader is best-effort, the data will never arrive.

The only bit I find surprising is that it would be so reproducible. Discovery happens quite quickly for small systems like this, and there are sleeps everywhere. Still, it is definitely possible, and since I have no reason to suspect the underlying mechanisms, this is what I expect is happening. And if I am right in that, changing the QoS to be reliable, keep-last-5 should make all the difference. It is quite terribly that the examples don't contain an example for that ... but you can copy it from e.g. https://github.com/eboasson/i11eperf/blob/dd79e487bc33c072604abcf0df02881dffbedafa/src/csub.cpp#L68.

zephod77 commented 2 years ago

Thanks for the explanation. I have it working now. I think the main problem was that I was not understanding what writer.publication_matched_status().current_count() does. I thought it was the number of readers that had read the message whereas it seems that it is the number of readers that are available to read the message.

So now I'm back to trying to figure out why my subscriber does not see the same message from the OpenSplice application. Using 'netstat -tunlp | grep mySub' I can see that both my new cyclone app and my old OpenSplice app have UDP ports 7650 and 7651 open. I'm guessing from your explanation that these are the ports used for discovery? I can change the port number by changing the domainId so I pretty sure I have the correct domainId.

That leave the QoS which is supposed to be the default in both cases but clearly something id mis-aligned.

eboasson commented 2 years ago

Thanks for the explanation. I have it working now.

Happy to hear that!

I think the main problem was that I was not understanding what writer.publication_matched_status().current_count() does. I thought it was the number of readers that had read the message whereas it seems that it is the number of readers that are available to read the message.

That would explain it, yes; and your new interpretation is correct. (With the caveat that it is from the perspective of the writer, which matters if those readers have a reliability QoS setting of "best-effort".)

Using 'netstat -tunlp | grep mySub' I can see that both my new cyclone app and my old OpenSplice app have UDP ports 7650 and 7651 open. I'm guessing from your explanation that these are the ports used for discovery? I can change the port number by changing the domainId so I pretty sure I have the correct domainId.

Close, but not quite 🙂 7650 is indeed used for multicast discovery data, but 7651 is the used for multicast application data. Both will also have some other sockets in the "ephemeral" port number range, so ≥ 32768 that they use for sending and for receiving unicast traffic. The spec defines port numbers for unicast (7660 & 7661, 7662 & 7663, ...) but using those fixed port numbers are only useful if multicast discovery is not available or a firewall is needed. So rather than squat on a huge range of port numbers, Cyclone by default tries to be nice to the rest of the world.

That leave the QoS which is supposed to be the default in both cases but clearly something id mis-aligned.

Most likely. There are some things you can do, apart from studying the sources. None of them as pleasant as it should be because Cyclone's tooling needs some work:

Use the Python ddsls to dump the readers and writers along with their QoS settings, then look for the differences. (@yixianlv @thijsmie we didn't add a flag to only show readers and writers with QoS mismatches, did we?)
Configure Cyclone to write a trace with (at least) discovery info in it, and do the same. This doesn't get you anything that the ddsls doesn't do, except you don't need to install python. It is fairly easy to write the file (`CYCLONEDDS_URI="fine</></>" ...) will do, but it is not very readable. I wouldn't actually recommend it!
Read out the RequestedIncompatibleQosStatus or OfferedIncompatibleQosStatus status (which one depends on whether you have a reader or a writer) and look at last_policy_id (it has a name() method I think). It only gives you the most recent mismatching QoS setting detected, but in this case, that'd probably be enough.
Look at the sources 🙂 If you're using this it overrides the default QoS in number of ways, and the ones that are likely-to-certain to cause trouble are using a different partition and using a transient topic. (A transient writer with a volatile reader will work, the other way around won't.)

thijsmie commented 2 years ago

I'll respond here to the extra question, even though slightly unrelated. We don't have a QoS-mismatch-only mode in ddsls now but it should be relatively straightforward to implement. I have opened an issue for it.

zephod77 commented 2 years ago

Thanks for the info but like I said, this is an old system and I have python 2.7.5. ddsls says it needs at least 3.6.

I didn't understand your 2nd suggestion (Docs?) so I'll try the 3rd suggestion next.

eboasson commented 2 years ago

@zephod77 I didn't realize the python version would be a problem. Anyway, I forgot some other options you have, because OpenSplice itself also has tooling and there is also a little bit in the way of open-source tooling for it as well:

The Tuner allows you to inspect the DataWriter on the OpenSplice side, where you can look at the QoS mismatches (the docs should describe how to do this);
If you have the Tester tool, I believe it can show you the mismatches directly in the GUI, but I have used it even less than the Tuner;
There are also some open-source tools for OpenSplice that I wrote once upon a time. lsbuiltin is a bit like Cyclone's ddsls, but this is C code that should be straightforward to build given an OpenSplice installation.

I've done a first bit of documentation for my 2nd option some time ago, but I have to extend it a bit further and proof-read it a bit before doing a PR. I use that route often, but I know exactly what to look for and have developed my bag of tricks to quickly extract the useful data, and that's a skill most people fortunately have had no need to develop.

zephod77 commented 2 years ago

Unfortunately, I don't have any tools such as lsbuiltin or tuner or tester in this installation. The target machine is headless and doesn't have a GUI and it runs OpenSplice on a VLAN. I added:

status1 = reader.requested_incompatible_qos_status();
status2 = reader.subscription_matched_count();

and then print out the total_count() whenever it changes and whenever it reads a sample. (last_policy_id is a dds::core::policy::QosPolicyId which is typedef's to a uint32_t in types.hpp, ie no name() method)

With my OpenSplice apps running, I start my subscriber app and nothing prints. If I start my cyclone publisher app, then matched total_count goes to 1 but the incompatible total_count stays at 0. It seems that the cyclone app is not "seeing" anything from the OpenSplice app. With my cyclone sunscriber app running,

# netstat -unlp | grep cyc
udp        0      0 192.168.9.1:51684       0.0.0.0:*                           24973/./cyclonedds_
udp        0      0 0.0.0.0:52074           0.0.0.0:*                           24973/./cyclonedds_
udp        0      0 0.0.0.0:7650            0.0.0.0:*                           24973/./cyclonedds_
udp        0      0 0.0.0.0:7651            0.0.0.0:*                           24973/./cyclonedds_

any doing the same for my OpenSplice app: (I have NetworkInterfaceAddress set to 192.168.9.1 in both configs)

# netstat -unlp | grep mtu
udp        0      0 172.29.55.11:43000      0.0.0.0:*                           1770/mtu
udp        0      0 172.29.55.11:43001      0.0.0.0:*                           1770/mtu
udp        0      0 0.0.0.0:7650            0.0.0.0:*                           1770/mtu
udp        0      0 0.0.0.0:7651            0.0.0.0:*                           1770/mtu
udp        0      0 0.0.0.0:7662            0.0.0.0:*                           1770/mtu
udp        0      0 0.0.0.0:7663            0.0.0.0:*                           1770/mtu

Any idea what the 7662 and 7663 ports are?

eboasson commented 2 years ago

(last_policy_id is a dds::core::policy::QosPolicyId which is typedef's to a uint32_t in types.hpp, ie no name() method)

Must've been cross-eyed ... it doesn't make sense to me that there'd be something like QosPolicyCount and then map it to a plain int32_t. There is a project planned to go through the C++ API with a fine comb to deal with all such bits of nastiness, remove no longer relevant things and add whatever is missing. This clearly is something to look into.

Fortunately:

With my OpenSplice apps running, I start my subscriber app and nothing prints. If I start my cyclone publisher app, then matched total_count goes to 1 but the incompatible total_count stays at 0. It seems that the cyclone app is not "seeing" anything from the OpenSplice app.

No incompatible QoS means it doesn't matter much that it is a but an integer.

The only ways in which there can be no communication in the absence of security:

they don't discover each other at all
topic name or type name mismatch
no overlapping in published & subscribed partitions
RxO QoS mismatch
random bugs (and no, I am not suspecting that)

4 gets you an incompatible QoS notification (which we know not to be the case). Cases 1, 2 and 3 are silent (as is 5, but let's not go there for something as simple as this). If you don't have access to any tooling, then really the best bet is to write a trace file. Both Cyclone and OpenSplice can write these. For Cyclone:

<Tracing>
  <Verbosity>finest</Verbosity>
</Tracing>

in the configuration file pointed to by CYCLONEDDS_URI and for OpenSplice, it is the same inside the DDSI2Service element but in the file pointed to by OSPL_URI. I'll look at them if you post attach them to the issue.

Any idea what the 7662 and 7663 ports are?

Those are the spec'd discovery resp. application data ports for unicast traffic. If multicast discovery works, then there's no real need for any of those and that's why Cyclone doesn't create them by default. Typically multicast will work on Ethernet-like networks.

zephod77 commented 2 years ago

logs.zip The topic I'm trying to read is called TMMAcData. It gets sent at line 5678 of the ddsi2.log file. Thanks for looking.

eboasson commented 2 years ago

Hi @zephod77: the magic incantation rg '(SEDP|WRITER|READER).*TMMAcData' ~/Downloads/logs/cyclonedds.log gives:

/Users/erik/Downloads/logs/cyclonedds.log
225:1639671102.987229 [1] cyclonedds: READER 1106468:e5addb7b:da569369:104 QOS={user_data=0<>,topic_name="TMMAcData_Msg",type_name="MtuDataTypes_v1_0::TMMAcData",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=0:100000000,lifespan=9223372036854775807,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,presentation=0:0:0,partition={""},time_based_filter=0,transport_priority=0,type_consistency_enforcement=1:00000,data_representation=1(0),adlink_entity_factory=1,adlink_reader_lifespan=0:9223372036854775807,adlink_reader_data_lifecycle=9223372036854775807:9223372036854775807,adlink_subscription_keys=0:{}}
1805:1639671117.063575 [1] dq.builtin: SEDP ST0 5e19bbbe:1c5:1:7703 reliable volatile writer: (default).TMMAcData/MtuDataTypes_v1_0::TMMAcData NEW (as udp/239.255.0.1:7651@40 udp/192.168.9.1:7661@40 ssm=0) QOS={user_data=0<>,topic_name="TMMAcData",type_name="MtuDataTypes_v1_0::TMMAcData",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:100000000,lifespan=9223372036854775807,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,ownership_strength=0,presentation=0:0:0,partition={},transport_priority=0,data_representation=1(0),adlink_writer_data_lifecycle=1}

/Users/erik/Downloads/logs/ddsi2.log
485:1639671115.775732/dq.builtin: SEDP ST0 1106468:e5addb7b:da569369:104 best-effort volatile reader: (default).TMMAcData_Msg/MtuDataTypes_v1_0::TMMAcData NEW (as 239.255.0.1:7651 192.168.9.1:58053) QOS={topic=TMMAcData_Msg,type=MtuDataTypes_v1_0::TMMAcData,presentation=0:0:0,partition={},durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=2147483647.999999999,latency_budget=0.000000000,liveliness=0:2147483647.999999999,reliability=0:0.100000000,destination_order=0,history=0:1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=2147483647.999999999,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=2147483647.999999999:2147483647.999999999:0:1:1,relaxed_qos_matching=0,reader_lifespan={0,2147483647.999999999},subscription_keys={0,{}},share={0,},synchronous_endpoint=0}
5899:1639671117.063025/      main: WRITER 5e19bbbe:1c5:1:7703 QOS={topic=TMMAcData,type=MtuDataTypes_v1_0::TMMAcData,presentation=0:0:0,partition={},durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=2147483647.999999999,latency_budget=0.000000000,liveliness=0:2147483647.999999999,reliability=1:0.100000000,destination_order=0,history=0:1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=2147483647.999999999,ownership=0,ownership_strength=0,writer_data_lifecycle={1,2147483647.999999999,2147483647.999999999},relaxed_qos_matching=0,synchronous_endpoint=0}

Fortunately there is a clear difference that doesn't have to do with the QoS settings (because those are not in a very friendly representation, though I can assure you that eventually get to know all those magic constants 😖) and besides we already know it is a not an RxO QoS mismatch issue.

In fact, it is the topic name: in Cyclone you're using "TMMAcData_Msg", and in OpenSplice "TMMAcData"! If you fix that, I expect it will start working.

zephod77 commented 2 years ago

Well, that is strange. I used the same IDL file for both; the ddsi2 app used the C compiler to convert the IDL to code whereas with the Cyclone app I used the C++ compiler. I'm not logged into that machine at the moment but I will check it out this afternoon.

eboasson commented 2 years ago

The IDL file only gives the type names, and those are the same. The topic name is determined by what you pass into dds_create_topic/DDS_DomainParticipant_create_topic/the topic constructor in the C++11 API.

zephod77 commented 2 years ago

Ah!

zephod77 commented 2 years ago

Well, the behavior is now different, but not necessarily better. My cyclone app now gets a seg fault, probably when it receives some data. I've attached a zip file with the tracefile and a backtrace from gdb logs2.zip .

eboasson commented 2 years ago

Ouch. I would say its worse, because it is now doing something that it should absolutely never do: crash. And as you guessed, indeed the very first time (based on the log file) it tries to execute the generated C++ deserialiser for a MtuDataTypes_v1_0::TMMAcData sample (from the stack trace).

#0  0x00007ffff6c09cd3 in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007ffff75123fb in std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) () from /lib64/libstdc++.so.6
#2  0x000000000043af71 in void org::eclipse::cyclonedds::core::cdr::read_string<std::string>(org::eclipse::cyclonedds::core::cdr::basic_cdr_stream&, std::string&, unsigned long)
    ()
#3  0x0000000000439a51 in void org::eclipse::cyclonedds::core::cdr::read<org::eclipse::cyclonedds::core::cdr::basic_cdr_stream>(org::eclipse::cyclonedds::core::cdr::basic_cdr_stream&, MtuDataTypes_v1_0::TMMAcData&) ()
#4  0x0000000000437606 in bool deserialize_sample_from_buffer<MtuDataTypes_v1_0::TMMAcData>(unsigned char*, MtuDataTypes_v1_0::TMMAcData&, ddsi_serdata_kind) ()
#5  0x00000000004397ae in ddscxx_serdata<MtuDataTypes_v1_0::TMMAcData>::deserialize_and_update_sample(unsigned char*, MtuDataTypes_v1_0::TMMAcData*&) ()
#6  0x0000000000436e14 in ddscxx_serdata<MtuDataTypes_v1_0::TMMAcData>::getT() ()
#7  0x0000000000433a0f in ddsi_serdata* serdata_from_ser<MtuDataTypes_v1_0::TMMAcData>(ddsi_sertype const*, ddsi_serdata_kind, nn_rdata const*, unsigned long) ()
#8  0x00007ffff7b6ab41 in ddsi_serdata_from_ser (size=<optimized out>, fragchain=<optimized out>, kind=SDK_DATA, type=0x660ac0)
    at /home/developer/develop/acs_src/KACotsCode/cyclonedds/src/core/ddsi/include/dds/ddsi/ddsi_serdata.h:258
#9  get_serdata (tstamp=..., statusinfo=0, justkey=0, sz=<optimized out>, fragchain=<optimized out>, type=0x660ac0)
    at /home/developer/develop/acs_src/KACotsCode/cyclonedds/src/core/ddsi/src/q_receive.c:1895
#10 remote_make_sample (tk=0x7ffff184d558, gv=0x650c98, type=0x660ac0, vsourceinfo=<optimized out>)
    at /home/developer/develop/acs_src/KACotsCode/cyclonedds/src/core/ddsi/src/q_receive.c:1944

Unfortunately it doesn't say anything about the addresses involved, that would perhaps have allowed an educated guess at the reason for the crash.

In the ddsi2.log of the first set of logs I found a hint to the type, and it seems to be really simple, just a bunch of strings followed by an enum. That's so simple that no edge cases are involved, that the IDL compiler should not have any difficulty dealing with it, and that OpenSplice can reasonably be expected to serialise it correctly. What is a bit weird is that it looks like garbage in that ddsi2.log:

1639671120.879192/ xmit.user: write_sample 5e19bbbe:1c5:1:7703 #1: ST0 TMMAcData/MtuDataTypes_v1_0::TMMAcData:{"\x80]\xaa","\x80]\xaa","\x80]\xaa","\x80]\xaa","\x80]\xaa","\x80]\xaa","\x80]\xaa",4026556673}

Still there is no real restriction on the contents of a string, so why it would crash on that is a mystery.

Would you be able to provide the IDL for just this type and the captured packet that makes it crash? If you have Wireshark, that's the easiest (you probably need to capture the loopback interface). The packet should be easily identifiable with a recent version of Wireshark because it tries to print topic names for DATA submessages nowadays. All packets of a short capture are ok, too, if it is the same to you.

Alternatively, you can ask OpenSplice to write a packet capture file in Wireshark format:

<DDSI2Service>
  <Tracing>
    <PacketCaptureFile>crash.pcap</PacketCaptureFile>
  </Tracing>
</DDSI2Service>

I hope that still works ... Cyclone can do it, too, configured in the same manner, but I'm a little worried that a crash immediately after receiving the packet may mean the packet is still cached somewhere in the C stdio buffers and not actually written to disk yet. But you can try.

Having the packet and the type should allow us to reproduce the crash.

zephod77 commented 2 years ago

Well, I found the problem and it was of my own doing. I had added an unsigned long to the TMMAcData topic to use as a sequence number so I could see if we were losing packets, I rebuilt my app and reloaded it but I forgot to update the topic library that it uses. Once I did that, it started working. (The clue was you had mentioned seeing just stings and an enum, ie no unsigned long) Thank you for your help and I apologize for the noise. It has been very interesting learning about some of these debugging techniques and config file options I did not know about. What are the other tracing options besides finest? Now I need to do some more testing with multi-threaded apps and learn more about qos.