
OpenDDS is an open source C++ implementation of the Object Management Group (OMG) Data Distribution Service (DDS). OpenDDS also supports Java bindings through JNI.
http://www.opendds.org

ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 ACE_SOCK_Dgram_Mcast::join: Unknown error -5 #1713

Closed: jmccabe closed this issue 1 month ago

jmccabe commented 4 years ago

(Note: I first posted this to https://sourceforge.net/p/opendds/mailman/opendds-main/?viewmonth=202006&style=flat, please delete/close if this is not the right place)

On OpenDDS 3.14, on an embedded target, I'm seeing "ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 ACE_SOCK_Dgram_Mcast::join: Unknown error -5" on an application that (other than a few updates to handle the C++11 style opendds_idl code generation) is the same as one that works with OpenDDS 3.13.3.

The target is an ARM-based embedded Linux system with Avahi installed to acquire an mDNS IP address. It uses rtps with an ini file containing:

[common]
DCPSGlobalTransportConfig=$file
DCPSDefaultDiscovery=TheRTPSConfig

[rtps_discovery/TheRTPSConfig]
SedpLocalAddress=169.254.5.198

[transport/the_rtps_transport]
transport_type=rtps_udp

Using the OpenDDS 3.13.3 version of the code, with DCPSDebugLevel 10, I see:

<.. some application logging ..>
(1978|1978) NOTICE: using DCPSDebugLevel value from command option (overrides value if it's in config file)
(1978|1978) NOTICE: using DCPSDefaultAddress value from command option (overrides value if it's in config file)
(1978|1978) NOTICE: Service_Participant::load_domain_configuration failed to open [domain] section - using code default.
(1978|1978) NOTICE: StaticDiscovery::parse_topics no [topic] sections.
(1978|1978) NOTICE: StaticDiscovery::parse_datawriterqos no [datawriterqos] sections.
(1978|1978) NOTICE: StaticDiscovery::parse_datareaderqos no [datareaderqos] sections.
(1978|1978) NOTICE: StaticDiscovery::parse_publisherqos no [publisherqos] sections.
(1978|1978) NOTICE: StaticDiscovery::parse_subscriberqos no [subscriberqos] sections.
(1978|1978) NOTICE: StaticDiscovery::parse_endpoints no [endpoint] sections.
(1978|1978) NOTICE: Service_Participant::intializeScheduling() - no scheduling policy specified, not setting policy.
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) Spdp::SpdpTransport::open_unicast_socket() - opened unicast socket on port 39490
(1978|1978) Spdp::SpdpTransport::SpdpTransport joining group 239.255.0.1:8400
(1978|1978) DirectPriorityMapper:thread_priority() - mapped TRANSPORT_PRIORITY value 0 to thread priority 0.
(1978|1978) DomainParticipantImpl::enable: enabled participant 0103801f.12691597.07babcb4.000001c1(1494357f) in domain 4
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) DDS::ParticipantBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 13b6d0 with 20 chunks
(1978|1978) DataReaderImpl::enable Cached_Allocator_With_Overflow 13bab8 with 20 chunks
(1978|1978) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1978|1978) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSParticipant) enabled

With the 3.14 version, I see:

<.. some application logging ..>
(1949|1949) NOTICE: using DCPSDebugLevel value from command option (overrides value if it's in config file)
(1949|1949) NOTICE: using DCPSDefaultAddress value from command option (overrides value if it's in config file)
(1949|1949) NOTICE: Service_Participant::load_domain_configuration failed to open [domain] section - using code default.
(1949|1949) NOTICE: StaticDiscovery::parse_topics no [topic] sections.
(1949|1949) NOTICE: StaticDiscovery::parse_datawriterqos no [datawriterqos] sections.
(1949|1949) NOTICE: StaticDiscovery::parse_datareaderqos no [datareaderqos] sections.
(1949|1949) NOTICE: StaticDiscovery::parse_publisherqos no [publisherqos] sections.
(1949|1949) NOTICE: StaticDiscovery::parse_subscriberqos no [subscriberqos] sections.
(1949|1949) NOTICE: StaticDiscovery::parse_endpoints no [endpoint] sections.
(1949|1949) NOTICE: Service_Participant::intializeScheduling() - no scheduling policy specified, not setting policy.
(1949|1949) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1949|1949) Spdp::SpdpTransport::open_unicast_socket() - opened unicast socket on port 8410
 18:03:38.541663 (1949|1949) Service_Participant::network_config_monitor(). Creating LinuxNetworkConfigMonitor
(1949|1949) Spdp::SpdpTransport::join_multicast_group joining group eth0 239.255.0.1:8400
(1949|1949) ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 ACE_SOCK_Dgram_Mcast::join: Unknown error -5
(1949|1949) DirectPriorityMapper:thread_priority() - mapped TRANSPORT_PRIORITY value 0 to thread priority 0.
(1949|1949) RtpsUdpDataLink::join_multicast_group joining group eth0 239.255.0.2:7401:8402
(1949|1949) ERROR: RtpsUdpDataLink::join_multicast_group(): ACE_SOCK_Dgram_Mcast::join failed: Unknown error -5
(1949|1949) DomainParticipantImpl::enable: enabled participant 0103801f.12691597.079dabab.000001c1(e2ed001d) in domain 4
(1949|1949) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1949|1949) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(1949|1949) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.

ifconfig on the embedded device shows:

eth0      Link encap:Ethernet  HWaddr 80:1F:12:69:15:97
          inet6 addr: fe80::821f:12ff:fe69:1597%lo/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2559 errors:0 dropped:0 overruns:0 frame:0
          TX packets:742 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:171002 (166.9 KiB)  TX bytes:217192 (212.1 KiB)
          Interrupt:30 Base address:0xb000

eth0:avahi Link encap:Ethernet  HWaddr 80:1F:12:69:15:97
          inet addr:169.254.5.198  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:30 Base address:0xb000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1%1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:20 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1280 (1.2 KiB)  TX bytes:1280 (1.2 KiB)

As the 3.14 version explicitly shows that it's trying to join a group using eth0 (not eth0:avahi), I've tried modifying the embedded Linux device's network configuration to apply the 169.254.5.198 address as a static address on eth0 and restarting the eth0 device.

Now, when I run my 3.14 version, it works as it should; however, I need to use mDNS on the system I'm working on, so this may not be a long-term solution.

Is something known to have changed between 3.13.3 and 3.14 to have this effect and, if so, is there a configuration option (e.g. command line or rtps.ini file) I can use to overcome it?

I have tried explicitly specifying the SpdpLocalAddress in my rtps.ini, with no effect. I have also tried setting MulticastInterface in my rtps.ini to eth0:avahi, but with that setting the application just segfaults.

jmccabe commented 4 years ago

I've continued to look at this, and have found some interesting stuff using my debugger.

I've had a suggestion on the mailing list about adding a MulticastInterface=eth0:avahi entry into the discovery section of my rtps.ini, and also a multicast_interface=eth0:avahi into the transport section.
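For reference, this is how I read that suggestion: a sketch of my existing rtps.ini with the two suggested keys added, MulticastInterface in the discovery section and multicast_interface in the transport section (I may of course have the placement wrong):

[common]
DCPSGlobalTransportConfig=$file
DCPSDefaultDiscovery=TheRTPSConfig

[rtps_discovery/TheRTPSConfig]
SedpLocalAddress=169.254.5.198
MulticastInterface=eth0:avahi

[transport/the_rtps_transport]
transport_type=rtps_udp
multicast_interface=eth0:avahi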

The latter makes no difference, and the former results in a segmentation fault. In trying to debug the segmentation fault, I think it's a side-effect of the multicast join not working, but I've been able to follow through to find that Spdp::SpdpTransport::join_multicast_group() is called 3 times. Debugging this bit of code:

void
Spdp::SpdpTransport::join_multicast_group(const DCPS::NetworkInterface& nic,
                                          bool all_interfaces)
{
  if (joined_interfaces_.count(nic.name()) != 0 || nic.addresses.empty() || !nic.can_multicast()) {
    return;    <<<< line 2418
  }

  if (!multicast_interface_.empty() && nic.name() != multicast_interface_) {
    return;    <<<<< line 2422
  }

I follow through with the following values in nic.name():

1) lo - which returns from line 2418
2) eth0 - which returns from line 2422
3) sit0 - which returns from line 2418

ifconfig doesn't show sit0 at all (see previous comment), and join_multicast_group is never passed eth0:avahi!

This bit of code is being executed in Spdp.cpp, in the Spdp::SpdpTransport constructor:

DCPS::NetworkConfigMonitor_rch ncm = TheServiceParticipant->network_config_monitor();
if (ncm) {
  const DCPS::NetworkInterfaces nics = ncm->add_listener(*this);

  for (DCPS::NetworkInterfaces::const_iterator pos = nics.begin(), limit = nics.end(); pos != limit; ++pos) {
    join_multicast_group(*pos);
  }
} else {

which is where join_multicast_group() is being called from using the nics collection retrieved from the network_config_monitor(), so the network_config_monitor seems to think there are 3 nics, with names lo, eth0 and sit0.

Any ideas where sit0 is coming from, and why eth0:avahi isn't in there?

John

jmccabe commented 4 years ago

Further comments from the mailing list:

sit0 does get shown by ifconfig -a; it's a tunneling device for using IPV6 over an IPV4 connection.

I was asked to try the latest master branch, but that didn't work; the changes made to use ACE_INET_Addr objects for holding IP addresses don't work when passed an "xxx.yyy.zzz.aaa" string, because they use the ACE_INET_Addr::set() function which, when the string has no ":" in it, treats the whole thing as a port number! See #1717.
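As a quick illustration of what I mean, here's a minimal standalone sketch (my own test, not OpenDDS code; it assumes the ACE headers and library are available, and the exact result of the no-colon case may vary by ACE version):

#include <ace/INET_Addr.h>
#include <iostream>

int main()
{
  // No ':' in the string: ACE_INET_Addr::set() goes down the
  // port/service-name path instead of treating it as a host address.
  ACE_INET_Addr no_colon;
  no_colon.set("169.254.5.198");

  // Trailing ':' makes it parse as host[:port], so the address is kept.
  ACE_INET_Addr with_colon;
  with_colon.set("169.254.5.198:");

  std::cout << "no colon:   " << no_colon.get_host_addr() << ':'
            << no_colon.get_port_number() << '\n'
            << "with colon: " << with_colon.get_host_addr() << ':'
            << with_colon.get_port_number() << std::endl;
  return 0;
}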

mitza-oci commented 4 years ago

I was asked to try the latest master branch, but that didn't work; the changes made to use ACE_INET_Addr objects for holding IP addresses don't work when passed an "xxx.yyy.zzz.aaa" string, because they use the ACE_INET_Addr::set() function which, when the string has no ":" in it, treats the whole thing as a port number

What happens when you just append a : or patch the code to work around it?

jmccabe commented 4 years ago

Leaving rtps.ini as is and changing the command line to use -DCPSDefaultAddress 169.254.5.198: gives:

(4051|4051) ERROR: RtpsDiscovery::Config::discovery_config(): failed to parse SedpLocalAddress 169.254.5.198
(4051|4051) ERROR: Service_Participant::load_configuration load_discovery_configuration() returned -1
(4051|4051) ERROR: Service_Participant::get_domain_participant_factory: load_configuration() failed.
Segmentation fault

(As before, I think the segfault results from trying to continue after the DDS configuration hasn't worked properly, rather than directly from the DDS stuff).

This corresponds to the issues mentioned in #1717; there are a number of IP address parameter fields that use the same faulty parsing mechanism.

If I change the SedpLocalAddress in the rtps.ini to have a : after it (no MulticastInterface in discovery section, or multicast_interface in transport section):

(4155|4155) Spdp::SpdpTransport::open_unicast_socket() - opened unicast socket on port 8410
(4155|4155) DirectPriorityMapper:thread_priority() - mapped TRANSPORT_PRIORITY value 0 to thread priority 0.
(4155|4155) TransportReceiveStrategy-mb Cached_Allocator_With_Overflow 1e47f0 with 1000 chunks
(4155|4155) TransportReceiveStrategy-db Cached_Allocator_With_Overflow 1e48c4 with 100 chunks
(4155|4155) TransportReceiveStrategy-data Cached_Allocator_With_Overflow 1e4998 with 100 chunks
 21:44:05.198219 (4155|4155) Service_Participant::network_config_monitor(). Creating LinuxNetworkConfigMonitor
(4155|4158) RtpsUdpDataLink::join_multicast_group joining group 239.255.0.1:8402 on eth0
(4155|4155) TransportImpl::open()
   transport_type:               rtps_udp
   name:                         _OPENDDS__SEDPTransportInst_0103801f12691597103b32624
   queue_messages_per_pool:      10
   queue_initial_pools:          5
   max_packet_size:              2147481599
   max_samples_per_packet:       10
   optimum_packet_size:          4096
   thread_per_connection:        false
   datalink_release_delay:       10000
   datalink_control_chunks:      32
   local_address:                169.254.5.198:43969
   use_multicast:                true
   multicast_group_address:      239.255.0.1:8402
   multicast_interface:
   nak_depth:                    0
   max_bundle_size:              65446
   nak_response_delay:           200
   heartbeat_period:             1000
   heartbeat_response_delay:     500
   handshake_timeout:            150000
(4155|4155) DomainParticipantImpl::enable: enabled participant 0103801f.12691597.103b3262.000001c1(6e84135a) in domain 4
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) DDS::ParticipantBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 1f6758 with 20 chunks
(4155|4155) DataReaderImpl::enable Cached_Allocator_With_Overflow 1f6b40 with 20 chunks
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSParticipant) enabled
(4155|4158) ERROR: RtpsUdpDataLink::join_multicast_group(): ACE_SOCK_Dgram_Mcast::join failed: Unknown error -5
(4155|4158) RtpsUdpDataLink::join_multicast_group joining group 239.255.0.1:8402 on eth0
(4155|4155) DDS::TopicBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 1f9218 with 20 chunks
(4155|4155) DataReaderImpl::enable Cached_Allocator_With_Overflow 1f9f60 with 20 chunks
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSTopic) enabled
(4155|4155) DDS::PublicationBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 1fc648 with 20 chunks
(4155|4155) DataReaderImpl::enable Cached_Allocator_With_Overflow 1fd930 with 20 chunks
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSPublication) enabled
(4155|4158) ERROR: RtpsUdpDataLink::join_multicast_group(): ACE_SOCK_Dgram_Mcast::join failed: Unknown error -5
(4155|4155) DDS::SubscriptionBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 200000 with 20 chunks
(4155|4155) DataReaderImpl::enable Cached_Allocator_With_Overflow 201068 with 20 chunks
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSSubscription) enabled
(4155|4155) OpenDDS::DCPS::ParticipantLocationBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 203770 with 20 chunks
(4155|4155) DataReaderImpl::enable Cached_Allocator_With_Overflow 204058 with 20 chunks
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4155) SubscriberImpl::reader_enabled, datareader(topic_name=OpenDDSParticipantLocation) enabled
(4155|4160) Spdp::SpdpTransport::join_multicast_group joining group 239.255.0.1:8400 on eth0
(4155|4160) ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 on eth0: ACE_SOCK_Dgram_Mcast::join: Unknown error -5
(4155|4155) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4155|4160) Spdp::SpdpTransport::join_multicast_group joining group 239.255.0.1:8400 on eth0
(4155|4160) ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 on eth0: ACE_SOCK_Dgram_Mcast::join: Unknown error -5

(Same as before)

Adding MulticastInterface=eth0:avahi into the discovery section...

(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) Spdp::SpdpTransport::open_unicast_socket() - opened unicast socket on port 8410
(4240|4240) DirectPriorityMapper:thread_priority() - mapped TRANSPORT_PRIORITY value 0 to thread priority 0.
(4240|4240) TransportReceiveStrategy-mb Cached_Allocator_With_Overflow 1e4838 with 1000 chunks
(4240|4240) TransportReceiveStrategy-db Cached_Allocator_With_Overflow 1e490c with 100 chunks
(4240|4240) TransportReceiveStrategy-data Cached_Allocator_With_Overflow 1e49e0 with 100 chunks
 21:50:21.712336 (4240|4240) Service_Participant::network_config_monitor(). Creating LinuxNetworkConfigMonitor
(4240|4240) TransportImpl::open()
   transport_type:               rtps_udp
   name:                         _OPENDDS__SEDPTransportInst_0103801f1269159710909c884
   queue_messages_per_pool:      10
   queue_initial_pools:          5
   max_packet_size:              2147481599
   max_samples_per_packet:       10
   optimum_packet_size:          4096
   thread_per_connection:        false
   datalink_release_delay:       10000
   datalink_control_chunks:      32
   local_address:                169.254.5.198:51714
   use_multicast:                true
   multicast_group_address:      239.255.0.1:8402
   multicast_interface:          eth0:avahi
   nak_depth:                    0
   max_bundle_size:              65446
   nak_response_delay:           200
   heartbeat_period:             1000
   heartbeat_response_delay:     500
   handshake_timeout:            150000
(4240|4240) DomainParticipantImpl::enable: enabled participant 0103801f.12691597.10909c88.000001c1(6a5a3b2e) in domain 4
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) DDS::ParticipantBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 1f6768 with 20 chunks
(4240|4240) DataReaderImpl::enable Cached_Allocator_With_Overflow 1f6b50 with 20 chunks
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSParticipant) enabled
(4240|4240) DDS::TopicBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 1f9228 with 20 chunks
(4240|4240) DataReaderImpl::enable Cached_Allocator_With_Overflow 1f9f70 with 20 chunks
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSTopic) enabled
(4240|4240) DDS::PublicationBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 1fc658 with 20 chunks
(4240|4240) DataReaderImpl::enable Cached_Allocator_With_Overflow 1fd940 with 20 chunks
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSPublication) enabled
(4240|4240) DDS::SubscriptionBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 200010 with 20 chunks
(4240|4240) DataReaderImpl::enable Cached_Allocator_With_Overflow 201078 with 20 chunks
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) SubscriberImpl::reader_enabled, datareader(topic_name=DCPSSubscription) enabled
(4240|4240) OpenDDS::DCPS::ParticipantLocationBuiltinTopicDataDataReaderImpl::enable_specific-data Cached_Allocator_With_Overflow 203780 with 20 chunks
(4240|4240) DataReaderImpl::enable Cached_Allocator_With_Overflow 204068 with 20 chunks
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) SubscriberImpl::reader_enabled, datareader(topic_name=OpenDDSParticipantLocation) enabled
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4240|4240) kcema::csmidl::StateEventDataWriterImpl::enable_specific is unbounded data - allocate from heap
(4240|4240) kcema::csmidl::StateEventDataWriterImpl::enable_specific-mb Cached_Allocator_With_Overflow 20d3b0 with 200 chunks
(4240|4240) kcema::csmidl::StateEventDataWriterImpl::enable_specific-db Cached_Allocator_With_Overflow 20f3d0 with 20 chunks
(4240|4240) WriteDataContainer sample_list_element_allocator 20f99c with 20 chunks
(4240|4240) DataWriterImpl::enable-mb Cached_Allocator_With_Overflow 212518 with 20 chunks
(4240|4240) DataWriterImpl::enable-db Cached_Allocator_With_Overflow 214538 with 20 chunks
(4240|4240) DataWriterImpl::enable-header Cached_Allocator_With_Overflow 214960 with 20 chunks
(4240|4240) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
ACE_INET_Addr::ACE_INET_Addr: 169.254.5.198:: Unknown error -2
(4240|4240) TransportImpl::open()
   transport_type:               rtps_udp
   name:                         the_rtps_transport
   queue_messages_per_pool:      10
   queue_initial_pools:          5
   max_packet_size:              2147481599
   max_samples_per_packet:       10
   optimum_packet_size:          4096
   thread_per_connection:        false
   datalink_release_delay:       10000
   datalink_control_chunks:      32
   local_address:                0.0.0.0:51693
   use_multicast:                true
   multicast_group_address:      239.255.0.2:7401
   multicast_interface:
   nak_depth:                    0
   max_bundle_size:              65446
   nak_response_delay:           200
   heartbeat_period:             1000
   heartbeat_response_delay:     500
   handshake_timeout:            30000
(4240|4240) Sedp::write_publication_data - not currently associated, dropping msg.

This line's interesting: "ACE_INET_Addr::ACE_INET_Addr: 169.254.5.198:: Unknown error -2"; 2 x ":"?

Adding multicast_interface=eth0:avahi into the transport section instead of the previous change:

Back to:

(4312|4315) RtpsUdpDataLink::join_multicast_group joining group 239.255.0.1:8402 on eth0
...
(4312|4312) DataWriterImpl::enable-header Cached_Allocator_With_Overflow 2149c0 with 20 chunks
(4312|4312) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(4312|4317) Spdp::SpdpTransport::join_multicast_group joining group 239.255.0.1:8400 on eth0
ACE_INET_Addr::ACE_INET_Addr: 169.254.5.198:: Unknown error -2
(4312|4317) ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 on eth0: ACE_SOCK_Dgram_Mcast::join: Unknown error -5

I thought about patching the code, but it's an issue in multiple places and I don't think I'm currently well-placed enough to understand all of the places where it may be broken.

mitza-oci commented 4 years ago

Now that #1727 is merged, is this at least easier to test? If it's down to just "error -2" from joining a multicast group, the next step may be to verify with the debugger that the system calls being made by ACE are correct for your configuration. Does the output of "ip a" show that the interface with 169.254.5.198 has multicast?

jmccabe commented 4 years ago

Adam,

With #1717 solved, -DCPSDefaultAddress 169.254.5.198 works but, as mentioned in #1738, I then got an issue with SedpLocalAddress in the rtps.ini file. You've closed that one because the Dev Guide does say it needs a trailing ':' (something I've never seen expected in any other software!), although #1738 also points out that SpdpLocalAddress, which the Dev Guide specifically marks as "No Port", uses identical code, so it is going to fail too unless the Dev Guide is changed for that as well.

With the ':' in place, I'm back to:

(2486|2490) RtpsUdpDataLink::join_multicast_group joining group 239.255.0.1:8402 on eth0
(2486|2486) TransportImpl::open()
   transport_type:               rtps_udp
   name:                         _OPENDDS__SEDPTransportInst_0103801f1269159709b667d14
   queue_messages_per_pool:      10
   queue_initial_pools:          5
   max_packet_size:              2147481599
   max_samples_per_packet:       10
   optimum_packet_size:          4096
   thread_per_connection:        false
   datalink_release_delay:       10000
   datalink_control_chunks:      32
   local_address:                169.254.5.198:50590
   use_multicast:                true
   multicast_group_address:      239.255.0.1:8402
   multicast_interface:
   nak_depth:                    0
   max_bundle_size:              65446
   nak_response_delay:           200
   heartbeat_period:             1000
   heartbeat_response_delay:     500
   handshake_timeout:            150000
(2486|2486) DomainParticipantImpl::enable: enabled participant 0103801f.12691597.09b667d1.000001c1(ef9559e8) in domain 4
(2486|2486) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(2486|2486) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(2486|2486) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(2486|2486) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(2486|2486) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(2486|2486) Service_Participant::get_discovery: returning repository for domain 4, repo TheRTPSConfig.
(2486|2490) ERROR: RtpsUdpDataLink::join_multicast_group(): ACE_SOCK_Dgram_Mcast::join failed: Unknown error -5

Adding the eth0:avahi device as "MulticastInterface"/"multicast_interface", as before, stops any attempt at joining a multicast group.

Re your last question:

ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 80:1f:12:69:15:97 brd ff:ff:ff:ff:ff:ff
    inet 169.254.5.198/16 brd 169.254.255.255 scope link eth0:avahi
       valid_lft forever preferred_lft forever
    inet6 fe80::821f:12ff:fe69:1597/64 scope link
       valid_lft forever preferred_lft forever
3: sit0@NONE: mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0

jmccabe commented 4 years ago

FWIW - I've been looking at something related to this in my own code recently, trying to get MAC addresses and IP addresses using the ifaddrs interface in Linux. The eth0, sit0, and lo devices show up when I look for AF_PACKET devices, but not the eth0:avahi device. For AF_INET devices I see lo and eth0:avahi, but not eth0 or sit0, and the eth0:avahi device shows the 169.254.5.198 address.

lo       00:00:00:00:00:00
eth0     80:1f:12:69:15:97
sit0     00:00:00:00
Interface: lo   Address: 127.0.0.1
Interface: eth0:avahi   Address: 169.254.5.198 LINK LOCAL

From this code:

#include <arpa/inet.h>
#include <stdio.h>
#include <ifaddrs.h>
#include <sys/socket.h>
#include <netpacket/packet.h>

/* Enumerate interfaces with getifaddrs(): AF_PACKET entries report the MAC
 * per kernel device (lo, eth0, sit0); AF_INET entries report the IPv4
 * address per name, including alias names such as eth0:avahi. */
int getMacAddress()
{
    struct ifaddrs *ifaddr = NULL;
    struct ifaddrs *ifa = NULL;
    int i = 0;

    if (getifaddrs(&ifaddr) == -1)
    {
         perror("getifaddrs");
    }
    else
    {
        for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next)
        {
            if (ifa->ifa_addr)
            {
                if (ifa->ifa_addr->sa_family == AF_PACKET)
                {
                    struct sockaddr_ll *s = (struct sockaddr_ll*)ifa->ifa_addr;
                    printf("%-8s ", ifa->ifa_name);
                    for (i = 0; i < s->sll_halen; i++)
                    {
                        printf("%02x%c", (s->sll_addr[i]), (i + 1 != s->sll_halen) ? ':' : '\n');
                    }
                }
                if (ifa->ifa_addr->sa_family == AF_INET)
                {
                    struct sockaddr_in *s = (struct sockaddr_in *)ifa->ifa_addr;
                    char *addr = inet_ntoa(s->sin_addr);
                    printf("Interface: %s\tAddress: %s", ifa->ifa_name, addr);
                    /* 169.254.0.0/16 => link-local (APIPA) address */
                    if ((ntohl(s->sin_addr.s_addr) & 0xffff0000) == 0xa9fe0000)
                    {
                        printf(" LINK LOCAL\n");
                    }
                    else
                    {
                        printf("\n");
                    }
                }
            }
        }
        freeifaddrs(ifaddr);
    }

    return 0;
}

int main()
{
    getMacAddress();

    return 0;
}
jwillemsen commented 4 years ago

See https://github.com/DOCGroup/ACE_TAO/blob/master/ACE/tests/Enum_Interfaces_Test.cpp for an ACE unit test that lists all IP interfaces; maybe compile the ACE/tests directory and run it?

jmccabe commented 4 years ago

I copied some of the stuff into my application, as I couldn't be arsed to work out how to compile it on its own. This is the result:

Machine: petalinux running on armv7l
Platform: Linux, 4.14.0-xilinx-v2018.3, #1 SMP PREEMPT Thu Nov 28 22:32:55 UTC 2019
 there are 2 interfaces
        127.0.0.1
        169.254.5.198
 there are 2 IPv4 interfaces, and 0 IPv6 interfaces

Those addresses represent the lo and the eth0:avahi devices, but it doesn't actually tell you the interface name. I'll see if I can find that stuff.

jmccabe commented 4 years ago

Nope - don't really know my way round ACE, but can't see any obvious way to get the name of an interface from the IP addr, other than going down to the ifaddrs stuff which is basically the code I showed earlier. If you can see how to, please let me know.

Also, I'm sure I read something elsewhere about the NetworkConfigMonitor being changed to be optional; is that a build time configuration option?

I can see (in LinuxNetworkConfigMonitor::process_message):

RTM_NEWLINK -> for lo, index 1 -> add_interface is called
RTM_NEWLINK -> for eth0, index 2 -> add_interface is called
RTM_NEWLINK -> for sit0, index 3 -> add_interface is called
RTM_NEWADDR -> for 127.0.0.1, index 1 -> add_address called
RTM_NEWADDR -> for 169.254.5.198, index 2 -> add_address called
RTM_NEWADDR -> for 0.0.0.0, index 1 -> add_address called
RTM_NEWADDR -> for 0.0.0.0, index 2 -> add_address called

It seems that the code can't tell the difference between eth0, which doesn't have an IPv4 address, and eth0:avahi, which does.
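For what it's worth, the ip a output I posted earlier shows the label eth0:avahi against that address, and my understanding is that the RTM_NEWADDR message carries that label in an IFA_LABEL attribute even though the message itself is keyed by the interface index. This is a small standalone dump I used to check that; it's only a sketch using the stock rtnetlink headers, not OpenDDS code:

#include <arpa/inet.h>
#include <cstdio>
#include <cstring>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0) { std::perror("socket"); return 1; }

    // Ask the kernel to dump all IPv4 addresses (the same RTM_NEWADDR
    // messages that LinuxNetworkConfigMonitor::process_message handles).
    struct {
        nlmsghdr nh;
        ifaddrmsg ifa;
    } req;
    std::memset(&req, 0, sizeof req);
    req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(ifaddrmsg));
    req.nh.nlmsg_type = RTM_GETADDR;
    req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.ifa.ifa_family = AF_INET;
    if (send(fd, &req, req.nh.nlmsg_len, 0) < 0) { std::perror("send"); return 1; }

    char buf[16384];
    bool done = false;
    while (!done) {
        const ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n <= 0) break;
        int remaining = static_cast<int>(n);
        for (nlmsghdr* nh = reinterpret_cast<nlmsghdr*>(buf); NLMSG_OK(nh, remaining);
             nh = NLMSG_NEXT(nh, remaining)) {
            if (nh->nlmsg_type == NLMSG_DONE) { done = true; break; }
            if (nh->nlmsg_type != RTM_NEWADDR) continue;
            ifaddrmsg* ifa = static_cast<ifaddrmsg*>(NLMSG_DATA(nh));
            char addr[INET_ADDRSTRLEN] = "?";
            const char* label = "?";
            int alen = IFA_PAYLOAD(nh);
            for (rtattr* rta = IFA_RTA(ifa); RTA_OK(rta, alen); rta = RTA_NEXT(rta, alen)) {
                if (rta->rta_type == IFA_LOCAL)
                    inet_ntop(AF_INET, RTA_DATA(rta), addr, sizeof addr);
                else if (rta->rta_type == IFA_LABEL)
                    label = static_cast<const char*>(RTA_DATA(rta));
            }
            // The index identifies the kernel device (eth0); the label is the
            // alias name (eth0:avahi) that ifconfig shows.
            std::printf("index=%u label=%s addr=%s\n", ifa->ifa_index, label, addr);
        }
    }
    close(fd);
    return 0;
}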

I'm a bit confused at this point; is there a way to make this behave more like 3.13.3?

mitza-oci commented 3 years ago

You can try disabling LinuxNetworkConfigMonitor

#if (defined(ACE_LINUX) || defined(ACE_ANDROID)) && !defined(OPENDDS_SAFETY_PROFILE)

just make it #if 0

It would be good to know what's actually going wrong there. Can you list steps required to get a similar network configuration starting with a "stock" Debian/Ubuntu install?

jmccabe commented 3 years ago

Sorry for the delay in replying; I've been on holiday.

I will try your suggestion once I get the chance.

As far as "list the steps required" goes, that's easier said than done :-) I'm no Linux expert, and the way that alias popped up wasn't something I specifically chose :-)

I suspect the issue is related to something like I described in https://github.com/objectcomputing/OpenDDS/issues/1713#issuecomment-642652429; when you look for AF_PACKET devices you don't see the eth0:avahi alias but, when you look at AF_INET, you see the IP address associated with it but it seems to associate it with eth0 rather than eth0:avahi.

I will take a look round to see if I can work out a way to get something similar on Ubuntu.

jmccabe commented 3 years ago

Apologies for the further delay on this. Extremely tight timescales and issues with staff have left me no time to play around with these settings. It remains on my "to do" list.

jmccabe commented 3 years ago

Just a quick note: I've built 3.16 with the modification suggested in https://github.com/objectcomputing/OpenDDS/issues/1713#issuecomment-669203011 and it successfully runs the Messenger publisher with the subscriber running on a native Ubuntu build. While I wouldn't consider this enough to close this issue, it's a step in the right direction.

mitza-oci commented 1 year ago

Multicast support (with multiple interfaces) has been improved recently. Is this still a problem?

jmccabe commented 1 year ago

Interestingly, I was just talking to a colleague about this issue yesterday. Unfortunately we don't have the time or resources to be able to keep up-to-date with OpenDDS releases, so we're still running with 3.16. If we get the chance to try the latest version at some point, I will be sure to let you know if this is still, or is no longer a problem.

jrw972 commented 1 year ago

We believe this has been addressed in more recent versions of OpenDDS (3.23).

jmccabe commented 1 year ago

FWIW, we've recently started running 3.22 and, as I understand it (it was my colleague who tried it), we still have this issue in that version. Do you have any test results that prove this is fixed? If not, can I suggest you reopen it? It's possible that I may be in a position within the next few days to check this.

jrw972 commented 1 year ago

We do not have a way to reproduce this error, so we cannot definitively say that it is addressed. You will need to submit a PR with a test that demonstrates the problem.

jmccabe commented 1 year ago

I can't really do that; the testing I've done has been manual, and you need a target system that's ARM based with an ethn:avahi network device to highlight the problem. I don't know if there's a way to simulate that! However, I've built 3.23 today for the ARM target and tried the Messenger application. With multicast_interface=eth0:avahi in the RTPS section of the ini file, and MulticastInterface=eth0:avahi in the discovery section, I was able to get through the publisher and subscriber startups without any error showing (without both of those there are errors; 2 x errors if neither is in, or 1 x error if one or the other is in). I've also seen RTPS packets received in another machine connected to the same network. That looks promising. What I haven't been able to do, though, is get the publisher and subscriber to communicate with each other yet, but that might be to do with things that are awkward today, as I'm working from home and the system I'm using is remote.

I will be back in the office tomorrow and will try again with 'better' equipment, then report back.

jmccabe commented 1 year ago

Further information. I've tried building my applications using OpenDDS 3.23 built with the normal LinuxNetworkConfigMonitor included. As with the Messenger application, I was able to get the application to start, without showing the errors, by explicitly specifying the MulticastInterface and multicast_interface within the rtps.ini sections but, despite that, there was no end-to-end communication happening.

Commenting out the LinuxNetworkConfigMonitor, as described earlier, fixed that issue, so nothing appears to have changed to improve things from my point of view.

jrw972 commented 1 year ago

Just to reset...

The LinuxNetworkConfigMonitor uses a NETLINK socket to receive changes in network interfaces and addresses from the Linux kernel. These come in distinct messages, i.e., changes in interfaces come in one set of messages and changes in addresses come in a different set of messages. The process_message function does the heavy lifting. First, it maintains the set of interfaces in network_interface_map_ which records the name and whether or not that interface can multicast (RTM_NEWLINK and RTM_DELLINK). Second, it processes changes in addresses (RTM_NEWADDR and RTM_DELADDR) and combines this with the details of the network interface. These changes are passed to NetworkConfigMonitor via set, clear, remove_interface, and remove_address.

NetworkConfigMonitor publishes these on an Internal DDS topic (network_interface_address_topic_ in the Service Participant). The sample is a NetworkInterfaceAddress which contains the name of the interface, a flag indicating multicast, and the address. The Internal DDS topic is read by Spdp and the RtpsUdpTransport (for both Sedp and data). Upon reading a sample it dispatches into MulticastManager::process. MulticastManager::process does the inverse where it joins multicast groups on interfaces that it has not joined on.

MulticastManager::process has logging that shows the multicast group, network interface, and address for joining multicast groups. Thus, the first thing to do is try 3.23 without explicitly specifying the MulticastInterface and multicast_interface and collect the logs to see if they meet expectations. (Make sure log_level >= LogLevel::Info.) Also, does communication occur in this scenario, i.e., you are just trying to understand/get rid of the warnings?

Presumably, the interface names will match what you reported earlier: lo, eth0, and sit0. These are names coming from NETLINK socket and those are names that must be used when configuring. Configuring the multicast interface to "eth0:avahi" when using the LinuxNetworkConfigMonitor means that all of the updates from netlink will get dropped because none of the interfaces have the appropriate name. See the call to exclude_from_multicast in MulticastManager.cpp. Since all of the updates are ignored, no multicast groups will be joined and therefore RTPS discovery cannot proceed. I believe this matches what you reported and you can confirm 1) by checking that no multicast groups were joined and 2) by taking a packet capture and observing a lack of SPDP announcements.

As a next experiment, you could configure the multicast interfaces to "eth0". Based on your report, the MulticastManager should attempt to join multicast groups on this interface with the APIPA address. Speaking of, when the APIPA address is being used, does the device have a valid route? That is, joining a multicast group without a valid route may cause the join to fail. Does communication work in this scenario? Does a packet capture show SPDP announcements?

Assuming that things still aren't working, I would turn my attention to Service_Participant.cpp. Instead of the LinuxNetworkConfigMonitor, you could use the NetworkConfigModifier of the DefaultNetworkConfigMonitor.

jmccabe commented 1 year ago

Thank you for reopening this, and for a detailed description of how it's supposed to work. I suspect I've been through some of the code when trying to debug the issue when we first saw it, but will take note of what you've said.

I will try to respond more fully to your questions tomorrow, but it's probably worth me reiterating that, on our system, eth0 does not have an IPv4 address; it's an embedded Linux application on ARM, built with the Xilinx Petalinux toolchain. The whole system is self-contained and devices only use "zeroconf" settings, i.e. mDNS and link-local IP addresses, hence the use of Avahi which, with the default setup using avahi-autoipd (I believe), causes us to get this eth0:avahi pseudo-device (AIUI), which is the one that has an IP address (in the link-local range) and which is capable of multicast. eth0, as I mentioned, has no IPv4 address which, as far as I can tell, causes it to be incapable of multicast (?).

As far as logs are concerned, I will check again, but the ones I looked at on 3.23 are basically the same as the ones I provided when we started seeing this on 3.14.

I will respond again tomorrow, but one thing I was wondering was whether we may be able to provide you with representative hardware this problem occurs on. I'm assuming that, as it's a rare use case, you'd probably like to understand it and find out if you can produce a fix for it but, as it's not, I guess, one of your primary supported platforms (AIUI) you may not want to spend money on it :-)

The other thing I did also wonder was whether a configuration option could be provided to disable the Linux Network Config Monitor which would avoid the issue without us having to go in and hack the code every time there's an update.

jmccabe commented 1 year ago

@jrw972 Here are the logs from the publisher and subscriber using the Messenger DevGuideExample (DCPS). The rtps.ini file is also included, but that's the default version anyway for now. These were both run on ARM A9 systems running PetaLinux (Xilinx Zynq-7000 systems) with the OpenDDS stuff built by following the Raspberry Pi example that is (was?) on your website, but using the Zynq-7000 toolchain. The DCPSDebugLevel and DCPSTransportDebugLevel were both set to 10 (let me know if there is a more appropriate setting for either of them).

As you can see from the logs, there was no communication. In a few minutes I'll send equivalent logs using OpenDDS built with the LinuxNetworkConfigMonitor disabled/commented out.

The output from ifconfig -a on one end of the link (the publisher, in this case) is:

eth0      Link encap:Ethernet  HWaddr 80:1F:12:F3:C2:EB
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26844 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25251 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21287098 (20.3 MiB)  TX bytes:6458373 (6.1 MiB)
          Interrupt:31 Base address:0xb000

eth0:avahi Link encap:Ethernet  HWaddr 80:1F:12:F3:C2:EB
          inet addr:169.254.11.188  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:31 Base address:0xb000

eth1      Link encap:Ethernet  HWaddr 80:1F:12:69:6E:23
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:32 Base address:0xc000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1%1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:26 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1484 (1.4 KiB)  TX bytes:1484 (1.4 KiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The other end is basically the same, except there's no eth1 and the MAC/IP addresses are different, obviously.

defSubscriber.txt rtps.ini.txt defPublisher.txt

jmccabe commented 1 year ago

As mentioned, the logs from using OpenDDS 3.23 with the LinuxNetworkConfigMonitor disabled, clearly showing communication happening. modPublisher.txt modSubscriber.txt

jmccabe commented 1 year ago

In response to your question(s):

Thus, the first thing to do is try 3.23 without explicitly specifying the MulticastInterface and multicast_interface and collect the logs to see if they meet expectations. (Make sure log_level >= LogLevel::Info.) Also, does communication occur in this scenario, i.e., you are just trying to understand/get rid of the warnings?

My comment 2 above this one covers that; there's no communication; I'm not just trying to get rid of the warnings :-)

jmccabe commented 1 year ago

For this point:

Presumably, the interface names will match what you reported earlier: lo, eth0, and sit0. These are names coming from NETLINK socket and those are names that must be used when configuring. Configuring the multicast interface to "eth0:avahi" when using the LinuxNetworkConfigMonitor means that all of the updates from netlink will get dropped because none of the interfaces have the appropriate name. See the call to exclude_from_multicast in MulticastManager.cpp. Since all of the updates are ignored, no multicast groups will be joined and therefore RTPS discovery cannot proceed. I believe this matches what you reported and you can confirm 1) by checking that no multicast groups were joined and 2) by taking a packet capture and observing a lack of SPDP announcements.

Setting both multicast_interface=eth0:avahi and MulticastInterface=eth0:avahi results in the following logs and rtps packet capture (no communications). Note you need to ditch the gif extension on the pcapng file.

defPubEth0Avahi.txt defSubEth0Avahi.txt rtps.ini.Eth0Avahi.txt defCapEth0Avahi.pcapng.txt

jmccabe commented 1 year ago

On to:

As a next experiment, you could configure the multicast interfaces to "eth0". Based on your report, the MulticastManager should attempt to join multicast groups on this interface with the APIPA address. Speaking of, when the APIPA address is being used, does the device have a valid route? That is, joining a multicast group without a valid route may cause the join to fail. Does communication work in this scenario? Does a packet capture show SPDP announcements?

defPubEth0.txt defSubEth0.txt rtps.ini.Eth0.txt defCapEth0.pcapng.txt

jmccabe commented 1 year ago

As for this bit:

Assuming that things still aren't working, I would turn my attention to Service_Participant.cpp. Instead of the LinuxNetworkConfigMonitor, you could use the NetworkConfigModifier of the DefaultNetworkConfigMonitor.

o The NetworkConfigModifier was designed for users that want to manually control the multicast interfaces. For example, NETLINK is unavailable and changes in networking come in through a different channel.
o The DefaultNetworkConfigMonitor uses defaults to essentially ignore changes in networking. It will cause the MulticastManager to join on all interfaces.

Just looking to see how I would do this; Service_Participant.cpp includes this code:

#if defined OPENDDS_LINUX_NETWORK_CONFIG_MONITOR
    if (DCPS_debug_level >= 1) {
      ACE_DEBUG((LM_DEBUG,
                 "(%P|%t) Service_Participant::get_domain_participant_factory: Creating LinuxNetworkConfigMonitor\n"));
    }
    network_config_monitor_ = make_rch<LinuxNetworkConfigMonitor>(reactor_task_.interceptor());
#elif defined(OPENDDS_NETWORK_CONFIG_MODIFIER)
    if (DCPS_debug_level >= 1) {
      ACE_DEBUG((LM_DEBUG,
                 "(%P|%t) Service_Participant::get_domain_participant_factory: Creating NetworkConfigModifier\n"));
    }
    network_config_monitor_ = make_rch<NetworkConfigModifier>();
#else
    if (DCPS_debug_level >= 1) {
      ACE_DEBUG((LM_DEBUG,
                 "(%P|%t) Service_Participant::get_domain_participant_factory: Creating DefaultNetworkConfigMonitor\n"));
    }
    network_config_monitor_ = make_rch<DefaultNetworkConfigMonitor>();
#endif

This is intriguing.

OPENDDS_LINUX_NETWORK_CONFIG_MONITOR is #defined in LinuxNetworkConfigMonitor.h in this part:

#include "ace/config.h"

#if (defined(ACE_LINUX) || defined(ACE_ANDROID)) && !defined(OPENDDS_SAFETY_PROFILE)

#define OPENDDS_LINUX_NETWORK_CONFIG_MONITOR

The workaround @mitza-oci mentioned earlier is to comment that stuff out, i.e.

#include "ace/config.h"

//#if (defined(ACE_LINUX) || defined(ACE_ANDROID)) && !defined(OPENDDS_SAFETY_PROFILE)
#if 0

#define OPENDDS_LINUX_NETWORK_CONFIG_MONITOR

That's how OpenDDS was built for the modPublisher.txt/modSubscriber.txt logs attached earlier, which show:

(13712|13712) Service_Participant::get_domain_participant_factory: Creating DefaultNetworkConfigMonitor

Hence, it would appear that I don't have OPENDDS_NETWORK_CONFIG_MODIFIER defined since, if I had, I'd expect to see the output from here:

      ACE_DEBUG((LM_DEBUG,
                 "(%P|%t) Service_Participant::get_domain_participant_factory: Creating NetworkConfigModifier\n"));

OPENDDS_NETWORK_CONFIG_MODIFIER is defined in this bit of NetworkConfigModifier.h:

#include "ace/config.h"

// ACE_HAS_GETIFADDRS is not set on android but is available in API >= 24
#if ((!defined (ACE_LINUX) && defined(ACE_HAS_GETIFADDRS))  || (defined(ACE_ANDROID) && !defined ACE_LACKS_IF_NAMEINDEX)) && !defined(OPENDDS_SAFETY_PROFILE)

#define OPENDDS_NETWORK_CONFIG_MODIFIER

That seems a substantially more complex pre-processor directive than the one in LinuxNetworkConfigMonitor.h.

My hope is that we can get it to work with the LinuxNetworkConfigMonitor, preferably without specifying the multicast interface explicitly. The key thing to understand is whether joining the multicast group is succeeding with a warning or actually failing. Thus, a fix might be to change the logic to consider certain error codes as success. To test this, you can ignore the return value of multicast_socket.join in MulticastManager.cpp and always succeed. If you see SPDP announcements, then the error is erroneous (pun intended).

LOL 😜

jrw972 commented 1 year ago

Thanks for the logs. The packet captures did not attach, so we may have to find a workaround.

We can't rule out the spurious error yet. If you can get packet capture for the Messenger example or an equivalent setup, just look for multicast RTPS packets (probably SPDP announcements). The error prevents registering the handler for receiving, but it should not prevent sending. Thus, if the packets are still being sent, then the error is probably incorrect. If you don't want to go the packet capture route, just ignore the return value of multicast_socket.join in MulticastManager.cpp and register the input handler anyway. If you get data, it proves the error can be ignored.

Using eth0 instead of eth0:avahi was the correct thing to do, i.e., the LinuxNetworkConfigMonitor at least attempted to join the multicast group.

The Unknown error -5 is intriguing. Everything you have reported up to this point indicates that joining a multicast group specifically on eth0 fails with this error. However, joining a group on lo succeeds (from the Messenger logs). It is strange that the address reported for lo is 0.0.0.0 instead of 127.0.0.1. If possible, use a debugger to step into multicast_socket.join in MulticastManager.cpp and see where the error is coming from. Possibilities include:

  1. The OS is buggy, i.e., the system call was legit but it errored anyway. You may be able to see what is going on with strace if you have that available.
  2. The call into the operating system was buggy. In this case, ACE did not prepare or handle the arguments correctly. Again, strace may help here.
  3. There is a bug in ACE preventing the call. (strace will show no call into the OS).
  4. There is a bug in ACE that errors after a successful call. (strace will show a successful call.)
  5. There is general disagreement about interface names. That is, according to NETLINK, the Linux kernel says the interface eth0 has an IP address and can multicast. This name is supplied to ACE, which presumably uses it to invoke the kernel. However, ACE or the kernel may choke on eth0. Looking at the ACE source code, it uses the interface name in a SIOCGIFADDR ioctl to get the IP for the interface (see the sketch after this list). Conversely, if you don't specify the network interface, ACE uses get_ip_interfaces or getaddrinfo to loop through the interfaces. If there was inconsistency here, then I would expect it to not work when specifying multicast interfaces and work when not specifying them (by disabling the LinuxNetworkConfigMonitor), which is what you have demonstrated.
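To make point 5 concrete, here is a minimal standalone sketch of what a name-based IPv4 multicast join typically reduces to on Linux: a SIOCGIFADDR ioctl to turn the interface name into an address, then an IP_ADD_MEMBERSHIP setsockopt. This is not the ACE code path itself, just an approximation to compare against what strace or the debugger shows; the interface names and the 239.255.0.1 group are taken from this thread:

#include <arpa/inet.h>
#include <cstdio>
#include <cstring>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Join 'group' on the interface named 'ifname': SIOCGIFADDR to get the
// interface's IPv4 address, then IP_ADD_MEMBERSHIP with that address as
// imr_interface.
static int join_on(const char* ifname, const char* group)
{
    const int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { std::perror("socket"); return -1; }

    ifreq ifr;
    std::memset(&ifr, 0, sizeof ifr);
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFADDR, &ifr) < 0) {
        // Typically fails if this *name* has no IPv4 address (e.g. bare eth0
        // when the address lives on the eth0:avahi alias).
        std::perror(ifname);
        close(fd);
        return -1;
    }
    const in_addr if_addr = reinterpret_cast<sockaddr_in*>(&ifr.ifr_addr)->sin_addr;

    ip_mreq mreq;
    std::memset(&mreq, 0, sizeof mreq);
    inet_pton(AF_INET, group, &mreq.imr_multiaddr);
    mreq.imr_interface = if_addr;
    const int r = setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);
    if (r < 0) std::perror("IP_ADD_MEMBERSHIP");
    else std::printf("%s: joined %s via %s\n", ifname, group, inet_ntoa(if_addr));
    close(fd);
    return r;
}

int main()
{
    join_on("eth0", "239.255.0.1");        // kernel device name (as reported by NETLINK)
    join_on("eth0:avahi", "239.255.0.1");  // alias name that actually carries the address
    return 0;
}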

So, if you can trace it in a debugger, I think that's going to concretely point out the problem.

Since the learnings from that may or may not turn into something that is easily fixed, I'll discuss the possibility of enabling/disabling the LinuxNetworkConfigMonitor and NetworkConfigModifier via configuration.

jmccabe commented 1 year ago

Oh! I still have the packet capture files, so will try attaching them again on Monday. I gave them a gif extension, so maybe GitHub decided to check whether it really was a gif!

Thanks for the other suggestions.

jmccabe commented 1 year ago

@jrw972 I've re-uploaded the packet captures, with a .txt extension this time (which means they haven't been put into the user-images space :-) ). Hopefully that will work better. I haven't had a chance to run the debugger with this yet, partly because I'd built OpenDDS for the target with --no-debug and --optimize, so don't know how useful it would've been :-) I've rebuilt it without those now, but will let you know if/when I get a chance to use them.