DDS Router won't communicate between networks

BenChung commented 5 months ago

I have a test setup of four containers; two are running the image router-base derived from the dockerfile

FROM ddsrouter
RUN apt update && apt install -y tcpdump

and the other two are derived from node-base,

FROM ros:rolling
RUN apt update && apt install -y tcpdump ros-rolling-demo-nodes-cpp

I then orchestrate them using the following Docker compose file:

version: "3.9"

networks:
  sideA:
    ipam:
      driver: default
      config:
        - subnet: "172.238.1.0/24"
  sideB:
    ipam:
      driver: default
      config:
        - subnet: "172.238.2.0/24"
services:
  node:
    image: node-base
    build:
      context: .
      dockerfile: node-base.dockerfile
    stdin_open: true
    tty: true
    profiles: ["run"]
  node-dev:
    extends: node
    volumes:
      # Mount the source code
      - ./dumps:/dumps
    command: 
            - /bin/bash
            - -c
            - |
              tcpdump -w /dumps/failure/client1.pcap &
              source /ros_entrypoint.sh && ros2 run demo_nodes_cpp listener
    environment:
      - ROS_DISCOVERY_SERVER=internal-router:11811
    profiles: ["good", "bad"]
    networks:
      sideA:
        ipv4_address: 172.238.1.3
  internal-router:
    image: ddsrouter-base
    build:
      context: .
      dockerfile: router-base.dockerfile
    command: 
            - /bin/bash
            - -c
            - |
              tcpdump -w /dumps/failure/router1.pcap &
              source ./install/setup.bash
              ddsrouter --config-path /config/config.yaml -d
    volumes:
      - ./router/:/config
      - ./dumps:/dumps
    ports:
      - 11188:11188/udp
    profiles: ["good", "bad"]
    networks:
      sideA:
        ipv4_address: 172.238.1.2
  node-dev2:
    extends: node
    volumes:
      - ./dumps:/dumps
    command: 
            - /bin/bash
            - -c
            - |
              tcpdump -w /dumps/failure/client2.pcap &
              sleep 5
              source /ros_entrypoint.sh && ros2 run demo_nodes_cpp talker
    environment:
      - ROS_DISCOVERY_SERVER=router2:11811
    profiles: ["good"]
    networks:
      sideA:
        ipv4_address: 172.238.1.13
  router2:
    image: ddsrouter-base
    command: 
            - /bin/bash
            - -c
            - |
              tcpdump -w /dumps/failure/router2.pcap &
              source ./install/setup.bash
              ddsrouter --config-path /config/config2.yaml -d
    volumes:
      - ./router/:/config
      - ./dumps:/dumps
    ports:
      - 30002:30002/tcp
      - 11166:11166/tcp
    profiles: ["good"]
    networks:
      sideA:
        ipv4_address: 172.238.1.12
  node-dev2-bad:
    extends: node
    volumes:
      - ./dumps:/dumps
    command: 
            - /bin/bash
            - -c
            - |
              tcpdump -w /dumps/failure/client2.pcap &
              sleep 5
              source /ros_entrypoint.sh && ros2 run demo_nodes_cpp talker
    environment:
      - ROS_DISCOVERY_SERVER=router2:11811
    profiles: ["bad"]
    networks:
      sideB:
        ipv4_address: 172.238.2.3
  router2-bad:
    image: ddsrouter-base
    command: 
            - /bin/bash
            - -c
            - |
              tcpdump -w /dumps/failure/router2.pcap &
              source ./install/setup.bash
              ddsrouter --config-path /config/config2.yaml -d
    volumes:
      - ./router/:/config
      - ./dumps:/dumps
    ports:
      - 30002:30002/tcp
      - 11166:11166/tcp
    profiles: ["bad"]
    networks:
      sideB:
        ipv4_address: 172.238.2.2
        aliases:
          - router2

using configs

config.yaml:

version: v4.0
specs:
  discovery-trigger: any
participants:
  - name: LocalDiscoveryServer
    kind: local-discovery-server
    discovery-server-guid:
      ros-discovery-server: true
      id: 0
    listening-addresses:
      - ip: 0.0.0.0
        port: 11811
        transport: udp
  - name: LocalWAN
    kind: wan
    connection-addresses:
      - domain: host.docker.internal        # Public IP of server
        port: 11166                     # server port
        transport: tcp                  # Transport protocol - tcp so that we don't need a back IP addy
  - name: EchoParticipant
    kind: echo
    discovery: true
    data: true
    verbose: true

and config2.yaml

version: v4.0
specs:
  discovery-trigger: any
participants:
  - name: LocalDiscoveryServer2
    kind: local-discovery-server
    discovery-server-guid:
      ros-discovery-server: true
      id: 0
    listening-addresses:
      - ip: 0.0.0.0     
        port: 11811
        transport: udp

  - name: LocalWAN2
    kind: wan
    listening-addresses:
      - domain: 0.0.0.0        # Public IP of server
        port: 11166                     # server port
        transport: tcp                  # Transport protocol - tcp so that we don't need a back IP addy

  - name: EchoParticipant
    kind: echo
    discovery: true
    data: true
    verbose: true

If I bring the ensemble up with docker compose --profile good up, everything works:

ros-flyer-node-dev2-1        | [INFO] [1710402041.253293400] [talker]: Publishing: 'Hello World: 6'
ros-flyer-router2-1          | In Endpoint: 01.0f.e0.c2.39.00.f3.57.00.00.00.00|0.0.3.3 from Participant: LocalDiscoveryServer2 in topic: rt/rosout payload received: Payload{00 01 00 00 f9 a9 f2 65 58 f3 18 0f 14 00 00 00 07 00 00 00 74 61 6c 6b 65 72 00 00 1d 00 00 00 50 75 62 6c 69 73 68 69 6e 67 3a 20 27 48 65 6c 6c 6f 20 57 6f 72 6c 64 3a 20 36 27 00 00 00 00 18 00 00 00 2e 2f 73 72 63 2f 74 6f 70 69 63 73 2f 74 61 6c 6b 65 72 2e 63 70 70 00 0b 00 00 00 6f 70 65 72 61 74 6f 72 28 29 00 00 2f 00 00 00} with specific qos: SpecificEndpointQoS{Partitions{};OwnershipStrength{0}}.
ros-flyer-router2-1          | In Endpoint: 01.0f.e0.c2.39.00.f3.57.00.00.00.00|0.0.14.3 from Participant: LocalDiscoveryServer2 in topic: rt/chatter payload received: Payload{00 01 00 00 0f 00 00 00 48 65 6c 6c 6f 20 57 6f 72 6c 64 3a 20 36 00 00} with specific qos: SpecificEndpointQoS{Partitions{};OwnershipStrength{0}}.
ros-flyer-internal-router-1  | In Endpoint: 01.0f.45.64.01.00.9f.e7.00.00.00.00|0.0.23.3 from Participant: LocalWAN in topic: rt/chatter payload received: Payload{00 01 00 00 0f 00 00 00 48 65 6c 6c 6f 20 57 6f 72 6c 64 3a 20 36 00 00} with specific qos: SpecificEndpointQoS{Partitions{};OwnershipStrength{0}}.
ros-flyer-node-dev-1         | [INFO] [1710402041.255290700] [listener]: I heard: [Hello World: 6]

but if I bring it up with the other router and client on a different virtual network (sideB) using docker compose --profile bad up, then it doesn't work:

ros-flyer-node-dev2-bad-1    | [INFO] [1710402141.088181800] [talker]: Publishing: 'Hello World: 2'
ros-flyer-router2-bad-1      | In Endpoint: In Endpoint: 01.0f.35.db.39.00.9f.a6.00.00.00.00|01.0f.35.db.39.00.9f.a6.00.00.00.00|0.0.3.3 from Participant: LocalDiscoveryServer2 in topic: rt/rosout0.0.14.3 payload received: Payload{00 01 from Participant:  00 00 5d LocalDiscoveryServer2aa f2 65 28  in topic: rt/chatter payload received: 8cPayload{ 41 05 1400 01 00  00 0f 00 0000  0000  4800 07 00 00  6500  6c74  6c 6f 20 57 6f61  6c 6b72 65 72 00 00 1d 00 000 6c 64 3a 20  0032  5000  75 62 6c 69 73000 }68 69 6e 67 with specific qos:  SpecificEndpointQoS{Partitions{}3a ;20 27 48 65 6c 6cOwnershipStrength{ 6f 20 57 6f0 }}.
ros-flyer-router2-bad-1      | 72 6c 64 3a 20 32 27 00 00 00 00 18 00 00 00 2e 2f 73 72 63 2f 74 6f 70 69 63 73 2f 74 61 6c 6b 65 72 2e 63 70 70 00 0b 00 00 00 6f 70 65 72 61 74 6f 72 28 29 00 00 2f 00 00 00} with specific qos: SpecificEndpointQoS{Partitions{};OwnershipStrength{0}}.

The tcpdump data that's generated shows that in both cases the routers are regularly communicating via TCP in patterns that are very similar. However, they don't appear to be cross-publishing the talker messages and thus when on different networks the clients aren't able to communicate.

BenChung commented 5 months ago

Okay, I figured out a slice of the problem. The initial peers discovery method is first used to handshake (successfully) between the two DDS Router instances. After that, the server's connection-addresses locator is used for communication instead of the initial peers domain. I'd much prefer that the initial peers value keep being used as the locator after discovery, rather than the discovered locator, since the server may be reachable through a variety of interfaces (for example, on different IPs inside the subnet as well as on an externally facing IP visible to the wider internet).

juanlofer-eprosima commented 5 months ago

Hi @BenChung ,

I am not exactly sure what your use case is, or why you are using this configuration. However, I'm going to point out a few things I find odd; hopefully that sheds some light on the matter.

Be careful with the discovery-trigger: any option; it might result in endpoints not matching properly due to QoS incompatibilities. I suggest using the default value (discovery-trigger: reader).
I suggest getting rid of the local discovery server participants and using simple ones instead (if multicast is available in your setting), just to simplify the scenario.
When the domain tag is used, a DNS domain is expected, not an IP. I don't know whether this is causing issues (it could actually be treated as an IP due to implementation details; I'd need to verify).
We never use 0.0.0.0 IPs in our configurations. It might actually work but, as I said, it's not tested on our side. I suggest taking advantage of the Docker Compose DNS service and setting the domains to service names.
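
Taken together, the four suggestions above might look roughly like the following when applied to config2.yaml. This is an untested sketch, not a confirmed fix. The simple participant (kind: simple, domain: 0) and the use of the Compose alias router2 as the listening domain are assumptions based on the suggestions. With a simple participant, the ROS 2 nodes would also need to stop setting ROS_DISCOVERY_SERVER.

```yaml
version: v4.0
specs:
  discovery-trigger: reader         # default value, instead of "any"
participants:
  # Assumption: a simple (multicast-based) participant replaces the
  # local discovery server; domain 0 is the ROS 2 default domain.
  - name: SimpleParticipant2
    kind: simple
    domain: 0

  - name: LocalWAN2
    kind: wan
    listening-addresses:
      - domain: router2             # assumption: Compose service name or alias, not 0.0.0.0
        port: 11166
        transport: tcp

  - name: EchoParticipant
    kind: echo
    discovery: true
    data: true
    verbose: true
```

Note that a Compose service name only resolves for containers that share a network with that service, so a router on another network still needs an address it can actually reach.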

Regards

BenChung commented 5 months ago

Hi, and thank you for the help! I was trying the discovery-trigger: any option as a "sticks against the wall" debugging approach.

The proximate issue keeping this from working was the 0.0.0.0 IPs. I'd really like one side of this (call it the "server side") to use 0.0.0.0 or a similar IP so that it doesn't have to be aware of the ingress approach. In the final configuration it's reachable under several different ports, IPs, and domain names, and it would be nice if we didn't have to nail that down to a finite list.

As far as I can tell, what happens right now is that the WAN participants, one of which is set to 0.0.0.0, start communicating via initial peers, but once discovery(?) information has been exchanged, the other side uses the domain or IP from the locator announced by its peer. For a 0.0.0.0 IP, this defaults to the system's interface addresses, which really doesn't work in my setup. What I'd like is for the WAN participants to keep communicating over the connection (IP/domain and port) originally specified in the connecting side's configuration. That would let me set up the "overall" server to be ignorant of how it's connected to (k8s ingress, direct pod-to-pod addressing, a proxy, etc.).

I can make a more specific bug report or feature request along these lines, but I suspect that what I describe is sufficiently alien to the locator model that it's hard to realize.

eProsima / DDS-Router

DDS Router won't communicate between networks #439