RobotLocomotion / drake-ros

Experimental prototyping (for now)
Apache License 2.0

ROS_DOMAIN_ID, ROS_LOCALHOST_ONLY: Unexpected behaviors? #103

Open EricCousineau-TRI opened 2 years ago

EricCousineau-TRI commented 2 years ago

I can reproduce Shane's results from #98 (but replacing bazel run //ros2_example_apps:oracle_cc with bazel-bin/ros2_example_apps/oracle_cc). However, I have run into odd edge cases.

First, I confirm that ros2 multicast send and ros2 multicast receive indicate a working UDP multicast route between the two machines in both directions (see #104 for debugging). I also confirm that I have no ROS variables set in my nominal environment (env | grep ROS is empty).

To build, I use 8c20da4, and run:

cd ros2_example_bazel_installed
bazel build //ros2_example_apps:oracle_cc @ros2//:ros2

For checking crosstalk, I see if ros2 node list indicates that /oracle is present, as Shane did.

Concerns

  1. Why does ROS_DOMAIN_ID=0 not work, locally or remotely, for ROS_LOCALHOST_ONLY=0? This contrasts with documentation that seems to indicate ROS_DOMAIN_ID=0 is valid: https://docs.ros.org/en/humble/Concepts/About-Domain-ID.html#choosing-a-domain-id-short-version
  2. Why do I still receive on a machine with ROS_LOCALHOST_ONLY=1?

I don't believe we need to solve these now, but ideally we should solve them within the next month.

This is painful to test manually, so I'd recommend we use Python + subprocess + ssh to automate the following checks:
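A minimal sketch of how one such check might be automated, assuming the ros2 binary path used elsewhere in this thread and that the remote shell starts in the workspace directory. The helper names (build_command, node_visible, check_crosstalk) and the host argument are hypothetical and not part of drake-ros.

```python
import shlex
import subprocess

# Assumption: same binary path as in the build step above; adjust per machine.
ROS2 = "bazel-bin/external/ros2/ros2"


def build_command(ros_args, env_overrides, host=None, ros2=ROS2):
    """Return argv running `ros2 <ros_args>` with env overrides, optionally over ssh."""
    assignments = [f"{key}={value}" for key, value in sorted(env_overrides.items())]
    local = ["env", *assignments, ros2, *ros_args]
    if host is None:
        return local
    # ssh hands one string to the remote shell, so quote each piece.
    return ["ssh", host, " ".join(shlex.quote(part) for part in local)]


def node_visible(node_list_output, node="/oracle"):
    """True if `node` appears as its own line in `ros2 node list` output."""
    return node in [line.strip() for line in node_list_output.splitlines()]


def check_crosstalk(env_overrides, host=None):
    """Run `ros2 node list` (locally or on `host`) and report whether /oracle is seen."""
    argv = build_command(["node", "list"], env_overrides, host)
    result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    return node_visible(result.stdout)
```

For example, check_crosstalk({"ROS_LOCALHOST_ONLY": 0, "ROS_DOMAIN_ID": 1}, host="other-machine") would run the listing on a second machine, where "other-machine" is a placeholder hostname.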

FYI @sloretz @IanTheEngineer

sloretz commented 2 years ago

I think the results can be explained by the ros2 CLI reusing a daemon despite a different ROS_LOCALHOST_ONLY setting. I'll look more closely at ros2cli to see if that's the case.

Why does ROS_DOMAIN_ID=0 not work, locally or remotely, for ROS_LOCALHOST_ONLY=0? This contrasts with documentation that seems to indicate ROS_DOMAIN_ID=0 is valid: https://docs.ros.org/en/humble/Concepts/About-Domain-ID.html#choosing-a-domain-id-short-version

ROS_DOMAIN_ID=0 is definitely valid.

I'm having trouble following the formatting, so I'll try giving the tests numbers. Do I understand correctly that in each test you run oracle_cc on one machine, and ros2 node list on another? Were the tests run in the order listed?

[Test 1] By default, I do not see cross-talk traffic:

What I think is happening:

Assuming this test was run first, the referenced commit does set ROS_LOCALHOST_ONLY=1 by default. Assuming no other daemon was running, the ros2 node list created one with ROS_LOCALHOST_ONLY=1 and ROS_DOMAIN_ID=0. I believe there is one daemon created per ROS_DOMAIN_ID, but from your test results, I suspect that it doesn't create a new daemon when ROS_LOCALHOST_ONLY changes.
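To make the suspicion concrete, here is a toy model of the hypothesized behavior. It is not ros2cli code; the DaemonRegistry class and its fields are invented purely for illustration, and the single assumption it encodes is that the daemon is looked up by ROS_DOMAIN_ID alone.

```python
class DaemonRegistry:
    """Toy model: one daemon per ROS_DOMAIN_ID, ignoring ROS_LOCALHOST_ONLY."""

    def __init__(self):
        self._daemons = {}

    def get_or_start(self, domain_id, localhost_only):
        # Hypothesis under test: the lookup key is the domain ID only, so a
        # later call with a different ROS_LOCALHOST_ONLY reuses the old daemon.
        if domain_id not in self._daemons:
            self._daemons[domain_id] = {
                "domain_id": domain_id,
                "localhost_only": localhost_only,
            }
        return self._daemons[domain_id]


registry = DaemonRegistry()
first = registry.get_or_start(0, localhost_only=True)    # like Test 1
second = registry.get_or_start(0, localhost_only=False)  # like Test 3
third = registry.get_or_start(1, localhost_only=False)   # like Test 5
```

In this model, second is the same daemon as first and still has localhost_only=True, while third is a fresh daemon with the requested setting, which would match the observed results.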

[Test 2] Using ROS_LOCALHOST_ONLY=1:

What I think is happening:

Same as the first test: the expected result is no inter-machine communication.

[Test 3] Using ROS_LOCALHOST_ONLY=0:

What I think is happening:

I think the intuitive result would be inter-machine communication happens, but if the ros2 node list command reused the daemon then it would explain why you saw no inter-machine communication.

[Test 4] On top of that, I don't see local traffic when using ROS_LOCALHOST_ONLY=1 with ROS_DOMAIN_ID=0. Unclear why.

I'm not sure what commands were run for this one.

[Test 5] Using ROS_LOCALHOST_ONLY=0: [...] I see cross-talk traffic if ROS_DOMAIN_ID=1 ROS_DOMAIN_ID=1

What I think is happening:

The ros2 CLI will definitely make a new daemon here, so inter machine communication here is expected.

EricCousineau-TRI commented 2 years ago

Thanks! Yes, that matches what we have experienced. For point (1), this can be concretely reproduced with something like:

bazel-bin/external/ros2/ros2 daemon stop
env ROS_LOCALHOST_ONLY=0 bazel-bin/external/ros2/ros2 daemon start
# then run normal commands, with implied or explicit ROS_LOCALHOST_ONLY=1, and see the discrepancy appear
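The same reproduction could be scripted roughly as follows. This is a sketch, assuming ros2 node list as the "normal command" and the binary path shown above; reproduction_steps and run_steps are hypothetical helper names.

```python
import os
import subprocess

# Assumption: same binary path as in the commands above.
ROS2 = "bazel-bin/external/ros2/ros2"


def reproduction_steps(ros2=ROS2):
    """Return (env overrides, argv) pairs mirroring the manual steps."""
    return [
        ({}, [ros2, "daemon", "stop"]),
        # Start the daemon with ROS_LOCALHOST_ONLY=0 ...
        ({"ROS_LOCALHOST_ONLY": "0"}, [ros2, "daemon", "start"]),
        # ... then query it with ROS_LOCALHOST_ONLY=1 to expose the mismatch.
        ({"ROS_LOCALHOST_ONLY": "1"}, [ros2, "node", "list"]),
    ]


def run_steps(steps):
    """Run each step with its overrides layered on the current environment."""
    outputs = []
    for overrides, argv in steps:
        env = {**os.environ, **overrides}
        result = subprocess.run(argv, env=env, capture_output=True, text=True)
        outputs.append(result.stdout)
    return outputs
```

Calling run_steps(reproduction_steps()) from the workspace directory returns the captured stdout of each step; the last entry is the node list to inspect for /oracle.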

Two suggestions for improvement:

EricCousineau-TRI commented 2 years ago

Ah, ~one~ three additional suggestions:

cc @cottsay

sloretz commented 2 years ago

Add a ros2 --no-daemon [...] option for additional debugging (beyond stopping / restarting / hash checks), with documentation stating that it will be slower

This one exists, but in a different place. Commands that use the daemon offer a --no-daemon option and a --spin-time option, which sets how many seconds to wait for discovery:

ros2 node list --no-daemon --spin-time 1

$ ros2 node list --help
[...]
  --spin-time SPIN_TIME
                        Spin time in seconds to wait for discovery (only applies when not using an already running daemon)
[...]
  --no-daemon           Do not spawn nor use an already running daemon

EricCousineau-TRI commented 2 years ago

Gotcha! Is it easy to tell which commands need it? (and how many?)

A (dumb?) suggestion is to hoist the daemon arguments to top-level, even if unused; then users can easily know they're disabling it with an alias / wrapper for ros2.
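Until something like that exists, a wrapper could add the flags itself. A rough sketch follows; with_no_daemon is a hypothetical helper, and the set of daemon-using commands is left to the caller because this thread only confirms the option for ros2 node list.

```python
def with_no_daemon(argv, daemon_commands, spin_time=None):
    """Append --no-daemon (and optionally --spin-time) to known daemon commands.

    `argv` excludes the ros2 executable itself, e.g. ["node", "list"].
    `daemon_commands` is a caller-supplied set of (command, verb) pairs.
    """
    if tuple(argv[:2]) not in daemon_commands or "--no-daemon" in argv:
        return list(argv)
    extra = ["--no-daemon"]
    if spin_time is not None:
        extra += ["--spin-time", str(spin_time)]
    return [*argv, *extra]


# Only `node list` is confirmed above; extend this set after checking --help.
KNOWN_DAEMON_COMMANDS = {("node", "list")}
```

With these definitions, with_no_daemon(["node", "list"], KNOWN_DAEMON_COMMANDS, spin_time=1) yields the arguments for ros2 node list --no-daemon --spin-time 1, and unknown commands pass through unchanged.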


Side note: Does our rmw_isolation still isolate correctly, even if a daemon is invoked? Concretely, we do things to DDS that are not expressible by ROS_DOMAIN_ID, but the daemon seems to get located by the domain itself?

EricCousineau-TRI commented 1 year ago

Related to #99, it may be good for users of Ubuntu (and other systems of similar config?) to use the startup script as Shane illustrated in https://github.com/eclipse-cyclonedds/cyclonedds/issues/1400