Open EricCousineau-TRI opened 2 years ago
I think the results can be explained by the ros2 CLI reusing a daemon despite a different ROS_LOCALHOST_ONLY setting. I'll look more closely at ros2cli to see if that's the case.
Why does ROS_DOMAIN_ID=0 not work, locally or remotely, for ROS_LOCALHOST_ONLY=0? This contrasts with documentation that seems to indicate ROS_DOMAIN_ID=0 is valid: https://docs.ros.org/en/humble/Concepts/About-Domain-ID.html#choosing-a-domain-id-short-version
ROS_DOMAIN_ID=0 is definitely valid.
I'm having trouble following the formatting, so I'll try giving the tests numbers. Do I understand correctly that in each test you run oracle_cc on one machine, and ros2 node list on another? Were the tests run in the order listed?
[Test 1] By default, I do not see cross-talk traffic:
What I think is happening:
Assuming this test was run first, the referenced commit does set ROS_LOCALHOST_ONLY=1 by default.
Assuming no other daemon was running, the ros2 node list command created one with ROS_LOCALHOST_ONLY=1 and ROS_DOMAIN_ID=0. I believe there is one daemon created per ROS_DOMAIN_ID, but from your test results, I suspect that it doesn't create a new daemon when ROS_LOCALHOST_ONLY changes.
[Test 2] Using ROS_LOCALHOST_ONLY=1:
What I think is happening:
Same as the first test: the expected result is no inter-machine communication.
[Test 3] Using ROS_LOCALHOST_ONLY=0:
What I think is happening:
I think the intuitive result would be that inter-machine communication happens, but if the ros2 node list command reused the daemon, that would explain why you saw no inter-machine communication.
[Test 4] On top of that, I don't see local traffic when using ROS_LOCALHOST_ONLY=1 with ROS_DOMAIN_ID=0. Unclear why.
I'm not sure what commands were run for this one.
[Test 5] Using ROS_LOCALHOST_ONLY=0: [...] I see cross-talk traffic if ROS_DOMAIN_ID=1
What I think is happening:
The ros2 CLI will definitely make a new daemon here, so inter-machine communication is expected.
Thanks! Yes, that matches what we have experienced. For point (1), this can be concretely reproduced with something like:
bazel-bin/external/ros2/ros2 daemon stop
env ROS_LOCALHOST_ONLY=0 bazel-bin/external/ros2/ros2 daemon start
# then run normal commands, with implied or explicit ROS_LOCALHOST_ONLY=1, and see the discrepancy appear
Two suggestions for improvement:
If it does not exist already, there should be a (or many?) warning(s) about this behavior when mixing invocations; I do not see it presently in docs (link). Basically, ROS_LOCALHOST_ONLY in one bash session could mean nothing depending on how the daemon was launched.
Ideally, there should be a quick hash based on networking for the daemon; clients should die or complain loudly if their candidate daemon has a different hash. More ideally, there should be a verbose indication of what the differences are (e.g. "These key environment variables are different [...]")
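To make the hash idea concrete, here is a minimal Python sketch. It is not how ros2cli works today; the function names and the list of variables are assumptions for illustration. The thread only names ROS_DOMAIN_ID and ROS_LOCALHOST_ONLY; RMW_IMPLEMENTATION and CYCLONEDDS_URI are included as plausible extras.

```python
import hashlib
import json

# Assumption: the variables that influence discovery. Only the first two are
# discussed in this thread; the others are illustrative additions.
NETWORK_KEYS = (
    "ROS_DOMAIN_ID",
    "ROS_LOCALHOST_ONLY",
    "RMW_IMPLEMENTATION",
    "CYCLONEDDS_URI",
)


def network_settings(env):
    """Pick the networking-related variables out of an environment mapping."""
    return {key: env.get(key) for key in NETWORK_KEYS}


def network_fingerprint(env):
    """Return a short, stable hash of the networking-related variables."""
    payload = json.dumps(network_settings(env), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def describe_differences(daemon_env, client_env):
    """List the variables whose values differ between daemon and client."""
    daemon = network_settings(daemon_env)
    client = network_settings(client_env)
    return [
        f"{key}: daemon={daemon[key]!r}, client={client[key]!r}"
        for key in NETWORK_KEYS
        if daemon[key] != client[key]
    ]
```

A client could compare its own fingerprint against the one the daemon recorded at startup, and print describe_differences(...) when they disagree. Note that this sketch treats an unset variable and an explicit 0 as different, which may be stricter than desired.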
Ah, ~one~ three additional suggestions:
Add a ros2 --no-daemon [...] option for additional debugging (beyond stopping / restarting / hash checks), with documentation stating that it will be slower
[...] ros2 daemon to output of ros2 doctor --report
[...] ros2 daemon status
\cc @cottsay
Add a ros2 --no-daemon [...] option for additional debugging (beyond stopping / restarting / hash checks), with documentation stating that it will be slower
This one exists, but in a different place. Commands that use the daemon offer a --no-daemon option and a --spin-time option, which says how many seconds to wait for discovery:
ros2 node list --no-daemon --spin-time 1
$ ros2 node list --help
[...]
--spin-time SPIN_TIME
Spin time in seconds to wait for discovery (only applies when not using an already running daemon)
[...]
--no-daemon Do not spawn nor use an already running daemon
Gotcha! Is it easy to tell which commands need it? (and how many?)
A (dumb?) suggestion is to hoist the daemon arguments to top-level, even if unused; then users can easily know they're disabling it with an alias / wrapper for ros2.
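Until something like that exists, a wrapper could approximate it. The following is a rough Python sketch, assuming a hand-maintained table of subcommands that accept --no-daemon. The DAEMON_VERBS set is a placeholder, not an authoritative list; check each command's --help before relying on it.

```python
import os
import sys

# Assumption: placeholder set of (command, verb) pairs that accept --no-daemon.
# Verify each entry against `ros2 <command> <verb> --help`.
DAEMON_VERBS = {("node", "list"), ("topic", "list")}


def add_no_daemon(args):
    """Append --no-daemon when the subcommand is in the table and lacks it."""
    args = list(args)
    if tuple(args[:2]) in DAEMON_VERBS and "--no-daemon" not in args:
        args.append("--no-daemon")
    return args


def main():
    """Replace this process with ros2, passing the adjusted arguments."""
    os.execvp("ros2", ["ros2", *add_no_daemon(sys.argv[1:])])
```

Calling main() from a small script on PATH would make, for example, `node list` run as `ros2 node list --no-daemon`, while leaving commands outside the table untouched.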
Side note: Does our rmw_isolation still correctly isolate, even if a daemon is invoked?
Concretely, we do things to DDS that are not expressible by ROS_DOMAIN_ID, but the daemon seems to get located by the domain itself?
Related to #99, it may be good for users of Ubuntu (and other systems of similar config?) to use the startup script as Shane illustrated in https://github.com/eclipse-cyclonedds/cyclonedds/issues/1400
I can reproduce Shane's results from #98 (but replacing bazel run //ros2_example_apps:oracle_cc with bazel-bin/ros2_example_apps/oracle_cc). However, I have run into odd edge cases.
First, I confirm that using ros2 multicast send and ros2 multicast receive indicates a working UDP multicast route between two machines in both directions (see #104 for debugging). I also confirm that I have no ROS variables set in my nominal environment (env | grep ROS is empty).
To build, I use 8c20da4, and run:
For checking crosstalk, I see if ros2 node list indicates that /oracle is present, as Shane did.
By default, I do not see cross-talk traffic:
Sender: bazel-bin/ros2_example_apps/oracle_cc
Receiver: bazel-bin/external/ros2/ros2 node list
Using ROS_LOCALHOST_ONLY=1: [...]
Using ROS_LOCALHOST_ONLY=0:
[...] ROS_DOMAIN_ID is unset or set to ROS_DOMAIN_ID=0. Concretely:
Sender: env ROS_DOMAIN_ID=0 ROS_LOCALHOST_ONLY=0 bazel-bin/ros2_example_apps/oracle_cc
Receiver: env ROS_DOMAIN_ID=0 ROS_LOCALHOST_ONLY=0 bazel-bin/external/ros2/ros2 node list
-and-
Sender: env ROS_LOCALHOST_ONLY=0 bazel-bin/ros2_example_apps/oracle_cc
Receiver: env ROS_LOCALHOST_ONLY=0 bazel-bin/external/ros2/ros2 node list
On top of that, I don't see local traffic when using ROS_LOCALHOST_ONLY=1 with ROS_DOMAIN_ID=0. Unclear why.
I see cross-talk traffic if ROS_DOMAIN_ID=1:
Sender: env ROS_DOMAIN_ID=1 ROS_LOCALHOST_ONLY=0 bazel-bin/ros2_example_apps/oracle_cc
Receiver: env ROS_DOMAIN_ID=1 ROS_LOCALHOST_ONLY=0 bazel-bin/external/ros2/ros2 node list
[...] ROS_LOCALHOST_ONLY=1:
Sender: env ROS_DOMAIN_ID=1 ROS_LOCALHOST_ONLY=0 bazel-bin/ros2_example_apps/oracle_cc
Receiver: env ROS_DOMAIN_ID=1 ROS_LOCALHOST_ONLY=1 bazel-bin/external/ros2/ros2 node list
Concerns
Why does ROS_DOMAIN_ID=0 not work, locally or remotely, for ROS_LOCALHOST_ONLY=0? This contrasts with documentation that seems to indicate ROS_DOMAIN_ID=0 is valid: https://docs.ros.org/en/humble/Concepts/About-Domain-ID.html#choosing-a-domain-id-short-version
[...] ROS_LOCALHOST_ONLY=1?
I don't believe we need to solve these now, but we should solve them ideally within the next month.
This is painful to test manually, so I'd recommend we use Python + subprocess + ssh to automate the following checks:
sender={local,remote}, receiver={local,remote}
ROS_DOMAIN_ID={unset,0,1}, ROS_LOCALHOST_ONLY={unset,0,1}
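As a starting point, the matrix could be sketched roughly like this. The binary paths come from the commands above; REMOTE_HOST and REMOTE_WORKSPACE are hypothetical placeholders, and the sketch assumes the remote machine has the same build at that path and that no ROS variables are set in the base environment on either side.

```python
import itertools
import shlex
import subprocess
import time

SENDER = "bazel-bin/ros2_example_apps/oracle_cc"
RECEIVER = "bazel-bin/external/ros2/ros2 node list"
REMOTE_HOST = "remote-machine"          # hypothetical ssh host
REMOTE_WORKSPACE = "/path/to/workspace"  # hypothetical checkout on that host

LOCATIONS = ("local", "remote")
VALUES = (None, "0", "1")  # None means "leave the variable unset"


def build_command(command, domain_id, localhost_only, location):
    """Return an argv list that runs `command` with the given settings."""
    assignments = []
    if domain_id is not None:
        assignments.append(f"ROS_DOMAIN_ID={domain_id}")
    if localhost_only is not None:
        assignments.append(f"ROS_LOCALHOST_ONLY={localhost_only}")
    argv = shlex.split(command)
    if assignments:
        argv = ["env", *assignments, *argv]
    if location == "remote":
        # ssh hands one command string to the remote shell.
        remote = f"cd {shlex.quote(REMOTE_WORKSPACE)} && {shlex.join(argv)}"
        argv = ["ssh", REMOTE_HOST, remote]
    return argv


def iter_cases():
    """Yield one case per combination of locations and variable values."""
    for sender_loc, receiver_loc, domain_id, localhost_only in itertools.product(
        LOCATIONS, LOCATIONS, VALUES, VALUES
    ):
        yield {
            "sender_location": sender_loc,
            "receiver_location": receiver_loc,
            "ROS_DOMAIN_ID": domain_id,
            "ROS_LOCALHOST_ONLY": localhost_only,
            "sender": build_command(SENDER, domain_id, localhost_only, sender_loc),
            "receiver": build_command(RECEIVER, domain_id, localhost_only, receiver_loc),
        }


def crosstalk_observed(case, settle_seconds=2.0):
    """Start the sender, run the receiver, and report whether /oracle shows up."""
    sender = subprocess.Popen(case["sender"])
    try:
        time.sleep(settle_seconds)  # give discovery a moment; value is a guess
        result = subprocess.run(
            case["receiver"], capture_output=True, text=True, timeout=30
        )
        return "/oracle" in result.stdout.split()
    finally:
        # Caveat: terminating a local ssh client may not stop the remote process.
        sender.terminate()
        sender.wait()
```

Iterating over iter_cases() and printing each case alongside crosstalk_observed(case) would give the full 36-row table. Given the daemon reuse discussed in the replies, it is probably worth running ros2 daemon stop on the receiver side before each case so a stale daemon does not skew the results.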
FYI @sloretz @IanTheEngineer