Closed RPRKLR closed 1 month ago
This sounds similar to https://github.com/FlexBE/flexbe_app/issues/100 that you reported, so you are seeing the issue with both app and webui.
Clearly unacceptable, but I can't think of FlexBE reason for this.
I suspect we have issue with DDS discovery as it is bringing the messages online.
Can I ask you to do two things.
1) Try the DDS discovery server https://fast-dds.docs.eprosima.com/en/latest/fastdds/ros2/discovery_server/ros2_discovery_server.html
IIUC, this basically reverts us back to ROS 1 mode with a centralized roscore. This is probably not the way you want to run, but just to help us debug.
2) Try a different DDS implementation. https://docs.ros.org/en/iron/Installation/DDS-Implementations.html. Again, there may be good reasons for you not to switch your system over, but just to eliminate some issues during testing.
Questions:
0) What DDS setup are you using, and what is your computing and networking setup (wired vs. wireless, number of machines, local network in lab or part of larger building)?
1) with the image above, is that the point it hangs? Can you then capture an image showing the messages after the timeout is elapsed and it continues? I'd like to get a better understanding of exactly were it is hanging up.
2) While this delay is going on, can you try doing some other ros2 related DDS task like "ros2 topic echo" of some unrelated topic? (Maybe, but not necessarily, that would point to DDS issue).
Separate from the timing issues, there is a UI issue
Observe that the WebUI indicates that it starts the behavior, but then jumps back to the starting screen, allowing the start execution button to be pressed again.
I am noting that and will look into that while I wait on feedback from networking test.
We were running the standard DDS setup which comes with ROS2
Try the DDS discovery server
After setting up the DDS server, the startup time was the same, if not more. I have also noticed another issue, that flexbe was complaining in the terminal, the message was the following:
[WARN] Topic test for state None not yet available.
it does not do it for every topic just a few. I have also tried it with and without super client, because I was suspecting that maybe flexbe can not see the topics. Both with and without super client I got the same warning messages.
Try a different DDS implementation.
I'm yet to try out other type of DDS, I will try to set up a test for you as soon as I have the time.
What DDS setup are you using, and what is your computing and networking setup (wired vs. wireless, number of machines, local network in lab or part of larger building)?
The default Fastrtps, network is wireless, maximum 4 machines, in a lab.
with the image above, is that the point it hangs? Can you then capture an image showing the messages after the timeout is elapsed and it continues? I'd like to get a better understanding of exactly were it is hanging up.
It does not actually hang for a long time, when starting the behavior, it takes around 40 seconds until I get some kind of a terminal message, and then it is not actually freezing, I am constantly getting the messages for every state which is being created. It just takes a lot of time, until it finishes the building of the behavior.
While this delay is going on, can you try doing some other ros2 related DDS task like "ros2 topic echo" of some unrelated topic? (Maybe, but not necessarily, that would point to DDS issue).
I tried to do topic list and echo, and they are working while it is building the behavior.
I have one question too, when it finishes the building of the behavior/state is it normal if I am getting a message like 1450 entries and 1 collision? What does the collision mean?
last question first. The system uses a "state_id" assigned by hashing the path to state in the behavior. There is a simple method for resolving collisions, and I keep track of the number of collisions that are processed as a debugging message. So the message you are referring to just say it processed 1450 states and only had to resolve one collision. If it fails to resolve the collision it throws an error, so the message you are seeing is perfectly normal. I'll consider rewording this message to make that clearer.
At this point, I've always logged the messages to console as each state is processed. With 1450 states, the IO does add a non-trival amount of processing time. I'll look into making this optional, or just eliminate. But that's still on the order of seconds of savings, not minutes. With regard to the "Service not available" timeouts. Normally, I write states that use 0 wait timeout in "on_start", and then have a wait for connection in on_enter to avoid these start up delays. If you are establishing service/action connections in on_start and assigning a wait time there it could account for some delay.
I am also having another issue while using FastDDS Discovery server, I have a topic with these QoS settings:
self._qos = QoSProfile(depth = 1,
history = QoSHistoryPolicy.KEEP_LAST,
reliability = QoSReliabilityPolicy.RELIABLE,
durability = QoSDurabilityPolicy.TRANSIENT_LOCAL)
I can echo the topic and I am getting back the value, but when I create a state which is subscribing to the topic, I am getting a warning message which says:
Topic /test_topic/test for state None not yet available.
I can see the topic, and I can echo it, and both the subscriber and the publisher are having the same QoS settings.
I have made some discoveries since my last comment, the issue is that if I am using the FastDDS Discovery Server, I am creating a topic type and name check with the help of
self._node.get_topic_names_and_types()
But it only returns the flexbe related topics like /flexbe/hearthbeat
and /flexbe/log
and the /rosout
. I also tried to get the topics with the super client, but it still does not work. For now I don't know if it is just an issue while using flexbe or is it actually a ros2 problem.
What is your network configuration? One local machine, or are you running get_topic_names_and_types
on a different computer?
What is your network configuration? One local machine, or are you running
get_topic_names_and_types
on a different computer?
First I tried to run the onboard on the robot and ran the behavior on another computer, and then I tried to run both the onboard and the behavior on the robot, I got the same results.
I communicate with the robot with WiFi, and I am using a SUPER_CLIENT
, so I can list and echo the topics.
I have opened an issue and got a temporary solution, which is to spin the node one more time before calling the function. With the basic node it works but I have not had the chance yet, to try it out with flexbe, that is what I am going to do now.
I have tried to run the flexbe code by spinning one more time before running the function. It breaks the flexbe, and it deadlocked after the first state.
Good news is that I read the FastDDS documentation one more time, and implemented the SUPER_CLIENT
in a different way, and now I don't have to spin the node one more time inside the flexbe code, on other nodes where I don't use flexbe, I have to spin it. After setting up the SUPER_CLIENT
the startup of the big flexbe behavior is faster, now it is 2 mins and 20 seconds. It is faster but still not fast enough. This was tested with behavior engine version 3.0. Today I am planning to also test it with the version 4.0
After updating the flexbe behavior engine to 4.0.0, the big behavior startup time decreased a lot, good job on the new release. Now if I start the onboard engine, and then start the ros2 run flexbe_widget be_launcher -b BigBehavior
, it starts in 1 min and 12 seconds without using the FastDDS Discovery Server. With the Discovery Server it starts in 48 seconds.
It's still a lot compared to ROS1 startup time, but it is a great improvement. Good job!
A suggestion for anyone seeing this later
__init__
or on_start
with a 0.0 timeout, then check status in on_enter
with some timeout.
When using FlexBE with ROS1, the execution of a large behavior (with nested concurrency containers and numerous behaviors) starts within seconds, typically within 5 seconds. However, with ROS2 Iron and the Flexbe WebUI, the same behavior takes more than 5 minutes to start.
Steps to Reproduce:
Expected Behavior:
The behavior execution should start within seconds, similar to the performance observed with ROS1.
Actual Behavior:
The behavior execution takes more than 5 minutes to start, with the WebUI incorrectly indicating that the behavior can be started again, and significant delays in logger messages and state building.
Picture of the high amount of building state entries
Environment:
ROS Version: ROS2 Iron FlexBE Engine Version: 3.0.4 WebUI Version: beta Operating System: Ubuntu 22.04